* add taesd implementation
* taesd gpu offloading
* show seed when generating image with -s -1
* less restrictive with larger images
* cuda: im2col speedup x2
* cuda: group norm speedup x90
* quantized models now work in cuda :)
* fix mem size calculation
---------
Co-authored-by: leejet <leejet714@gmail.com>