* add flux support
* avoid build failures in non-CUDA environments
* fix schnell support
* add k-quants support
* add support for applying LoRA to quantized tensors (see the first sketch below)
* add in-place conversion support for f8_e4m3 (#359),
in the same way it is done for bf16:
just as bf16 converts losslessly to fp32,
f8_e4m3 converts losslessly to fp16 (see the second sketch below)
* add support for XLabs FLUX LoRAs in ComfyUI-converted format
* update docs
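
Applying LoRA to quantized tensors boils down to dequantizing a block, adding the scaled low-rank delta, and requantizing. A minimal sketch of that round trip, using a simplified 8-bit block format and hypothetical helper names rather than the project's actual quant layouts:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified 8-bit block quantization (one float scale + 32 int8 values per
// block). This is NOT the project's real quant format; it is just enough to
// show the dequantize -> add LoRA delta -> requantize round trip.
constexpr size_t QK = 32;
struct Block { float scale; int8_t q[QK]; };

static void dequantize(const Block& b, float* out) {
    for (size_t i = 0; i < QK; ++i) out[i] = b.scale * b.q[i];
}

static void quantize(const float* in, Block& b) {
    float amax = 0.0f;
    for (size_t i = 0; i < QK; ++i) amax = std::max(amax, std::fabs(in[i]));
    b.scale = amax / 127.0f;
    const float inv = b.scale != 0.0f ? 1.0f / b.scale : 0.0f;
    for (size_t i = 0; i < QK; ++i) b.q[i] = (int8_t)std::lround(in[i] * inv);
}

// Apply W' = W + lora_scale * delta, where delta is the precomputed low-rank
// product (B @ A) laid out contiguously, one value per weight element.
void apply_lora(std::vector<Block>& qweight, const std::vector<float>& delta,
                float lora_scale) {
    float tmp[QK];
    for (size_t b = 0; b < qweight.size(); ++b) {
        dequantize(qweight[b], tmp);
        for (size_t i = 0; i < QK; ++i) tmp[i] += lora_scale * delta[b * QK + i];
        quantize(tmp, qweight[b]);
    }
}
```

Requantizing is lossy for the touched blocks, which is the inherent cost of baking a LoRA into an already-quantized weight.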
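
For the f8_e4m3 path, a sketch of the idea: rebias the 4-bit exponent and widen the 3-bit mantissa so every finite e4m3 value lands exactly on an fp16 value, then walk the buffer from the last element backwards so the 1-byte inputs are never overwritten before they are read (the same trick the bf16 -> fp32 in-place path uses). Function names here are illustrative, not the project's actual ones:

```cpp
#include <cstddef>
#include <cstdint>

// Convert one f8_e4m3 value (1 sign, 4 exponent, 3 mantissa bits, bias 7)
// to IEEE fp16 bits (1/5/10, bias 15). Every finite e4m3 value is exactly
// representable in fp16, so the conversion is lossless.
static uint16_t f8_e4m3_to_f16_bits(uint8_t v) {
    uint16_t sign = (uint16_t)(v & 0x80) << 8;       // move sign to bit 15
    uint8_t  exp  = (v >> 3) & 0x0F;
    uint8_t  man  = v & 0x07;

    if (exp == 0x0F && man == 0x07) {                // e4m3 NaN encoding
        return sign | 0x7E00;
    }
    if (exp == 0) {
        if (man == 0) return sign;                   // +/- 0
        // e4m3 subnormal: value = man * 2^-9; renormalize for fp16
        int shift = 0;
        while (!(man & 0x08)) { man <<= 1; ++shift; }  // find leading 1
        man &= 0x07;                                   // drop the implicit 1
        uint16_t e16 = (uint16_t)(15 - 6 - shift);     // rebiased exponent
        return sign | (e16 << 10) | ((uint16_t)man << 7);
    }
    // normal: rebias exponent (7 -> 15) and widen mantissa (3 -> 10 bits)
    uint16_t e16 = (uint16_t)(exp - 7 + 15);
    return sign | (e16 << 10) | ((uint16_t)man << 7);
}

// In-place widening: the buffer is sized for n fp16 values (2 bytes each)
// with the n f8 bytes packed at the front. Converting backwards never
// clobbers an f8 byte that has not been read yet.
static void f8_e4m3_to_f16_inplace(void* data, size_t n) {
    const uint8_t* src = (const uint8_t*)data;
    uint16_t*      dst = (uint16_t*)data;
    for (size_t i = n; i-- > 0; ) {
        dst[i] = f8_e4m3_to_f16_bits(src[i]);
    }
}
```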
---------
Co-authored-by: Erik Scholz <Green-Sky@users.noreply.github.com>