r/LocalLLaMA · 23h ago · 7
inference · optimization · cuda · open source · benchmark
Discussion of a CUDA kernel implementing the Fast Walsh-Hadamard Transform (FWHT) for a quantized KV cache in LLM inference, with performance benchmarks across different model architectures and head sizes. The post shows practical optimization work aimed at inference speed-ups when the KV cache uses q8_0 quantization, measured on different GPU hardware (RTX 5090, CDNA).
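For anyone who hasn't run into the transform before, here is a minimal CPU reference sketch of an FWHT in plain Python. It is not the kernel from the post: the function names `fwht` and `fwht_orthonormal` are made up for illustration, and it assumes the transform runs over a power-of-two vector length (for example a head dimension such as 64 or 128, which is my assumption rather than something the post states). A GPU kernel would typically run the same butterfly stages in parallel.

```python
import math


def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform (natural/Hadamard order).

    Hypothetical reference helper; requires a power-of-two length.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    # log2(n) butterfly stages; each stage pairs elements that are h apart.
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


def fwht_orthonormal(values):
    """FWHT scaled by 1/sqrt(n), which makes the transform its own inverse."""
    scale = 1.0 / math.sqrt(len(values))
    return [v * scale for v in fwht(values)]


if __name__ == "__main__":
    x = [1, 0, 1, 0, 0, 1, 1, 0]
    print(fwht(x))
```

The unnormalized version costs n·log2(n) additions and subtractions with no multiplies. With the 1/sqrt(n) scaling, applying the transform twice returns the input up to floating-point error, so the same routine can serve as forward and inverse.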