News Nug

KVarN: Variance-Normalized KV-Cache Quantization [R]

r/MachineLearning · 8d ago · 9 · inference optimization open source benchmark research

KVarN is a KV-cache quantization method that combines Hadamard rotations with variance normalization, reaching 3-4x compression with minimal accuracy loss on demanding benchmarks such as AIME24. It ships with a vLLM implementation and reports measured speedups over fp16 baselines, which makes it directly usable for reasoning and code-generation inference workloads.
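
To make the "rotate, normalize, quantize" idea concrete, here is a minimal sketch of the general technique. It is not the KVarN algorithm: the function names, the 4-bit uniform grid, the clip value of 3.0, and the use of RMS as a stand-in for the standard deviation are all assumptions chosen for illustration. The example vector is a contrived single-outlier case, where spreading the outlier across all coordinates helps most.

```python
import math

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    H = [[1]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

def rotate(x):
    """Apply the orthonormal rotation H / sqrt(n). It is symmetric, so it is its own inverse."""
    n = len(x)
    scale = 1.0 / math.sqrt(n)
    return [scale * sum(h * v for h, v in zip(row, x)) for row in hadamard(n)]

def quantize(x, bits=4, clip=3.0):
    """Normalize to unit RMS (assumed stand-in for variance normalization),
    then round onto a fixed uniform grid over [-clip, clip]."""
    sigma = math.sqrt(sum(v * v for v in x) / len(x)) or 1.0
    step = 2.0 * clip / (2 ** bits - 1)
    codes = [round((min(max(v / sigma, -clip), clip) + clip) / step) for v in x]
    return codes, sigma, step

def dequantize(codes, sigma, step, clip=3.0):
    """Map integer codes back to floats and undo the normalization."""
    return [(c * step - clip) * sigma for c in codes]

def roundtrip_error(x, use_rotation):
    """L2 error after quantize -> dequantize, optionally inside a Hadamard rotation."""
    y = rotate(x) if use_rotation else list(x)
    restored = dequantize(*quantize(y))
    if use_rotation:
        restored = rotate(restored)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, restored)))

x = [8.0] + [0.0] * 7  # contrived: one large outlier, everything else zero
print("error without rotation:", roundtrip_error(x, use_rotation=False))
print("error with rotation:   ", roundtrip_error(x, use_rotation=True))
```

On this input the rotated path reconstructs the vector almost exactly, while the unrotated path spends its grid on the outlier and misplaces the zeros. Real activations are noisier, so the gap will be smaller in practice.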

MiniMax dropped a new attention architecture. [N]

r/MachineLearning · 10d ago · 8 · new model inference optimization benchmark

MiniMax introduces Sparse Attention (MSA), which supports 1M-token context windows and runs 4× faster than Flash-Sparse-Attention by restructuring KV-Q computation around hardware-optimized memory access patterns. The reported gains are 9× prefill and 15× decode speedups, with per-token compute cut to 1/20th of previous levels, enabling sustained long-horizon agent execution with native multimodality and coding capabilities.

StepFun 3.5 MTP by pwilkin · Pull Request #23274 · ggml-org/llama.cpp

r/LocalLLaMA · 10d ago · 7 · inference open source optimization

Technical discussion of the MTP (Multi-Token Prediction) implementation for the StepFun 3.5 model in llama.cpp, covering architecture differences, optimization tweaks (top-k tuning raised acceptance rates from 0.6 to 0.9), and bug fixes for KV-cache handling across multiple MTP layers. Reported throughput is 18 tokens/sec versus 15 tokens/sec in CPU MoE testing.

llama: limit max outputs of `llama_context` by am17an · Pull Request #23861 · ggml-org/llama.cpp

r/LocalLLaMA · 11d ago · 7 · inference optimization open source

GitHub PR discussion about reducing VRAM usage in llama.cpp by reserving logits space only for n_seqs when possible, which saves 1.2GB+ on typical configs with draft decoding. The change targets the `llama_context` API and includes benchmarks on RTX 3080/5070 Ti showing persistent memory headroom improvements with no impact on inference quality.
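
A quick back-of-the-envelope calculation shows why the logits buffer is worth shrinking. This is a sketch under stated assumptions, not the PR's own accounting: it assumes one fp32 logit per vocabulary entry per reserved output, and the vocabulary and batch sizes are hypothetical values picked for illustration.

```python
def logits_buffer_bytes(n_outputs, n_vocab, bytes_per_logit=4):
    """Bytes needed to hold n_outputs rows of n_vocab logits (fp32 assumed)."""
    return n_outputs * n_vocab * bytes_per_logit

N_VOCAB = 151_936  # hypothetical vocabulary size
N_BATCH = 2_048    # hypothetical: one output row reserved per batch position
N_SEQS = 1         # hypothetical: one output row reserved per sequence

full = logits_buffer_bytes(N_BATCH, N_VOCAB)
limited = logits_buffer_bytes(N_SEQS, N_VOCAB)

print(f"reserved per batch position: {full / 1e9:.2f} GB")
print(f"reserved per sequence:       {limited / 1e6:.2f} MB")
print(f"difference:                  {(full - limited) / 1e9:.2f} GB")
```

With these illustrative numbers the difference lands in the same range as the saving quoted in the thread, but the actual configurations benchmarked in the PR may differ.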

CUDA: add fast walsh-hadamard transform by am17an · Pull Request #23615 · ggml-org/llama.cpp

r/LocalLLaMA · 18d ago · 7 · inference optimization cuda open source benchmark

Discussion of a CUDA kernel implementing the FWHT (Fast Walsh-Hadamard Transform) for quantized KV cache in LLM inference, with performance benchmarks across model architectures and head sizes. The thread shows practical work on inference speedups with q8_0 quantization on different GPU architectures (RTX 5090, CDNA).
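
For readers unfamiliar with the transform, here is a minimal CPU reference of the standard FWHT butterfly in plain Python. It only illustrates the algorithm that such a kernel computes; it says nothing about how the PR's CUDA kernel is organized, and the function name is an assumption.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform in O(n log n).

    The input length must be a power of two. Returns a new list.
    """
    x = list(values)
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Combine pairs that are h apart: (a, b) -> (a + b, a - b).
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

print(fwht([1, 0, 1, 0, 0, 1, 1, 0]))  # -> [4, 2, 0, -2, 0, 2, 0, 2]
```

Applying the unnormalized transform twice returns the input scaled by n, so dividing by sqrt(n) on each pass gives an orthonormal rotation that is its own inverse.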

[WIP] Gemma 4 MTP

r/LocalLLaMA · 23d ago · 7 · inference optimization open source

Pull request discussion on implementing MTP (Multi-Token Prediction) speculative decoding for Gemma 4 models in llama.cpp, achieving >2x speedup on dense models, with caveats around hardware compatibility. The thread documents real-world performance testing across GPU setups, with results that vary by hardware configuration, and notes current limitations such as broken multi-GPU support and incompatibility with quantized KV cache.
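
The core of speculative decoding is the verification step: the draft (here, the MTP head) proposes several tokens and the main model checks them in one pass. The sketch below shows the common greedy variant, assuming the main model's greedy choice at every draft position is already available. The function name and token values are hypothetical, and llama.cpp's actual acceptance logic may differ.

```python
def verify_draft(draft_tokens, target_tokens):
    """Greedy verification for speculative decoding.

    draft_tokens:  k tokens proposed by the draft head.
    target_tokens: k + 1 tokens, where target_tokens[i] is the main model's
                   greedy choice after the context plus draft_tokens[:i].
    Returns (emitted_tokens, n_accepted).
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have one more entry than draft_tokens")
    n_accepted = 0
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            break
        n_accepted += 1
    # The main model always contributes one token: a correction at the first
    # mismatch, or a bonus token when every drafted token was accepted.
    emitted = list(draft_tokens[:n_accepted]) + [target_tokens[n_accepted]]
    return emitted, n_accepted

print(verify_draft([5, 7, 9], [5, 7, 2, 4]))  # -> ([5, 7, 2], 2)
print(verify_draft([5, 7, 9], [5, 7, 9, 4]))  # -> ([5, 7, 9, 4], 3)
```

Because the output always matches what the main model would have produced greedily, this scheme changes speed but not the generated text.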

llama: avoid copying logits during prompt decode in MTP by am17an · Pull Request #23198 · ggml-org/llama.cpp

r/LocalLLaMA · 26d ago · 7 · inference optimization open source

A performance optimization for Multi-Token Prediction (MTP) in llama.cpp that reduces memory traffic during prompt processing by avoiding unnecessary logit copies, improving throughput by ~20-50% on various hardware (RTX 5090, MI50). This is a practical inference optimization for models that use MTP, relevant to engineers optimizing LLM serving.

A First Comprehensive Study of TurboQuant: Accuracy and Performance

r/LocalLLaMA · 29d ago · 8 · inference benchmark research optimization

TurboQuant is a KV-cache quantization method that stores the cache at 3-4 bits and dequantizes to BF16 for attention computation, offering significant GPU memory savings. This benchmark study evaluates TurboQuant variants against FP8 baselines across four large models (30B-200B+) and realistic workloads, providing practical guidance on the tradeoff between inference performance and memory efficiency.
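
The memory saving comes from the storage side: once values are reduced to small integer codes, several codes share one byte. The sketch below shows a generic way to pack and unpack 4-bit codes. It is not TurboQuant's actual layout or codebook, which the summary does not describe; the function names and the low-nibble-first ordering are assumptions.

```python
def pack_4bit(codes):
    """Pack integer codes in [0, 15] two per byte, low nibble first."""
    if any(not 0 <= c <= 15 for c in codes):
        raise ValueError("codes must fit in 4 bits")
    padded = list(codes) + [0] * (len(codes) % 2)  # pad odd lengths with a zero code
    return bytes(lo | (hi << 4) for lo, hi in zip(padded[0::2], padded[1::2]))

def unpack_4bit(packed, count):
    """Recover the first `count` 4-bit codes from packed bytes."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return codes[:count]

codes = [0, 15, 7, 8, 3]
packed = pack_4bit(codes)
print(len(codes), "codes stored in", len(packed), "bytes")
print(unpack_4bit(packed, len(codes)) == codes)
```

At read time the unpacked codes would be mapped back to higher-precision values before the attention computation, which is the dequantization step the summary mentions.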

Llama.cpp MTP support now in beta!

r/LocalLLaMA · 39d ago · 8 · inference optimization open source tool

A pull request implementing Multi-Token Prediction (MTP) head support in llama.cpp, enabling speculative decoding with ~2.5x speedup and 75% token acceptance rates on Qwen3.6 models. The implementation optimizes host-device data transfers and is designed to work with any MTP-capable model; working examples and performance benchmarks are provided.
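
To relate an acceptance rate to throughput, a common simplification treats each drafted token as accepted independently with the same probability. Under that assumption the expected number of tokens emitted per verification pass is a geometric sum. The sketch below uses the 75% figure from the summary; the draft length of 3 is a hypothetical choice, and the independence assumption is mine, not the PR's.

```python
def expected_tokens_per_step(acceptance, n_draft):
    """Expected tokens emitted per verification pass.

    Assumes each drafted token is accepted independently with probability
    `acceptance`, and that the main model always adds one token of its own:
    1 + a + a^2 + ... + a^n_draft.
    """
    if acceptance == 1.0:
        return float(n_draft + 1)
    return (1 - acceptance ** (n_draft + 1)) / (1 - acceptance)

# 0.75 comes from the summary; n_draft = 3 is a hypothetical draft length.
print(expected_tokens_per_step(0.75, 3))  # -> 2.734375
```

Tokens per pass is an upper bound on speedup rather than the speedup itself, since drafting and verification both cost time.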

Visualizing Loss Landscapes of Neural Networks [P]

r/MachineLearning · 45d ago · 7 · tool visualization optimization

Interactive browser-based tool for visualizing neural network loss landscapes using the dimensionality reduction techniques of Li et al. (NeurIPS 2018). Users can experiment with different architectures (MLPs to ResNet-8) and optimizers to see how they navigate high-dimensional optimization spaces. The tool builds practical intuition for local minima geometry and optimizer behavior, while acknowledging that 2D/3D projections cannot fully represent true high-dimensional surfaces.
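
The basic recipe behind such plots is to pick two random directions in parameter space and evaluate the loss on a grid around the trained weights. The sketch below shows that idea on a toy quadratic loss. It is a simplification, not the tool's code: directions are rescaled to the norm of the whole parameter vector rather than per filter as in Li et al., and all names and the toy loss are hypothetical.

```python
import math
import random

def random_direction(theta, rng):
    """Gaussian direction rescaled to the norm of theta
    (a whole-vector simplification of filter normalization)."""
    d = [rng.gauss(0.0, 1.0) for _ in theta]
    d_norm = math.sqrt(sum(v * v for v in d))
    t_norm = math.sqrt(sum(v * v for v in theta))
    return [v * t_norm / d_norm for v in d]

def loss_surface(loss_fn, theta, steps=5, span=1.0, seed=0):
    """Evaluate loss_fn on a steps x steps grid of theta + a*d1 + b*d2."""
    rng = random.Random(seed)
    d1 = random_direction(theta, rng)
    d2 = random_direction(theta, rng)
    coords = [span * (2 * i / (steps - 1) - 1) for i in range(steps)]
    grid = [
        [loss_fn([t + a * u + b * v for t, u, v in zip(theta, d1, d2)]) for b in coords]
        for a in coords
    ]
    return coords, grid

def toy_loss(params):
    """Hypothetical loss with its minimum at all-ones parameters."""
    return sum((p - 1.0) ** 2 for p in params)

coords, grid = loss_surface(toy_loss, [1.0, 1.0, 1.0])
for row in grid:
    print(" ".join(f"{value:6.2f}" for value in row))
```

The center cell corresponds to the unperturbed weights, so for a trained model it shows the loss at the found minimum and the surrounding cells show how quickly the loss rises along the two sampled directions.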