News Nug

VultronRetriever family of models released on HuggingFace![R]

r/MachineLearning · 2h ago · 8 · new model open source embedding inference deployment benchmark

Vultr released the VultronRetriever family of open-source embedding models, ranked #1 on the MTEB leaderboard, in three size variants (8B Prime, 4.5B Core, 0.8B Flash) optimized for inference efficiency and edge deployment, including offline execution on iPhone. The models show significant improvements in speed, storage footprint, and performance per parameter, and the novel Hydra Architecture enables late interaction retrieval at reduced memory cost.
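
For readers unfamiliar with late interaction retrieval, here is a minimal sketch of the generic ColBERT-style MaxSim score: each query token vector is matched to its most similar document token vector, and the matches are summed. This illustrates the general technique only. The summary does not describe the Hydra Architecture, and the names `normalize` and `maxsim_score` are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for each query token, take the best cosine
    similarity against any document token, then sum over query tokens."""
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )


# Toy example: two query tokens, two candidate documents.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0]]   # has a close match for each query token
doc_b = [[1.0, 1.0]]               # one token, partially matching both
ranked = sorted(
    [("doc_a", doc_a), ("doc_b", doc_b)],
    key=lambda item: maxsim_score(query, item[1]),
    reverse=True,
)
```

Because every document token vector has to be stored, the index size of this scheme grows with document length, which is why memory cost is the usual concern with late interaction models.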

Nostalgia for Bloom

r/LocalLLaMA · 19h ago · 7 · tool tutorial prompt engineering inference

Guide to using BLOOMZ-P3, a multilingual instruction-following model fine-tuned on 46 languages, with practical examples showing how to deploy it via Transformers, vLLM, SGLang, and Docker. Includes prompt engineering best practices for optimizing zero-shot task performance across languages, emphasizing clear prompt structure and contextual framing.

tencent/HiLS-Attention-7B · Hugging Face

r/LocalLLaMA · 1d ago · 8 · new model inference open source tool

HiLS-Attention-7B is a model built on a new sparse attention mechanism that enables efficient long-context modeling by learning chunk selection end-to-end, achieving strong extrapolation beyond 4× the training length while maintaining performance on standard tasks. The model comes with integration guides for Transformers, vLLM, SGLang, and Docker, though it requires custom setup through the official GitHub repository rather than standard AutoModel loading.

Profiling in PyTorch (Part 3): Attention is all you profile

HuggingFace Blog · 1d ago · 8 · tutorial workflow inference

Deep dive into profiling attention mechanisms in PyTorch using the profiler to understand kernel execution, memory operations, and optimization techniques. Part 3 of a series covering naive attention, in-place operations, scaled dot-product attention (SDPA), and custom kernels with practical profiling traces and optimization patterns.

Koboldcpp v1.117 released

r/LocalLLaMA · 1d ago · 6 · deployment open source inference

KoboldCpp release notes covering deployment options across different hardware (NVIDIA, AMD, CPU, Apple Silicon) and API connectivity for running quantized language models locally. Notable breaking change: the --splitmode row option for CUDA has been removed, requiring migration to tensor or layer split approaches.

GLM-5.2 (744B MoE) on a 25GB-RAM consumer machine

r/LocalLLaMA · 1d ago · 8 · tool inference open source deployment

colibrì is a pure C inference engine that runs GLM-5.2 (744B MoE model) on consumer hardware (~25GB RAM) by streaming experts from disk, activating only ~40B parameters per token. The implementation leverages MoE sparsity and disk I/O optimization to achieve frontier-class model inference without GPU dependency, with automatic expert pinning that improves performance over time.
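
The idea of keeping only recently used experts in RAM and streaming the rest from disk can be sketched with a small least-recently-used cache. This is an illustrative stand-in, not colibrì's implementation: the class `ExpertCache`, its `load_fn` callback, and the LRU eviction policy are all assumptions, and the automatic pinning mentioned above is not modeled.

```python
from collections import OrderedDict


class ExpertCache:
    """Hypothetical LRU cache for expert weights streamed from slow storage."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity      # max number of experts held in memory
        self.load_fn = load_fn        # callable that reads one expert from disk
        self.cache = OrderedDict()    # expert_id -> weights, oldest first
        self.loads = 0                # number of disk reads performed

    def get(self, expert_id):
        if expert_id in self.cache:
            # Cache hit: mark as most recently used, no disk access.
            self.cache.move_to_end(expert_id)
            return self.cache[expert_id]
        # Cache miss: read from disk, then evict the least recently used.
        weights = self.load_fn(expert_id)
        self.loads += 1
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)
        return weights


# Toy run: the "disk" is a lambda, and the router requests experts 0, 1, 0, 2, 1.
cache = ExpertCache(capacity=2, load_fn=lambda i: f"weights-{i}")
for expert_id in [0, 1, 0, 2, 1]:
    cache.get(expert_id)
```

When routing is skewed toward a few experts, most requests hit the cache and disk reads become rare, which is the effect the entry attributes to MoE sparsity.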

I built IMGNet – a face verification model that identifies people using sign patterns, not cosine similarity [R]

r/MachineLearning · 1d ago · 7 · research benchmark open source inference

Independent researcher presents IMG Sign Score, a novel face verification approach replacing cosine similarity with sliding window sign pattern matching, achieving 96.27% on LFW with a compact 10.58 MB model trained on CASIA-WebFace. The method introduces SW Block convolution and IMG Sign MSE loss operating purely on sign pattern agreement, with code and model weights publicly available on GitHub and Hugging Face.

Talos-XII: hand-written autograd + small RL/MLP stack in Rust, applied to gacha probability modeling (no tch-rs/ndarray/PyTorch) — looking for benchmark help on ARM/AVX-512/GPU [P]

r/MachineLearning · 2d ago · 7 · research open source inference library

Talos-XII is a hand-written ML systems project in Rust that trains neural networks (EnvNet, DQN, PPO) to model gacha probability dynamics, featuring a custom autograd engine, SIMD dispatch (AVX2/AVX-512/NEON), and an experimental adaptive caching component (ACHF) for CPU-bound RL inference. The project demonstrates practical systems engineering for embedded ML—custom autodiff, parallelization, and BF16 optimization—though the core innovation (ACHF) is still experimental and lacks cross-hardware validation.

OpenMOSS-Team/MOSS-Transcribe-Diarize · Hugging Face

r/LocalLLaMA · 2d ago · 8 · tool tutorial inference deployment

MOSS-Transcribe-Diarize 0.9B is a practical end-to-end model for multi-speaker audio transcription and diarization in a single pass, with native Transformers support via custom remote code. The tutorial covers practical deployment options including vLLM and SGLang Omni serving with OpenAI-compatible APIs, plus prompt engineering for hotwords and optimization strategies for long-form audio.

Step 3.7 Flash IQ4_XS GGUF with preserve_thinking

r/LocalLLaMA · 2d ago · 7 · inference tutorial open source tool

Comprehensive guide for running a custom IQ4_XS GGUF quantization of Step-3.7-Flash across multiple inference frameworks (llama.cpp, vLLM, Ollama, etc.), with a preserve_thinking chat template feature that maintains reasoning state across turns.

[AINews] SpaceXAI launches Grok 4.5, first Opus-class model post Cursor acquisition

Latent Space · 2d ago · 7 · new model tool api update agent inference

Grok 4.5, a new frontier model from xAI trained specifically for coding and agents, launched in partnership with Cursor, offering Opus-class performance with better speed, cost efficiency, and token efficiency. The model is positioned for practical engineering workflows rather than benchmark supremacy, with immediate availability across Cursor, the Grok API, OpenRouter, and agent frameworks like Hermes.

Why AI Infrastructure must evolve for Agent Experience — Akshat Bubna, Modal CTO

Latent Space · 2d ago · 7 · deployment inference tool agent

Modal raised $355M Series C and is positioning itself as an "agent cloud" platform optimized for AI workloads rather than traditional web applications, with features like serverless functions, elastic inference, GPU snapshotting, sandboxes, and multi-node training. The podcast episode with Modal's CTO covers why traditional cloud infrastructure (Kubernetes) wasn't designed for bursty AI compute, why agents need tighter infrastructure abstractions, and Modal's technical stack including speculative decoding, Auto Endpoints, and capacity pooling across 17 cloud providers.

LingBot-Video: sparse-MoE video diffusion transformer (13B total, 1.4B active) post-trained as an action-conditioned world model[R]

r/MachineLearning · 2d ago · 7 · new model open source research rl training inference

LingBot-Video combines diffusion transformers with DeepSeek-V3-style sparse MoE (128 experts, top-8 routing) and multi-reward RL post-training for action-conditioned video generation, with open weights/code in Diffusers/SGLang. Key technical tensions: using VLMs as physics validators may enable reward hacking despite negative examples, and unclear separation between video generation and true world modeling without closed-loop robot validation.
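
As a rough illustration of what top-k expert routing means, the sketch below picks the k highest-scoring experts for one token and normalizes their weights. The function name `top_k_route` is hypothetical, and the gating function (a softmax over the selected logits) is an assumption chosen for simplicity; the model's actual router may differ.

```python
import math


def top_k_route(router_logits, k):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Weights come from a softmax restricted to the selected experts,
    so they sum to 1 for each token.
    """
    top = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )[:k]
    peak = max(router_logits[i] for i in top)  # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# 128 experts with top-8 routing, matching the configuration in the summary.
# The logits here are synthetic placeholders.
logits = [0.01 * i for i in range(128)]
selected = top_k_route(logits, k=8)
```

Only the selected experts run for a given token, which is how the model keeps active parameters (1.4B) far below the total (13B).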

DINOv2 way worse than SigLIP in k-NN. Is this expected? [R]

r/MachineLearning · 3d ago · 6 · research benchmark inference

A developer shares empirical results comparing vision encoders (SigLIP2, CLIP ViT-L, DINOv2) for fine-grained car classification via k-NN retrieval, observing a roughly 50-point accuracy gap between SigLIP2 (92%) and DINOv2 (41%). The post asks whether the gap stems from differences in embedding space design and whether DINOv2 needs supervised fine-tuning to be effective for retrieval tasks on small datasets.
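
For context, a k-NN retrieval evaluation of this kind usually looks like the sketch below: embed a labeled gallery, embed the query, and take a majority vote over the k nearest gallery items. The post's exact protocol is not given, so cosine similarity, k=3, and the names `cosine` and `knn_predict` are assumptions, and the 2-D vectors are toy stand-ins for encoder outputs.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(query, gallery, labels, k=3):
    """Predict a label by majority vote over the k most similar gallery items."""
    ranked = sorted(
        range(len(gallery)),
        key=lambda i: cosine(query, gallery[i]),
        reverse=True,
    )
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy gallery standing in for image embeddings of two car classes.
gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["sedan", "sedan", "suv", "suv"]
prediction = knn_predict([1.0, 0.2], gallery, labels, k=3)
```

Accuracy under this setup depends entirely on whether same-class items are close in the raw embedding space, since no classifier is trained on top of the encoder.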

novita/kimi-k2.6-dspark · Hugging Face

r/LocalLLaMA · 3d ago · 7 · inference open source tool

DSpark is a speculative decoding implementation for Kimi-K2.6 that accelerates inference through lightweight draft heads (Markov logit-bias and confidence prediction), achieving 0.816 acceptance rate at position 1. The approach uses vLLM's speculative decoding and shows good cross-model transfer to K2.7-Code with only ~11% degradation in accepted length.
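
To make the two metrics concrete, here is a minimal sketch of how a per-position acceptance rate and a mean accepted length could be computed from verification logs. The definition used (fraction of draft steps in which at least p draft tokens were accepted) is an assumption; the model card may define the metrics differently, and `acceptance_stats` is a hypothetical helper.

```python
def acceptance_stats(accepted_lengths, draft_len):
    """Summarize speculative decoding verification results.

    accepted_lengths: for each draft step, how many leading draft tokens
                      the target model accepted (0..draft_len).
    Returns (per_position_rates, mean_accepted_length), where
    per_position_rates[p - 1] is the share of steps that accepted
    at least p draft tokens.
    """
    steps = len(accepted_lengths)
    per_position = [
        sum(1 for a in accepted_lengths if a >= p) / steps
        for p in range(1, draft_len + 1)
    ]
    mean_accepted = sum(accepted_lengths) / steps
    return per_position, mean_accepted


# Synthetic log of five draft steps with three draft tokens each.
rates, mean_len = acceptance_stats([3, 0, 1, 2, 3], draft_len=3)
```

Under this definition the mean accepted length equals the sum of the per-position rates, so a drop in accepted length on another model reflects lower acceptance at one or more positions.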

Native-speed vLLM transformers modeling backend

HuggingFace Blog · 3d ago · 8 · inference tool library deployment open source

The transformers library's vLLM integration now uses torch.fx graph analysis and AST-based code rewriting to dynamically optimize model inference at runtime, matching native vLLM performance without custom implementations. This enables single-flag deployment of Hugging Face models with optimized inference (continuous batching, fused kernels) through --model-impl transformers, with benchmark comparisons showing performance parity across Qwen3 variants.

Learning FlashAttention the Hard Way. Part 1: The Algebraic Foundation [D]

r/MachineLearning · 3d ago · 8 · tutorial research inference

A theoretical tutorial series on FlashAttention using modern algebraic formalism (associative reductions, twisted monoids) that enables GPU scheduling optimizations—more powerful than the original framing. Covers safe softmax, Welford's variance, numerical stability bounds, and provides first-principles derivations of constants in FA-2 and Triton kernels.
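
The associative-reduction view of safe softmax can be sketched in a few lines: each partial result is a pair (running max, sum of exponentials relative to that max), and two pairs merge with a rescaling step. This is a generic illustration of online softmax, not the tutorial's own formalism; `combine`, `IDENTITY`, and `online_softmax` are hypothetical names.

```python
import math
from functools import reduce

# Identity element: an empty chunk has max -inf and an empty sum.
IDENTITY = (-math.inf, 0.0)


def combine(a, b):
    """Merge two partial softmax states (max, sum of exp(x - max))."""
    max_a, sum_a = a
    max_b, sum_b = b
    new_max = max(max_a, max_b)
    if new_max == -math.inf:
        return IDENTITY  # both sides empty; avoids exp(-inf - -inf)
    # Rescale each partial sum to the shared maximum before adding.
    return (
        new_max,
        sum_a * math.exp(max_a - new_max) + sum_b * math.exp(max_b - new_max),
    )


def online_softmax(xs):
    """Softmax computed from a single reduction over (x, 1.0) states."""
    peak, total = reduce(combine, ((x, 1.0) for x in xs), IDENTITY)
    return [math.exp(x - peak) / total for x in xs]


# Large inputs that would overflow a naive exp(x) implementation.
probs = online_softmax([1000.0, 1001.0, 999.0])
```

Because `combine` is associative, chunks of the input can be reduced in any grouping and merged afterwards, which is the property that lets a GPU kernel process attention scores tile by tile.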

Unsloth has uploaded several sizes of Deepseek-V4-Flash GGUF's

r/LocalLLaMA · 3d ago · 8 · new model inference tool deployment

Practical guide for running the DeepSeek-V4-Flash GGUF quantized model across multiple inference frameworks (llama.cpp, Ollama, llama-cpp-python, etc.), including a critical bug fix for llama.cpp PR #25402 that resolves gibberish output after turn 2, and improved chat template handling.

Gepard : 0.6B streaming TTS built for real-time dialogue - 20× realtime factor, ~50ms time-to-first-audio, vLLM-native, Apache 2.0

r/LocalLLaMA · 4d ago · 8 · new model tool inference api update

Gepard-1.0 is a streaming text-to-speech model optimized for real-time dialogue and voice agents, built on Qwen3-0.8B with NVIDIA NanoCodec for low-latency audio generation. The model generates speech incrementally as text arrives, delivering natural prosody and supporting zero-shot voice cloning, making it practical for conversational AI applications where latency matters more than perfect speaker matching.

agent-chief — Attention is your scarcest resource. Chief is the local-first layer that guards it — turning every agent, alert, and feed into one honest call: interrupt, or not.

GitHub Trending AI · 7d ago · 7 · agent open source workflow inference

Chief is an open-source agent orchestration tool that sits between your systems and AI agents to filter, batch, and prioritize notifications/events with deterministic rules and LLM-based judgment. It demonstrates practical patterns for reducing LLM token spend (~$0.10 per 1k events with prompt caching) and preventing alert fatigue through learned per-topic routing policies trained on user feedback signals.