News Nug

Predicting human preference for generated image pairs using HPSv3 [P]

r/MachineLearning · 9h ago · 6 · tool benchmark

A developer working on ImageBench.ai shares their experience with HPSv3 for predicting human image preferences and asks for recommendations on alternative human preference models. This is a practical engineering question about evaluating preference prediction tools for image generation workflows.

Nostalgia for Bloom

r/LocalLLaMA · 19h ago · 7 · tool tutorial prompt engineering inference

Guide to using BLOOMZ-P3, a multilingual instruction-following model fine-tuned on 46 languages, with practical examples showing how to deploy it via Transformers, vLLM, SGLang, and Docker. Includes prompt engineering best practices for optimizing zero-shot task performance across languages, emphasizing clear prompt structure and contextual framing.

tencent/HiLS-Attention-7B · Hugging Face

r/LocalLLaMA · 1d ago · 8 · new model inference open source tool

HiLS-Attention-7B is a new model built around a sparse attention mechanism that enables efficient long-context modeling by learning chunk selection end-to-end, achieving strong extrapolation beyond 4× training length while maintaining performance on standard tasks. The model is available with integration guides for Transformers, vLLM, SGLang, and Docker, though it requires custom setup through the official GitHub repository rather than standard AutoModel loading.
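
For readers unfamiliar with chunk-sparse attention, here is a minimal sketch of the general idea: score each chunk of keys against the query, keep the top-k chunks, and attend only to those. This is not HiLS-Attention's actual method. In particular, the selection below uses fixed mean-pooled keys rather than a learned selector, and the names `select_chunks`, `sparse_attention`, `chunk_size`, and `top_k` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def select_chunks(query: Vector, keys: List[Vector], chunk_size: int, top_k: int) -> List[int]:
    """Return indices of the top_k chunks whose mean-pooled key best matches the query."""
    chunks = [keys[i:i + chunk_size] for i in range(0, len(keys), chunk_size)]
    scores = []
    for chunk in chunks:
        # Assumption: a chunk is summarized by the mean of its keys (not learned).
        summary = [sum(column) / len(chunk) for column in zip(*chunk)]
        scores.append(dot(query, summary))
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


def sparse_attention(query: Vector, keys: List[Vector], values: List[Vector],
                     chunk_size: int, top_k: int) -> Vector:
    """Softmax attention restricted to the positions inside the selected chunks."""
    picked = select_chunks(query, keys, chunk_size, top_k)
    positions = [i for c in picked
                 for i in range(c * chunk_size, min((c + 1) * chunk_size, len(keys)))]
    logits = [dot(query, keys[i]) / math.sqrt(len(query)) for i in positions]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, positions)) / total
            for d in range(dim)]
```

The cost of the attention step then scales with `top_k * chunk_size` rather than with the full context length, which is the property that makes this family of methods attractive for long contexts.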

GLM-5.2 (744B MoE) on a 25GB-RAM consumer machine

r/LocalLLaMA · 1d ago · 8 · tool inference open source deployment

colibrì is a pure C inference engine that runs GLM-5.2 (744B MoE model) on consumer hardware (~25GB RAM) by streaming experts from disk, activating only ~40B parameters per token. The implementation leverages MoE sparsity and disk I/O optimization to achieve frontier-class model inference without GPU dependency, with automatic expert pinning that improves performance over time.
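
As a rough illustration of expert streaming with pinning, the sketch below keeps recently used experts in an LRU cache and promotes frequently used ones to a pinned set that is never evicted. It is written in Python for readability and is not colibrì's C implementation. The class `ExpertCache`, the `load_expert` callback, and the `capacity` and `pin_after` thresholds are all hypothetical.

```python
from collections import OrderedDict
from typing import Callable, Dict


class ExpertCache:
    """LRU cache of expert weights that pins experts once they are used often enough."""

    def __init__(self, load_expert: Callable[[int], bytes], capacity: int, pin_after: int):
        self.load_expert = load_expert  # hypothetical function that reads one expert from disk
        self.capacity = capacity        # max number of unpinned experts held in RAM
        self.pin_after = pin_after      # number of uses after which an expert is pinned
        self.lru: "OrderedDict[int, bytes]" = OrderedDict()
        self.pinned: Dict[int, bytes] = {}  # unbounded here; a real engine would cap this
        self.hits: Dict[int, int] = {}
        self.disk_reads = 0

    def get(self, expert_id: int) -> bytes:
        self.hits[expert_id] = self.hits.get(expert_id, 0) + 1
        if expert_id in self.pinned:
            return self.pinned[expert_id]
        if expert_id in self.lru:
            weights = self.lru.pop(expert_id)
        else:
            weights = self.load_expert(expert_id)
            self.disk_reads += 1
        if self.hits[expert_id] >= self.pin_after:
            self.pinned[expert_id] = weights  # hot expert: stays resident
        else:
            self.lru[expert_id] = weights     # most recently used goes last
            if len(self.lru) > self.capacity:
                self.lru.popitem(last=False)  # evict the least recently used expert
        return weights
```

Under this scheme disk reads drop as the hot experts accumulate in the pinned set, which is one plausible reading of "improves performance over time".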

Introducing Muse Spark 1.1

Simon Willison · 2d ago · 9 · new model api update tool agent

Meta released Muse Spark 1.1 with a new API and claimed improvements in agentic tool calling and computer use capabilities. The post includes a new LLM CLI plugin (llm-meta-ai) for programmatic access to the model, making it immediately useful for engineers building with AI.

OpenMOSS-Team/MOSS-Transcribe-Diarize · Hugging Face

r/LocalLLaMA · 2d ago · 8 · tool tutorial inference deployment

MOSS-Transcribe-Diarize 0.9B is a practical end-to-end model for multi-speaker audio transcription and diarization in a single pass, with native Transformers support via custom remote code. The tutorial covers practical deployment options including vLLM and SGLang Omni serving with OpenAI-compatible APIs, plus prompt engineering for hotwords and optimization strategies for long-form audio.

Step 3.7 Flash IQ4_XS GGUF with preserve_thinking

r/LocalLLaMA · 2d ago · 7 · inference tutorial open source tool

Comprehensive guide for running the Step-3.7-Flash GGUF quantization across multiple inference frameworks (llama.cpp, vLLM, Ollama, etc.) with custom IQ4_XS quantization and a preserve_thinking chat template feature that maintains reasoning state across turns.

[AINews] SpaceXAI launches Grok 4.5, first Opus-class model post Cursor acquisition

Latent Space · 2d ago · 7 · new model tool api update agent inference

Grok 4.5, a new frontier model from xAI trained specifically for coding and agents, launched alongside a Cursor partnership and is pitched as offering Opus-class performance with better speed, cost efficiency, and token efficiency. The model is positioned for practical engineering workflows rather than benchmark supremacy, with immediate availability across Cursor, Grok API, OpenRouter, and agent frameworks like Hermes.

GPT-5.6 Thursday ⭐️, Claude Cowork mobile 📱, Gemini API agents 🤖

TLDR AI · 2d ago · 8 · agent api update tool workflow

MiniMax Code offers a practical AI platform for building multi-step agents with a 1M-token context window, native vision capabilities, and competitive pricing ($500/year for 5.1B tokens). It lets developers build reasoning agents, visual document processing, and codebase analysis workflows without external vision models.
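
To compare the quoted price with per-token API pricing, the arithmetic can be sketched as below. This assumes the 5.1B tokens are the full allowance covered by the $500 annual fee, which the summary does not state explicitly. The function name is hypothetical.

```python
def usd_per_million_tokens(price_usd: float, tokens: float) -> float:
    """Convert a flat price for a token allowance into dollars per million tokens."""
    return price_usd / tokens * 1_000_000


# Assumption: $500 buys exactly 5.1 billion tokens over the year.
rate = usd_per_million_tokens(500, 5.1e9)
print(f"${rate:.3f} per million tokens")  # prints "$0.098 per million tokens"
```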

Why AI Infrastructure must evolve for Agent Experience — Akshat Bubna, Modal CTO

Latent Space · 2d ago · 7 · deployment inference tool agent

Modal raised $355M Series C and is positioning itself as an "agent cloud" platform optimized for AI workloads rather than traditional web applications, with features like serverless functions, elastic inference, GPU snapshotting, sandboxes, and multi-node training. The podcast episode with Modal's CTO covers why traditional cloud infrastructure (Kubernetes) wasn't designed for bursty AI compute, why agents need tighter infrastructure abstractions, and Modal's technical stack including speculative decoding, Auto Endpoints, and capacity pooling across 17 cloud providers.

Agentic safety triggers aren't textual safety triggers — MCP attacks that beat SOTA guardrails more than half the time (code + dataset) [R]

r/MachineLearning · 2d ago · 8 · research agent fine tuning tool prompt engineering

Research demonstrates a critical gap in LLM safety alignment: current text-classification-based guardrails fail against adversarial prompts that encode attacks in tool-call sequences rather than linguistic markers. The study evaluates multiple safety approaches (DPO, SafeDPO, training-free methods) against CVE-based attacks on MCP-enabled agents, showing current SOTA methods only block ~48% of attacks while training-free approaches achieve 3x baseline refusal rates without fine-tuning.

Data for Agents

HuggingFace Blog · 3d ago · 7 · tool open source agent benchmark

NVIDIA released Nemotron Post-Training v3 Prompt Atlas, an interactive tool for exploring millions of synthetic post-training samples across diverse domains. The resource addresses a critical gap for AI engineers building agents: understanding the data composition and training recipes behind model behavior, with emphasis on how synthetic data enables agent reproducibility and inspectability without exposing proprietary datasets.

novita/kimi-k2.6-dspark · Hugging Face

r/LocalLLaMA · 3d ago · 7 · inference open source tool

DSpark is a speculative decoding implementation for Kimi-K2.6 that accelerates inference through lightweight draft heads (Markov logit-bias and confidence prediction), achieving 0.816 acceptance rate at position 1. The approach uses vLLM's speculative decoding and shows good cross-model transfer to K2.7-Code with only ~11% degradation in accepted length.
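
To get a feel for what a 0.816 acceptance rate means, here is a back-of-the-envelope sketch of the expected number of accepted draft tokens per verification step. The summary reports the rate only for position 1, so applying the same rate independently at every position is an assumption, and the draft length of 3 is hypothetical.

```python
def expected_accepted_tokens(acceptance_rate: float, num_draft_tokens: int) -> float:
    """Expected accepted draft tokens per step.

    A draft token at position i only counts if every earlier draft token was
    also accepted, so position i contributes acceptance_rate ** i.
    Assumption: the same rate applies independently at every position.
    """
    return sum(acceptance_rate ** i for i in range(1, num_draft_tokens + 1))


# Hypothetical draft length of 3 with the reported position-1 rate.
print(f"{expected_accepted_tokens(0.816, 3):.2f}")  # prints "2.03"
```

Real acceptance rates usually fall at later positions, so this figure is better read as an optimistic estimate than as a measured result.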

Introducing GPT-Live

OpenAI Blog · 3d ago · 8 · new model tool api update

OpenAI released advanced voice models that enable natural speech-based interaction with ChatGPT, supporting real-time conversation with improved naturalness and responsiveness. This represents a significant tool update for AI engineers building voice-enabled applications and multimodal interfaces.

Native-speed vLLM transformers modeling backend

HuggingFace Blog · 3d ago · 8 · inference tool library deployment open source

The transformers library's vLLM integration now uses torch.fx graph analysis and AST-based code rewriting to dynamically optimize model inference at runtime, matching native vLLM performance without custom implementations. This enables single-flag deployment of Hugging Face models with optimized inference (continuous batching, fused kernels) through --model-impl transformers, with benchmark comparisons showing performance parity across Qwen3 variants.

Unsloth has uploaded several sizes of DeepSeek-V4-Flash GGUFs

r/LocalLLaMA · 3d ago · 8 · new model inference tool deployment

Practical guide for running the DeepSeek-V4-Flash GGUF quantized model across multiple inference frameworks (llama.cpp, Ollama, llama-cpp-python, etc.), including a critical bug fix for llama.cpp PR #25402 that resolves gibberish output after turn 2, and improved chat template handling.

From Hugging Face to Amazon SageMaker Studio in one click

HuggingFace Blog · 3d ago · 6 · workflow deployment fine tuning tool

AWS SageMaker Studio now integrates one-click model imports from Hugging Face with auto-provisioned domains and pre-configured IAM permissions, eliminating setup friction for fine-tuning and deployment workflows. The integration includes new managed policies for model customization (SFT, DPO, RLVR, RLAIF) and real-time GPU quota visibility to streamline the path from discovery to enterprise deployment.

Gepard: 0.6B streaming TTS built for real-time dialogue - 20× realtime factor, ~50ms time-to-first-audio, vLLM-native, Apache 2.0

r/LocalLLaMA · 4d ago · 8 · new model tool inference api update

Gepard-1.0 is a streaming text-to-speech model optimized for real-time dialogue and voice agents, built on Qwen3-0.8B with NVIDIA NanoCodec for low-latency audio generation. The model generates speech incrementally as text arrives, delivering natural prosody and supporting zero-shot voice cloning, making it practical for conversational AI applications where latency matters more than perfect speaker matching.

github-code Web Component

Simon Willison · 4d ago · 6 · tool tutorial open source

Simon Willison demonstrates using GPT-5.5 to build a Web Component for embedding GitHub code snippets. The component fetches raw GitHub URLs and displays specified line ranges with line numbers. This showcases practical LLM-assisted web development for creating reusable components, though it's more of a creative experiment than a production tool or framework advancement.
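
The core of such a component is small. The actual component is a browser Web Component, so the Python below is only a sketch of the line-slicing step, applied to text that has already been fetched. The function name `format_line_range` is hypothetical.

```python
def format_line_range(source: str, start: int, end: int) -> str:
    """Return lines start..end (1-indexed, inclusive), each prefixed with its line number."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    selected = source.splitlines()[start - 1:end]
    width = len(str(end))  # right-align numbers to the widest line number
    return "\n".join(
        f"{number:>{width}}  {text}"
        for number, text in enumerate(selected, start=start)
    )


print(format_line_range("a\nb\nc\nd", 2, 3))
# prints:
# 2  b
# 3  c
```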

sqlite-utils 4.0

Simon Willison · 4d ago · 7 · tool library workflow

sqlite-utils 4.0 introduces database schema migrations, a practical feature for developers managing evolving data structures in SQLite-backed applications. This is particularly useful for AI engineers building data pipelines, RAG systems, or applications that need reliable database versioning alongside their model workflows.
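
For context on what schema migrations involve, the sketch below shows the general pattern using only Python's standard `sqlite3` module and SQLite's `PRAGMA user_version`. It is not the sqlite-utils 4.0 API, which the summary does not describe. The table, column, and function names are hypothetical.

```python
import sqlite3
from typing import Callable, List

Migration = Callable[[sqlite3.Connection], None]


def create_documents(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT)")


def add_embedding_model(conn: sqlite3.Connection) -> None:
    conn.execute("ALTER TABLE documents ADD COLUMN embedding_model TEXT")


# Ordered list of steps; a step's position (starting at 1) is its schema version.
MIGRATIONS: List[Migration] = [create_documents, add_embedding_model]


def migrate(conn: sqlite3.Connection, migrations: List[Migration]) -> int:
    """Apply every migration newer than the stored version and return the new version."""
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    for version, step in enumerate(migrations, start=1):
        if version > current:
            step(conn)
            # PRAGMA values cannot be bound as parameters; version is always an int here.
            conn.execute(f"PRAGMA user_version = {version}")
            conn.commit()
    return conn.execute("PRAGMA user_version").fetchone()[0]


conn = sqlite3.connect(":memory:")
print(migrate(conn, MIGRATIONS))  # prints 2
```

Running `migrate` a second time is a no-op, because the stored version already matches the number of migrations.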