r/MachineLearning · 5h ago · 9 · inference deployment open source

A monokernel inference engine optimized for the AMD MI300X reaches 3,300 tokens/sec by mapping memory access patterns to the physical die topology and compute unit layout and by eliminating kernel launch overhead. The approach targets latency-optimized LLM decoding without speculative decoding or quantization, and the authors plan to scale it to frontier MoE models.
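
To see why launch overhead matters at this speed, the per-token time budget can be worked out from the throughput figure. This is only a back-of-the-envelope sketch: the 3,300 tokens/sec number comes from the post, while the launch count and per-launch cost below are hypothetical placeholders, not measurements from the project.

```python
def per_token_budget_us(tokens_per_sec: float) -> float:
    """Time available per decoded token, in microseconds."""
    return 1_000_000.0 / tokens_per_sec


def launch_overhead_us(num_launches: int, cost_per_launch_us: float) -> float:
    """Total launch overhead for one decode step, in microseconds."""
    return num_launches * cost_per_launch_us


# 3,300 tokens/sec is the figure quoted in the post.
budget = per_token_budget_us(3300)

# Hypothetical: a multi-kernel decode step with 200 launches at 5 us each.
overhead = launch_overhead_us(num_launches=200, cost_per_launch_us=5.0)

print(f"per-token budget: {budget:.0f} us")
print(f"hypothetical launch overhead: {overhead:.0f} us")
```

Under these assumed numbers the launch overhead alone would exceed the per-token budget, which is the kind of pressure a single fused kernel removes.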

r/LocalLLaMA · 6h ago · 6 · inference deployment

A GitHub PR discussion about reducing VRAM usage in a language model implementation by reserving the KQ attention mask in f16 instead of f32, which saves 1.2 GB at batch size 2048 and ~300 MB at batch size 512. The optimization involves memory layout changes for the attention mask tensor and compute buffer allocation strategies in what appears to be the llama.cpp project.
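
The two savings figures are consistent with a mask whose size grows linearly with batch size. A minimal sketch of that arithmetic, assuming a dense mask of shape (n_kv, n_batch); the `n_kv` value is a hypothetical placeholder and the function names are illustrative, not taken from the PR:

```python
def mask_bytes(n_kv: int, n_batch: int, bytes_per_elem: int) -> int:
    """Size of a dense (n_kv x n_batch) attention mask in bytes."""
    return n_kv * n_batch * bytes_per_elem


def f16_savings(n_kv: int, n_batch: int) -> int:
    """Bytes saved by storing the mask in f16 (2 bytes) instead of f32 (4 bytes)."""
    return mask_bytes(n_kv, n_batch, 4) - mask_bytes(n_kv, n_batch, 2)


N_KV = 262_144  # hypothetical KV length, chosen only for illustration

for n_batch in (512, 2048):
    saved_mib = f16_savings(N_KV, n_batch) / (1024 ** 2)
    print(f"batch {n_batch}: {saved_mib:.0f} MiB saved")
```

Because the saving is `n_kv * n_batch * 2` bytes, quadrupling the batch size from 512 to 2048 quadruples the saving, matching the ~300 MB to 1.2 GB ratio reported in the discussion.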

r/MachineLearning · 9h ago · 8 · fine tuning research probe targeted open source

Research demonstrating that instruct-tuned LLMs internally distinguish correct from incorrect answers (0.76-0.88 AUROC) despite displaying uniform 99% confidence externally. The authors use LoRA fine-tuning on probe-extracted hidden state targets to align the model's expressed confidence with its internal knowledge, validated through activation patching experiments showing causal relationships (ρ=0.976) across 8 models (7B-70B). Code and pre-registration are publicly available.
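
For readers less familiar with the metric, AUROC is the probability that a randomly chosen correct answer receives a higher score than a randomly chosen incorrect one. The sketch below computes it by pairwise comparison in plain Python. The scores and labels are made-up toy values, not data from the paper, and this is not the authors' evaluation code.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC via pairwise comparison; ties between a positive and a negative count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: 1 = answer was correct, 0 = answer was incorrect.
labels = [1, 1, 1, 0, 0, 0]
probe_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.4]  # hypothetical internal probe outputs
stated_confidence = [0.99] * 6                  # uniform expressed confidence

print(auroc(probe_scores, labels))
print(auroc(stated_confidence, labels))
```

A uniform stated confidence ties on every pair, so its AUROC is 0.5 (chance), while any probe that ranks correct answers above incorrect ones more often than not scores above 0.5.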

Latent Space · 12h ago · 8 · new model agent workflow benchmark

Anthropic released Claude Opus 4.8 with improvements in agentic reasoning and long-horizon coding tasks, plus a Dynamic Workflows feature in Claude Code that enables parallel subagent orchestration for large-scale tasks. The model shows SOTA performance on economically relevant benchmarks and keeps pricing parity with Opus 4.7.

r/LocalLLaMA · 14h ago · 8 · new model agent inference workflow

Step 3.7 Flash is a new efficient multimodal model optimized for agentic workflows, with improvements in code generation (+5% SWE-Bench Pro, +6.1% Terminal-Bench), tool use reliability, and cross-harness compatibility. Key features include visual understanding, web search enhancement, and an 'Advisor Mode' that escalates to larger models only at critical decision points, reducing inference costs while maintaining performance.
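
The escalation idea behind a feature like 'Advisor Mode' can be sketched as a simple router. Everything here is hypothetical: the post does not describe how critical decision points are detected, so `is_critical`, the stub models, and `run_with_advisor` are illustrative names only, not the Step 3.7 Flash API.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]


def run_with_advisor(
    steps: List[str],
    small: Model,
    large: Model,
    is_critical: Callable[[str], bool],
) -> Tuple[List[str], int]:
    """Run each step on the small model, escalating only critical steps to the large one."""
    outputs: List[str] = []
    escalations = 0
    for step in steps:
        if is_critical(step):
            outputs.append(large(step))
            escalations += 1
        else:
            outputs.append(small(step))
    return outputs, escalations


# Stub models and a toy criticality rule, for illustration only.
small_model: Model = lambda step: f"small:{step}"
large_model: Model = lambda step: f"large:{step}"
critical = lambda step: step.startswith("decide")

outputs, escalations = run_with_advisor(
    ["read file", "decide refactor plan", "apply edit"],
    small_model,
    large_model,
    critical,
)
print(outputs, escalations)
```

The cost saving comes from the ratio of escalated to total steps: here only one of three steps reaches the larger model.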

HuggingFace Blog · 14h ago · 8 · tutorial workflow inference benchmark

Comprehensive tutorial on profiling PyTorch models using torch.profiler, covering how to read trace files and identify performance bottlenecks in matrix operations and GPU kernels. Essential for engineers optimizing LLM inference and training loops, with practical examples using NVIDIA GPUs and step-by-step walkthroughs of profiler outputs.

Simon Willison · 14h ago · 9 · new model api update inference prompt engineering agent

The Claude Opus 4.8 release brings improved honesty and uncertainty flagging (a 4x reduction in unsupported claims), mid-conversation system messages for better prompt caching in agentic loops, and a lower prompt cache minimum (now 1,024 tokens). Pricing is the same as 4.7 ($5/$25 per million tokens), with a January 2026 knowledge cutoff and a 1M context window.

Simon Willison · 14h ago · 6 · tool api update

A note referencing the llm-anthropic library release used with Opus 4.8 to generate outputs. While it mentions a tool update and model version, the content lacks sufficient technical depth or details about the capabilities, changes, or implementation specifics that would be directly useful for daily AI engineering work.

r/MachineLearning · 17h ago · 5 · agent research benchmark

Call for papers for Social Sim'26 workshop focusing on LLM-based social simulations with emphasis on evaluation fidelity, validation, and agent modeling rather than just demos. Relevant for engineers building multi-agent systems and LLM applications, though primarily academic/research-oriented with a June 2026 deadline.

Simon Willison · 19h ago · 6 · tool workflow

A Markdown rendering tool with integrated SVG visualization for fenced code blocks, allowing developers to view both rendered diagrams and source code. Useful for documenting and visualizing LLM outputs, model logs, and technical workflows directly in Markdown.

Latent Space · 20h ago · 5 · agent workflow

Discussion on the evolution of AI coding agents from autocomplete tools to async background agents, with analysis of architectural patterns (local agents vs. orchestrated systems) and market dynamics around agent frameworks like LangGraph and managed solutions from Anthropic/Gemini. Includes insights on agent adoption levels and the shift toward background task execution with developer review loops.

Anthropic Blog · 21h ago · 9 · new model api update agent benchmark

Claude Opus 4.8 releases today with improved benchmarks across coding, reasoning, and agentic tasks, plus better honesty in flagging uncertainties. Key features include user-controllable effort levels on claude.ai, dynamic workflows in Claude Code for large-scale problems, and a fast mode that is 3x cheaper (2.5x speed). The model is 4x less likely than Opus 4.7 to let code flaws pass unremarked.

r/MachineLearning · 21h ago · 8 · benchmark agent deployment research

AgingBench introduces a longitudinal deployment benchmark showing that upgrading an agent's backbone model isn't straightforward: swapping Claude Sonnet 4.6 for Opus 4.7 in a coding agent decreased performance by ~15% because of how memory state evolves over long deployments. Memory management policy by itself created a 4.5x spread in agent half-life, suggesting that model capability alone doesn't predict how an agentic system performs over time.

r/MachineLearning · 22h ago · 8 · new model open source fine tuning deployment

Wall-OSS-0.5 is a new 4B Vision-Language-Action model built on a gradient bridge approach: the discrete action-token cross-entropy (CE) loss dominates VLM backbone updates, while flow matching contributes ~5%. It pairs this with Vision-Aligned RVQ tokenization for semantic grounding of action tokens and the DMuon optimizer for distributed training. The release includes strong real-robot evaluations (82% on held-out deformable tasks zero-shot, 60.5% average after fine-tuning across 15 tasks) and open-source code, making it immediately relevant for practitioners building embodied AI systems.

r/LocalLLaMA · 22h ago · 9 · new model tool inference agent

LiquidAI released LFM2.5-8B-A1B, a new 8B-parameter hybrid model optimized for on-device deployment, with improved reasoning, tool use, and function calling. The model integrates with major inference frameworks (Transformers, vLLM, SGLang) and reaches 18.5K tokens/sec throughput, making it practical for production agentic workflows and personal assistant applications.

r/MachineLearning · 1d ago · 8 · tool library open source dataset benchmark

MONET is a new Apache 2.0 open-source image-text dataset with 104.9M high-quality samples curated from 2.9B images, accompanied by visualization tools, a retrieval system, and a T2I training codebase. This is a significant resource for engineers building multimodal AI systems, offering both the dataset and practical tooling for training text-to-image models.

r/LocalLLaMA · 1d ago · 7 · new model tool inference api update

PaddleOCR-VL-1.6 is an upgraded document parsing model that reaches a state-of-the-art 96.33% on OmniDocBench, with improved table, formula, and chart recognition and zero-cost migration from v1.5. The release includes CLI and Python API usage patterns, vLLM integration, and transformers library compatibility for document understanding tasks.