r/MachineLearning · 7h ago · 7 · library open source tool

AutoMuon is a new Python package that makes the Muon optimizer usable as a drop-in replacement for AdamW in PyTorch training by automatically assigning the appropriate optimizer to each parameter type (Muon for 2D weight matrices, AdamW for embeddings, norms, and biases). The tool abstracts away the complexity of manual optimizer selection, and the author is inviting community contributions to handle edge cases in architectures beyond transformers and CNNs.
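
The post doesn't show AutoMuon's API, but the selection rule it describes can be sketched directly in PyTorch; the name-based embedding check and the commented-out `Muon` constructor below are assumptions, not the package's actual interface.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(1000, 64)
        self.proj = nn.Linear(64, 64)
        self.norm = nn.LayerNorm(64)

def split_params(model: nn.Module):
    """Route 2-D hidden weight matrices to Muon; embeddings, norms, and
    biases to AdamW. The name-based embedding check is an assumption."""
    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if p.ndim == 2 and "embed" not in name.lower():
            muon_params.append(p)      # dense weight matrices go to Muon
        else:
            adamw_params.append(p)     # embeddings, norms, biases stay on AdamW
    return muon_params, adamw_params

model = TinyLM()
muon_params, adamw_params = split_params(model)
adamw = torch.optim.AdamW(adamw_params, lr=1e-3, weight_decay=0.01)
# muon = Muon(muon_params, lr=0.02)  # plug in whichever Muon implementation you use
```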

r/MachineLearning · 17h ago · 7 · research agent tutorial

This article explores the mathematical foundations of Vision-Language-Action (VLA) models for robotics, covering representation learning, latent space projections, and the critical role of teleoperation in training humanoid robots. It synthesizes insights from recent VLA architectures and demonstrates why imitation learning and human demonstrations are essential for efficient policy learning in robotic control tasks.
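
As a reminder of what the imitation-learning piece boils down to, here is a generic behavior-cloning step for a VLA-style policy; the shapes and the MSE objective are illustrative assumptions, not the article's code.

```python
import torch
import torch.nn as nn

# Generic behavior-cloning step: map a fused vision/language embedding to an
# action and regress onto the actions recorded during human teleoperation.
policy = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 7))  # 7-DoF action head
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-4)

obs_embedding = torch.randn(32, 512)   # fused image + instruction features
demo_actions = torch.randn(32, 7)      # teleoperated actions (the supervision signal)

optimizer.zero_grad()
loss = nn.functional.mse_loss(policy(obs_embedding), demo_actions)  # imitation = supervised regression
loss.backward()
optimizer.step()
```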

Simon Willison · 22h ago · 8 · new model api update agent

OpenAI has unified Codex into the main GPT model line starting with GPT-5.4, with GPT-5.5 showing significant improvements in agentic coding, computer use automation, and general task execution. This represents a shift in how OpenAI structures and releases coding capabilities—no longer as separate specialized models but integrated into the flagship model.

r/MachineLearning · 1d ago · 7 · agent workflow tool

A software engineer discusses architectural approaches for combining deterministic financial calculations (using Python/Polars) with LLM-based natural language generation for market risk reporting. The core challenge is balancing mathematical precision with dynamic scenario handling; the post compares strategies such as agentic workflows (LLMs writing and executing code in sandboxes) against pre-computed cubes paired with structured prompts, with specific interest in frameworks like LangChain and PandasAI.
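
For the pre-computed-cube option, the division of labor might look roughly like the sketch below: Polars produces the numbers deterministically and the LLM only narrates them. The DataFrame contents, prompt wording, and the `call_llm` placeholder are assumptions.

```python
import json
import polars as pl

# Deterministic layer: risk figures are computed in Polars, never by the LLM.
positions = pl.DataFrame({
    "desk": ["rates", "rates", "credit"],
    "pnl": [1.2e6, -0.4e6, 0.3e6],
    "var_95": [2.1e6, 2.1e6, 0.9e6],
})
summary = (
    positions.group_by("desk")
    .agg(pl.col("pnl").sum().alias("pnl"), pl.col("var_95").max().alias("var_95"))
    .to_dicts()
)

# Generation layer: the LLM receives the pre-computed figures as structured
# context and is asked only to narrate them. `call_llm` is a placeholder for
# whatever client (LangChain, raw API, etc.) you use.
prompt = (
    "Write a one-paragraph market risk commentary. Use ONLY the figures below, "
    "verbatim, and do not perform any arithmetic:\n"
    + json.dumps(summary, indent=2)
)
# report = call_llm(prompt)
```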

Latent Space · 1d ago · 9 · new model inference open source research agent

DeepSeek released V4 Pro and Flash models featuring 1M token context via novel Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) techniques, achieving 27% FLOP reduction and 10% KV cache savings compared to V3.2. The models use a 1.6T MoE architecture trained on 32T tokens with both Base and Instruct variants released under MIT license, placing them competitively near top open-weight models with particularly strong long-context and agentic performance.

Simon Willison · 1d ago · 8 · api update prompt engineering tutorial

OpenAI released GPT-5.5 with new API access and published a comprehensive prompting guide covering practical tips like streaming thinking tokens for long-running tasks. The guide emphasizes treating GPT-5.5 as a new model family requiring fresh baseline tuning rather than direct migration from previous versions, with specific advice on optimizing reasoning effort, verbosity, and output formatting.
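
A minimal sketch of what that tuning surface looks like in practice, using the current Responses API parameter names for GPT-5-class models; whether GPT-5.5 keeps the same parameters and event names is an assumption.

```python
from openai import OpenAI

client = OpenAI()

# Make reasoning effort and verbosity explicit, and stream so long-running
# tasks surface progress instead of going silent.
stream = client.responses.create(
    model="gpt-5.5",
    input="Refactor this module and explain each change.",
    reasoning={"effort": "high"},   # tune per task instead of inheriting old defaults
    text={"verbosity": "low"},      # keep the final output terse
    stream=True,
)
for event in stream:
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
```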

r/MachineLearning · 1d ago · 8 · benchmark agent rag open source tool

Open-source benchmark suite (paper-lantern-challenges) measuring coding-agent performance improvements from retrieval-augmented technique selection across 9 practical tasks, showing gains up to +32% on contract extraction and +24.5% on test generation. Fully reproducible with diffable prompts, predictions, and evaluation scripts; demonstrates concrete RAG patterns for agentic coding workflows using Claude Opus and Gemini Flash with access to literature-based technique retrieval tools.
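
The harness itself isn't reproduced here, but the pattern being measured (retrieving relevant technique write-ups before the agent attempts a task) can be sketched as below; the toy index, keyword retrieval, and `call_agent` placeholder are all assumptions, not the repo's API.

```python
# Toy sketch of retrieval-augmented technique selection for a coding agent.
technique_index = {
    "contract extraction": "Prefer span-level extraction with schema validation over free-form parsing.",
    "test generation": "Enumerate boundary conditions and failure modes before writing assertions.",
}

def retrieve_techniques(task_description: str, k: int = 2) -> list[str]:
    task = task_description.lower()
    scored = [(sum(word in task for word in key.split()), doc)
              for key, doc in technique_index.items()]
    return [doc for score, doc in sorted(scored, reverse=True)[:k] if score > 0]

def build_prompt(task_description: str) -> str:
    techniques = "\n".join(f"- {t}" for t in retrieve_techniques(task_description))
    return f"Relevant techniques:\n{techniques}\n\nTask:\n{task_description}"

print(build_prompt("Generate unit tests for the invoice parser"))
# completion = call_agent(build_prompt(...))  # e.g. Claude Opus or Gemini Flash
```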

Anthropic Blog · 1d ago · 6 · api update deployment inference

Anthropic and Amazon expanded their partnership with a $100B+ commitment securing up to 5GW of AWS capacity for Claude training and deployment, including new Trainium2/3/4 custom silicon chips coming online in 2025-2026. The Claude Platform will now be available directly within AWS with unified billing and governance, and Amazon is investing an additional $5B with options for $20B more, positioning Claude as available across all three major cloud platforms.

r/MachineLearning · 1d ago · 8 · open source fine tuning benchmark inference library

DharmaOCR is an open-source OCR system built on 3B and 7B SLMs fine-tuned with SFT + DPO that outperform proprietary models (0.925 F1 for the 7B variant). The project includes published methodology, benchmarks against GPT-5.4/Gemini/Claude, and practical optimizations such as AWQ quantization, which cuts inference costs by 22% with minimal performance loss.
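
The AWQ step is the kind of pass sketched below with the AutoAWQ library; the model paths and quantization config are placeholders, and whether DharmaOCR uses AutoAWQ specifically (rather than another AWQ implementation) is an assumption.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Minimal AWQ quantization pass with AutoAWQ. Paths and config are placeholders.
model_path = "path/to/dharmaocr-7b"         # placeholder
quant_path = "path/to/dharmaocr-7b-awq"     # placeholder
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)  # calibrates on a default dataset
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```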

r/MachineLearning · 1d ago · 6 · workflow tutorial benchmark

A software engineer discusses practical challenges with hyperparameter optimization (HPO) for large-scale ML models: managing the gap between short HPO trials (~30 min with pruning) and full training runs (~24 hours), and the concern that pruning strategies bias selection toward configurations that converge quickly rather than those with the best final performance. The post asks whether learning rate schedulers tuned on reduced-epoch HPO trials transfer well to full training and proposes solutions such as restarting the LR schedule during training.
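
For context, a minimal sketch of the setup being described: short pruned trials with Optuna for the search, then a restarting LR schedule for the full run. The training stubs and budgets are placeholders, not the poster's pipeline.

```python
import optuna
import torch

def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    for epoch in range(5):                  # short (~30-minute) trial budget
        val_loss = 1.0 / (epoch + 1) + lr   # stub for train_one_epoch + evaluate
        trial.report(val_loss, step=epoch)
        if trial.should_prune():            # pruning favors fast early convergence
            raise optuna.TrialPruned()
    return val_loss

study = optuna.create_study(direction="minimize", pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=20)

# Full (~24-hour) run: restart the LR schedule rather than stretching the tuned
# short-trial schedule, as the post proposes.
model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=study.best_params["lr"])
sched = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(opt, T_0=5)
```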

Anthropic Blog · 1d ago · 5 · prompt engineering deployment

Anthropic outlines their approach to ensuring Claude provides balanced, accurate responses on political topics through constitutional training, system prompts, and evaluation metrics (95-96% impartiality scores on Opus/Sonnet). The post covers usage policies aimed at preventing election interference, along with abuse-detection mechanisms, but focuses primarily on governance and responsible AI deployment rather than technical implementation details.

r/MachineLearning · 1d ago · 7 · open source model tool deployment inference

BloomshotNet is an open-source YOLO-based blood detection model for content moderation, released with 23k+ annotated images, pre-trained weights (small/nano variants), and a CLI tool achieving ~0.8 precision at 40+ FPS on CPU. The technical deep-dive covers why specialized vision models outperform open-vocabulary approaches for this task, practical insights on video-level detection strategies, and architectural trade-offs between YOLO and transformers.
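
On the video-level point, a decision rule like the one sketched below (require several consecutive confident frames) is one common way to suppress single-frame false positives; the weights filename and thresholds are placeholders, not the project's released defaults.

```python
from ultralytics import YOLO

# Flag the clip only when several consecutive frames contain a confident
# detection, which suppresses single-frame false positives.
model = YOLO("bloomshotnet-nano.pt")  # placeholder weights filename

consecutive, flagged = 0, False
for result in model.predict("clip.mp4", stream=True, conf=0.25):
    hit = len(result.boxes) > 0 and float(result.boxes.conf.max()) >= 0.5
    consecutive = consecutive + 1 if hit else 0
    if consecutive >= 5:   # require ~5 consecutive positive frames
        flagged = True
        break

print("flag for human review" if flagged else "clean")
```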

r/LocalLLaMA · 1d ago · 8 · benchmark inference quantization

Empirical study showing KV cache quantization (q8_0, q4_0) has a significant, model-dependent quality impact, contrary to the conventional wisdom that q8_0 is "practically lossless." Gemma models show substantial degradation (KL 0.108-0.377 at q8_0) while Qwen remains robust (KL <0.04); the detailed methodology uses KL divergence measured over 250K tokens spanning 6 task categories, enabling engineers to make informed quantization trade-off decisions.
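
The core measurement is an averaged token-level KL divergence between the full-precision and quantized-cache runs; a minimal sketch of that computation, with random stand-ins for the real per-token logits, is:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_kl(p_ref: np.ndarray, p_quant: np.ndarray, eps: float = 1e-12) -> float:
    """Average D_KL(p_ref || p_quant) over tokens; inputs are (tokens, vocab)."""
    kl = np.sum(p_ref * (np.log(p_ref + eps) - np.log(p_quant + eps)), axis=-1)
    return float(kl.mean())

# Random stand-ins for per-token distributions from an fp16 run vs a
# quantized-KV run; real measurements use the model's logits at each position.
rng = np.random.default_rng(0)
ref_logits = rng.normal(size=(200, 4096))
quant_logits = ref_logits + rng.normal(scale=0.05, size=ref_logits.shape)
print(mean_kl(softmax(ref_logits), softmax(quant_logits)))
```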

r/MachineLearning · 1d ago · 7 · fine tuning research open source

A novel fine-tuning approach that reduces LLM hallucinations by training models to contrast correct answers against counterfactual "bad" continuations sampled from a frozen base model, using only ~10% of the training data while achieving competitive results (within 1 percentage point of SFT and 6 points of DPO) with consistent out-of-distribution generalization.
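
The post doesn't spell out the exact objective, but one plausible reading is a DPO-style contrastive loss in which the frozen base model supplies both the "bad" continuations and the reference log-probabilities; a toy sketch under that assumption:

```python
import torch
import torch.nn.functional as F

# Push the policy's log-likelihood of the correct answer above that of a "bad"
# continuation sampled from the frozen base model, DPO-style. `logp_*` are
# sequence log-probs summed over answer tokens; values below are toy stand-ins.
def contrastive_loss(logp_good_policy, logp_bad_policy,
                     logp_good_frozen, logp_bad_frozen, beta: float = 0.1):
    margin = beta * ((logp_good_policy - logp_good_frozen)
                     - (logp_bad_policy - logp_bad_frozen))
    return -F.logsigmoid(margin).mean()

loss = contrastive_loss(torch.tensor([-12.0]), torch.tensor([-9.0]),
                        torch.tensor([-11.0]), torch.tensor([-10.5]))
print(loss)
```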

r/LocalLLaMA · 1d ago · 7 · api update inference workflow

Anthropic identified and resolved three separate issues affecting Claude Code, the Claude Agent SDK, and Claude Cowork (not the API) that degraded response quality—including problematic high reasoning effort defaults, session state bugs, and context window mismanagement. These fixes shipped in v2.1.116 with process improvements to catch similar issues faster through better monitoring and testing.

r/MachineLearning · 1d ago · 7 · library open source tool

Rose is a new open-source PyTorch optimizer with a stateless design whose memory footprint is below that of 8-bit AdamW while maintaining fast convergence and good generalization. The author provides MNIST benchmarks and invites community testing, though evaluation on larger models and datasets would strengthen the contribution.

Simon Willison · 2d ago · 9 · new model open source inference benchmark

DeepSeek released V4-Pro and V4-Flash, massive open-weight MoE models with 1M token context at dramatically lower costs ($0.14-$3.48/M tokens vs $15-60 for frontier models). V4-Pro is now the largest open-weight model at 1.6T parameters, with major efficiency improvements (10-27% of V3.2's FLOPs and 7-10% of its KV cache at 1M context), making it practical for local deployment on high-end hardware.