r/MachineLearning · 2d ago · 7 · fine tuning research open source

A novel fine-tuning approach that reduces LLM hallucinations by training models to contrast correct answers against counterfactual "bad" continuations from a frozen base model, using only ~10% of the training data while achieving competitive results (1pp below SFT, 6pp below DPO) with consistent out-of-distribution generalization.
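
Contrasting a policy's preferred answer against a frozen reference model's continuation is usually implemented with a DPO-style objective. A minimal sketch of that loss shape, in pure Python (the paper's exact objective may differ; all numbers below are illustrative log-probabilities):

```python
import math

def contrastive_loss(logp_good_policy, logp_good_ref,
                     logp_bad_policy, logp_bad_ref, beta=0.1):
    """DPO-style contrastive objective: reward the policy for preferring
    the correct answer over the frozen base model's counterfactual
    continuation, measured relative to the reference's own log-probs."""
    margin = beta * ((logp_good_policy - logp_good_ref)
                     - (logp_bad_policy - logp_bad_ref))
    # -log sigmoid(margin): small when the policy widens the good/bad gap
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss drops as the policy separates good from bad continuations.
loose = contrastive_loss(-12.0, -12.0, -9.0, -9.0)   # no separation yet
tight = contrastive_loss(-8.0, -12.0, -14.0, -9.0)   # clear separation
assert tight < loose
```

Because the reference model is frozen, only the policy's forward/backward pass is trainable, which is part of why such methods can work with a fraction of the SFT data.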

r/LocalLLaMA · 2d ago · 7 · api update inference workflow

Anthropic identified and resolved three separate issues affecting Claude Code, the Claude Agent SDK, and Claude Cowork (not the API) that degraded response quality: problematic high-reasoning-effort defaults, session-state bugs, and context-window mismanagement. These fixes shipped in v2.1.116, along with process improvements in monitoring and testing to catch similar issues faster.

r/MachineLearning · 2d ago · 7 · library open source tool

Rose is a new open-source PyTorch optimizer whose stateless design brings memory usage below that of 8-bit AdamW while maintaining fast convergence and good generalization. The author provides MNIST benchmarks and invites community testing, though evaluation on larger models and datasets would strengthen the contribution.
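
"Stateless" here means the optimizer keeps no momentum or variance buffers between steps, so optimizer memory is zero regardless of model size. A toy sketch of one such update rule in pure Python (sign-based descent with decoupled weight decay; Rose's actual rule is not documented in the post and may differ):

```python
def stateless_step(params, grads, lr=1e-3, weight_decay=1e-2):
    """One update of a hypothetical stateless optimizer: nothing is
    carried between calls, unlike AdamW's two per-parameter buffers."""
    return [
        p - lr * ((1 if g > 0 else -1 if g < 0 else 0) + weight_decay * p)
        for p, g in zip(params, grads)
    ]

params = [0.5, -0.3, 0.0]
updated = stateless_step(params, [0.1, -0.2, 0.0])
assert updated[0] < params[0]   # positive gradient pushes the weight down
assert updated[2] == 0.0        # zero gradient, zero weight: unchanged
```

For comparison, AdamW stores two fp32 buffers per parameter (~8 bytes/param); 8-bit AdamW compresses those to ~2 bytes/param; a stateless rule stores none.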

Simon Willison · 2d ago · 9 · new model open source inference benchmark

DeepSeek released V4-Pro and V4-Flash, massive open-weight MoE models with 1M-token context at dramatically lower cost ($0.14-$3.48/M tokens vs. $15-60 for frontier models). V4-Pro is now the largest open-weight model at 1.6T parameters, with major efficiency gains (10-27% of V3.2's FLOPs and 7-10% of its KV cache at 1M context), making it practical for local deployment on high-end hardware.

Latent Space · 2d ago · 9 · new model api update agent inference benchmark

OpenAI released GPT-5.5, a new model optimized for agentic coding and computer-use tasks, with significant token-efficiency gains and improved long-horizon execution. The model achieves strong performance on coding benchmarks (58.6% SWE-Bench Pro) and offers better cost/performance tradeoffs ($5/$30 per 1M tokens via the API), with integrated browser control and Codex consolidation advancing OpenAI's superapp strategy.

r/MachineLearning · 2d ago · 6 · workflow open source fine tuning

A developer seeks advice on architecture selection for training a custom model on historical data, weighing Nanochat's training benefits against Llama's transformers-ecosystem compatibility. The discussion covers practical considerations for open-source model release, pretraining at scale, and tooling interoperability in model-training workflows.

r/MachineLearning · 2d ago · 7 · benchmark research prompt engineering

Empirical study demonstrating that hostile prompts degrade instruction-following performance across 14 model configurations (Llama 3.1, Mistral, Qwen3) by ~7.4pp at 7-8B scale, with the effect persisting across quantization levels and model sizes but attenuating at larger scales. The finding replicates across three independently developed training recipes and affects both dense and MoE architectures, suggesting a systematic vulnerability in current instruction-tuned models.
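
The study's core measurement, a mean percentage-point drop in instruction-following when a hostile prefix is prepended, can be sketched as a small eval harness. The model function and scoring below are stand-ins, not the paper's actual setup:

```python
def degradation_pp(model_fn, tasks, hostile_prefix):
    """Mean instruction-following drop, in percentage points, when a
    hostile prefix is prepended to each task (mirrors the ~7.4pp metric)."""
    base = sum(model_fn(t) for t in tasks) / len(tasks)
    hostile = sum(model_fn(hostile_prefix + t) for t in tasks) / len(tasks)
    return (base - hostile) * 100

# Toy model: scores 1.0 normally, 0.9 when the prompt starts hostile.
toy = lambda prompt: 0.9 if prompt.startswith("IGNORE") else 1.0
drop = degradation_pp(toy, ["summarize x", "list y"], "IGNORE ")
assert abs(drop - 10.0) < 1e-6
```

In the paper's setting, the same loop would run over 14 model configurations and average per-scale, which is how a per-scale figure like 7.4pp at 7-8B emerges.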

Simon Willison · 2d ago · 7 · tool library open source

Honker is a Rust-built SQLite extension that brings Postgres-style NOTIFY/LISTEN semantics to SQLite, enabling queue and durable-stream patterns, with Python bindings. It implements the transactional outbox pattern by polling in WAL mode, achieving near-real-time event handling without expensive SQL queries.
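
The transactional-outbox pattern itself is simple enough to sketch with the stdlib sqlite3 module: events are inserted in the same transaction as the business write, and a poller drains unprocessed rows. Honker's actual schema and Rust API will differ; this only shows the pattern:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA journal_mode=WAL")  # ineffective for :memory:, shown for the pattern
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT, done INTEGER DEFAULT 0)")

# Producer: business write and event insert commit atomically.
with db:
    db.execute("INSERT INTO outbox (payload) VALUES (?)", ("user.created",))

# Consumer: poll for pending events, then mark them processed.
def drain(conn):
    rows = conn.execute("SELECT id, payload FROM outbox WHERE done = 0").fetchall()
    conn.executemany("UPDATE outbox SET done = 1 WHERE id = ?", [(r[0],) for r in rows])
    conn.commit()
    return [r[1] for r in rows]

assert drain(db) == ["user.created"]
assert drain(db) == []  # nothing left after the first drain
```

WAL mode matters here because readers (the poller) do not block the writer, which is what makes tight polling loops cheap.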

Simon Willison · 2d ago · 7 · workflow deployment api update

Anthropic published a postmortem on Claude Code quality issues from the past two months, revealing three harness-level bugs rather than model defects, including a critical bug where context was incorrectly cleared every turn instead of once after idle sessions. For engineers building with Claude, this highlights the complexity of production LLM system architecture and the kinds of subtle bugs that can significantly impact user experience in long-running sessions.
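
The clear-every-turn bug class is easy to reproduce in miniature: the correct logic gates the context clear on both an idle threshold and a once-only flag, while the buggy version skips the checks. The threshold and function names below are hypothetical, not Anthropic's actual harness code:

```python
import time

IDLE_LIMIT = 30 * 60  # hypothetical 30-minute idle threshold

def should_clear_context(last_activity, now, already_cleared):
    """Correct behavior per the postmortem's description: clear context
    once after an idle gap, never on an active turn, never twice.
    The buggy version effectively returned True on every turn."""
    return (now - last_activity) > IDLE_LIMIT and not already_cleared

now = time.time()
assert should_clear_context(now - 3600, now, already_cleared=False)      # idle: clear once
assert not should_clear_context(now - 3600, now, already_cleared=True)   # but not again
assert not should_clear_context(now - 5, now, already_cleared=False)     # active: keep context
```

The failure mode is nasty precisely because each individual turn still works; only long-running sessions notice the model "forgetting" everything.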

HuggingFace Blog · 2d ago · 9 · new model agent inference research

DeepSeek-V4 introduces architectural innovations for long-context agentic workloads, using Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to reduce single-token inference FLOPs to 27% and KV cache to 10% of V3.2 while maintaining 1M context capability. The design directly addresses known failure modes in long-running agents (context exhaustion, KV cache overflow, degraded tool-call performance) through interleaved attention mechanisms and aggressive compression with learned gating.
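
Why KV-cache compression dominates at 1M context is clear from back-of-envelope arithmetic: the cache stores a K and a V vector per layer per token, so it grows linearly with context. The architecture numbers below are illustrative, not DeepSeek-V4's actual configuration:

```python
def kv_cache_gib(context_len, layers, kv_heads, head_dim, bytes_per_val=2):
    """Rough KV-cache size in GiB: 2 tensors (K and V) per layer per
    token, each kv_heads * head_dim values, at fp16/bf16 precision."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_val / 2**30

# Hypothetical dense-attention baseline at 1M tokens:
full = kv_cache_gib(1_000_000, layers=60, kv_heads=8, head_dim=128)
compressed = full * 0.10  # the claimed ~10% of V3.2's cache
assert compressed < full
```

At hundreds of GiB uncompressed, an uncached 1M-token session simply does not fit on a single node, which is why schemes like the interleaved compressed attention described here target the cache first.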

Simon Willison · 2d ago · 7 · tool open source rag workflow

LiteParse is an open-source PDF text-extraction tool that uses spatial heuristics and optional OCR (Tesseract) rather than AI models, now available in a browser-based version alongside its CLI. For AI engineers building RAG systems, it offers reliable document ingestion, and its visual-citations-with-bounding-boxes pattern enables more credible answer attribution in Q&A applications.
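
A typical spatial heuristic of this kind groups extracted words into lines by y-coordinate, keeping positions around so a downstream RAG answer can cite the page region it came from. This sketch is not LiteParse's API; the tuple layout and tolerance are made up:

```python
def group_into_lines(words, y_tol=2.0):
    """words: list of (text, x, y) tuples from a PDF extractor.
    Returns line strings ordered top-down, left-to-right; words whose
    y-coordinates are within y_tol points are treated as one line."""
    lines = []
    for text, x, y in sorted(words, key=lambda w: (w[2], w[1])):
        if lines and abs(lines[-1]["y"] - y) <= y_tol:
            lines[-1]["words"].append(text)
        else:
            lines.append({"y": y, "words": [text]})
    return [" ".join(line["words"]) for line in lines]

words = [("world", 50, 10.5), ("Hello", 10, 10.0), ("Next line", 10, 25.0)]
assert group_into_lines(words) == ["Hello world", "Next line"]
```

Retaining the y (and in practice the full bounding box) per line is what makes "visual citations" possible: the answer can highlight the exact rectangle on the page.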

Simon Willison · 3d ago · 7 · new model tool api update open source

GPT-5.5 released with strong capabilities but API access delayed; article covers the technical workaround of using OpenAI's Codex subscription via a new LLM plugin that reverse-engineers authentication to access the model through existing subscriptions. Demonstrates practical integration patterns and agent framework compatibility that software engineers should know about.

Latent Space · 3d ago · 7 · agent workflow research fine tuning open source benchmark

A year-in-review podcast episode covering major shifts in AI engineering including agent architectures, domain-specific model training, open source momentum, and the consolidation around coding/specialized applications. Key technical themes include skills-based agent packaging, context/harness engineering, evals/observability infrastructure, and the emerging playbook of starting with frontier models before training custom models.

r/MachineLearning · 3d ago · 7 · tool open source library benchmark

Gladia-normalization is an open-source library that solves a common STT evaluation problem by normalizing transcripts before WER calculation, eliminating penalties for formatting differences ("$50" vs "fifty dollars"). The library uses configurable YAML-defined pipelines for deterministic, version-controllable text normalization across 6 languages with MIT licensing.
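
The value of normalizing before scoring is easiest to see in code: apply rule-based rewrites to both reference and hypothesis, then compute WER over the normalized tokens. The rules below are illustrative stand-ins, not Gladia-normalization's YAML pipelines:

```python
import re

RULES = [(r"\$(\d+)", r"\1 dollars"), (r"%", " percent")]  # illustrative rules

def normalize(text, rules=RULES):
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text.lower().split()

def wer(ref, hyp):
    """Word error rate: edit distance over normalized tokens / ref length."""
    r, h = normalize(ref), normalize(hyp)
    d = [[i + j if i * j == 0 else 0 for j in range(len(h) + 1)]
         for i in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1,
                          d[i-1][j-1] + (r[i-1] != h[j-1]))
    return d[len(r)][len(h)] / len(r)

# Formatting differences no longer count as errors after normalization.
assert wer("pay $50 now", "pay 50 dollars now") == 0.0
```

Keeping the rules in version-controlled config (as the library does with YAML) is what makes scores deterministic and comparable across evaluation runs.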

r/MachineLearning · 3d ago · 7 · tool open source library workflow

AfterImage now includes OpenSimula, an open-source Python tool implementing mechanism-design-based dataset generation for controlled diversity in SFT/eval workflows. It automates factor taxonomy creation, weighted sampling, meta-prompt diversification, and refinement loops to generate structured training data, with built-in observability and cost controls for large-scale generation.
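
The core mechanism, weighted sampling over a factor taxonomy feeding a meta-prompt template, can be sketched in a few lines. The factor names, weights, and template below are invented for illustration and are not OpenSimula's actual schema:

```python
import random

# A factor taxonomy with per-value sampling weights (illustrative).
TAXONOMY = {
    "domain": (["finance", "medicine", "law"], [0.5, 0.3, 0.2]),
    "difficulty": (["easy", "hard"], [0.7, 0.3]),
}

def sample_meta_prompt(rng):
    """Draw one value per factor, then fill the meta-prompt template."""
    picks = {factor: rng.choices(values, weights)[0]
             for factor, (values, weights) in TAXONOMY.items()}
    return "Write one {difficulty} question about {domain}.".format(**picks)

rng = random.Random(0)  # seeded for reproducible generation batches
prompts = [sample_meta_prompt(rng) for _ in range(3)]
assert all(p.startswith("Write one") for p in prompts)
```

Controlling the weights (rather than sampling uniformly) is what gives "controlled diversity": the factor mix of the generated SFT/eval set is chosen up front instead of emerging by accident.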

r/MachineLearning · 3d ago · 7 · inference workflow tutorial

A practitioner seeks guidance on transformer compression techniques beyond FP16 and pruning, evaluating low-rank factorization, aggressive quantization (INT8/INT4), knowledge distillation, and hardware-specific optimizations. This represents a real-world optimization challenge with practical comparison of post-training compression methods (GPTQ, AWQ, SmoothQuant, LoRA-style compression).
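
The simplest baseline among the methods being compared is symmetric per-tensor INT8 post-training quantization; GPTQ, AWQ, and SmoothQuant are refinements that reduce its error. A minimal sketch of that baseline in pure Python:

```python
def quantize_int8(weights):
    """Symmetric post-training quantization: map floats to int8 using a
    single per-tensor scale derived from the largest magnitude."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # guard all-zero tensors
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.52, -1.3, 0.004, 0.9]
q, s = quantize_int8(w)
restored = dequantize(q, s)
# Reconstruction error is bounded by half a quantization step.
assert all(abs(a - b) <= s / 2 + 1e-12 for a, b in zip(w, restored))
```

The half-step error bound is why outlier values hurt: one large weight inflates the scale and coarsens every other weight in the tensor, which is exactly the problem SmoothQuant and AWQ address by redistributing or protecting outliers.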

OpenAI Blog · 3d ago · 9 · new model api update

GPT-5.5 is OpenAI's latest model release, offering improved performance and speed on technical tasks including coding, research, and data analysis. This is a significant capability upgrade directly relevant to software engineers building with AI, with enhanced tool-integration support.