News Nug

[R] What 1000+ Harness Experiments Taught Me About Self-Improving Agents

r/MachineLearning · 1h ago · 7 · agent research workflow

A systems-focused writeup on building self-improving AI agent harnesses for benchmark tasks, exploring the challenge of safely compounding agent-proposed improvements and parallels to coding-agent customization patterns. The author shares both successful and failed approaches to implementing continuous self-improvement loops, offering practical insights for engineers building autonomous improvement systems.

noisekit - CLI for generating realistic degraded speech datasets for ASR benchmarking [P]

r/MachineLearning · 5h ago · 8 · tool open source benchmark workflow

noisekit is an open-source tool that generates realistic degraded audio datasets from clean annotated speech data, enabling accurate STT vendor benchmarking under production conditions (phone noise, codecs, reverb). It fills a critical gap for voice agent builders by providing WER-measurable datasets that approximate real-world phone call audio rather than relying on clean studio recordings.
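
For readers unfamiliar with the metric, word error rate (WER) can be sketched as a word-level edit distance divided by the reference length. This is a generic illustration, not noisekit's own scoring code; the function name `wer` and the whitespace tokenization are assumptions made for this example.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words.

    Illustrative only: tokenizes on whitespace and applies no text
    normalization (casing, punctuation), which real benchmarks usually do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    clean = "the cat sat on the mat"
    degraded = "the cat sat on mat"  # one dropped word
    print(f"WER: {wer(clean, degraded):.3f}")
```

Comparing the same vendor's WER on clean versus degraded versions of one transcript set is the kind of measurement such datasets are meant to support.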

Profiling PyTorch training without accidentally stalling the GPU [D]

r/MachineLearning · 6h ago · 7 · tutorial workflow benchmark

Explores the measurement paradox in PyTorch training profiling where synchronization calls can distort performance results, presenting CUDA events as a lightweight alternative to capture timing without forcing synchronization in the hot path. Useful as a first-pass profiling technique before deeper operator-level analysis with PyTorch Profiler or Nsight.
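
A minimal sketch of the CUDA-event pattern, under stated assumptions: `time_steps` and `step_fn` are hypothetical names, not taken from the post, and the wall-clock fallback exists only so the sketch also runs on machines without PyTorch or a GPU. The idea is to record events inside the loop and synchronize once afterwards.

```python
import time

try:
    import torch
    HAS_CUDA = torch.cuda.is_available()
except ImportError:  # lets the sketch run where PyTorch is not installed
    torch = None
    HAS_CUDA = False


def time_steps(step_fn, n_steps):
    """Return per-step durations in milliseconds for n_steps calls of step_fn."""
    if not HAS_CUDA:
        # CPU fallback: plain wall-clock timing, no GPU queue involved.
        durations = []
        for _ in range(n_steps):
            t0 = time.perf_counter()
            step_fn()
            durations.append((time.perf_counter() - t0) * 1000.0)
        return durations

    # Record a start/end event pair around each step. Recording only
    # enqueues a marker on the current stream; it does not block the host.
    events = []
    for _ in range(n_steps):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        step_fn()
        end.record()
        events.append((start, end))

    # Synchronize once, outside the hot loop, then read the timings.
    torch.cuda.synchronize()
    return [start.elapsed_time(end) for start, end in events]


if __name__ == "__main__":
    timings = time_steps(lambda: sum(range(10_000)), n_steps=5)
    print([f"{ms:.3f} ms" for ms in timings])
```

In a real training loop `step_fn` would wrap the forward, backward, and optimizer calls; the single `torch.cuda.synchronize()` after the loop replaces the per-step synchronization that distorts measurements.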

Building self-improving tax agents with Codex

OpenAI Blog · 11h ago · 6 · agent workflow api update

Case study on building a self-improving tax agent using OpenAI's Codex for automating tax filings and improving accuracy through iterative refinement. Demonstrates practical application of code generation models to domain-specific workflow automation.

Reachy Mini goes fully local

HuggingFace Blog · 18h ago · 8 · tutorial tool open source workflow inference deployment

Technical guide to building a fully local speech-to-speech pipeline (VAD → STT → LLM → TTS) on the Reachy Mini robot using open-source tools such as llama.cpp, Parakeet, and Qwen3TTS. Shows how to run conversational AI without cloud dependencies, with modular components that can be swapped and customized to trade latency against quality.

Shipping a Trillion Parameters With a Hub Bucket: Delta Weight Sync in TRL

HuggingFace Blog · 18h ago · 8 · tool workflow open source research inference

A practical open-source solution for efficient async RL training using sparse weight deltas instead of full model snapshots, reducing synchronization overhead from ~1TB to ~20GB per checkpoint. The approach leverages bf16 arithmetic properties where 98% of weights remain bit-equivalent between steps, enabling asynchronous weight distribution via shared storage (S3) without direct trainer-inference connectivity.
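
The sparse-delta idea can be illustrated with a toy sketch. This is not TRL's implementation: the helper names are hypothetical, weights are plain Python floats, and bf16 conversion is done by truncation (real conversions typically round to nearest even). It shows why a small update can leave a bf16 weight bit-identical, and how only the changed entries need to be shipped.

```python
import struct


def to_bf16_bits(x: float) -> int:
    """bfloat16 bit pattern of x, taken as the top 16 bits of its float32 encoding.

    Simplification: truncates instead of rounding to nearest even.
    """
    return struct.unpack(">I", struct.pack(">f", x))[0] >> 16


def sparse_delta(old_bits, new_bits):
    """Map index -> new bit pattern, only for weights whose bits changed."""
    return {i: n for i, (o, n) in enumerate(zip(old_bits, new_bits)) if o != n}


def apply_delta(bits, delta):
    """Reconstruct the new weights from the old ones plus a sparse delta."""
    out = list(bits)
    for index, value in delta.items():
        out[index] = value
    return out


if __name__ == "__main__":
    old = [to_bf16_bits(w) for w in [1.0, 0.5, -2.0, 3.0]]
    # The first weight moves by 1e-4, below bf16 resolution near 1.0,
    # so its bit pattern is unchanged; only the last weight differs.
    new = [to_bf16_bits(w) for w in [1.0 + 1e-4, 0.5, -2.0, 3.5]]

    delta = sparse_delta(old, new)
    print(f"{len(delta)} of {len(old)} weights changed: {delta}")
    assert apply_delta(old, delta) == new
```

An inference worker holding the previous weights only needs to fetch the delta from shared storage and apply it, which is where the reduction in transferred bytes comes from.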

[D] Where do you go for serious AI research discussion online?

r/MachineLearning · 1d ago · 6 · workflow

A Reddit discussion asking for ML/AI community recommendations focused on deep technical work—papers, training dynamics, model debugging, and infrastructure challenges rather than LLM API projects. The post seeks spaces for sharing specific technical problems (e.g., anomalies in SSL training) and receiving substantive expert feedback.

Reconstructing the agent methodology: Decoupling decision-making and execution - open source [P]

r/MachineLearning · 2d ago · 8 · open source agent workflow tool

Spice is an open-source decision layer framework that sits above execution agents to make agent decision-making explicit and interpretable. It captures what was observed, options considered, reasoning for selection, trade-offs rejected, and execution outcomes—addressing a key gap where agents excel at execution but lack transparent decision-making processes. The project is early-stage but functional, installable, and designed to work with existing agents like Claude Code and other tools.

Harness, Scaffold, and the AI Agent Terms Worth Getting Right

HuggingFace Blog · 2d ago · 7 · agent workflow

A practical glossary clarifying commonly confused terminology in AI agent development (model, scaffold, harness, tool definitions) with examples from frameworks like Claude Code and Codex. Provides mental models for understanding agent architecture that are useful when building or deploying agentic systems; it is a reference, not a technical tutorial.

MergeNB: An intuitive merge conflict resolver built for Jupyter notebooks in VS Code [P]

r/MachineLearning · 2d ago · 7 · tool open source workflow

MergeNB is a VS Code extension that improves Jupyter Notebook merging for collaborative workflows, addressing pain points with existing tools like nbdime. The tool features a web UI and plans to expand as a git mergetool, offering practical improvements for teams managing notebook-based research and development.

How do ML practitioners select hyperparameters, architectures, etc for self-supervised representation learning when the loss is non-monotonic? [D]

r/MachineLearning · 2d ago · 5 · research workflow

This is a technical discussion about evaluating self-supervised learning (SSL) methods like BYOL and JEPA, questioning whether the RankMe metric (embedding effective rank via SVD) remains meaningful as an evaluation criterion when incorporated as a loss term during training. The post explores the tension between using metrics to assess learning quality versus explicitly optimizing them, relevant for practitioners evaluating SSL model representations.
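
As a reference point for the discussion, RankMe is commonly described as the exponential of the entropy of the normalized singular values of an embedding matrix. The sketch below assumes NumPy is available; the function name `rankme` and the `eps` default are choices made for this example, not taken from the post.

```python
import numpy as np


def rankme(embeddings: np.ndarray, eps: float = 1e-7) -> float:
    """Effective rank of an (n_samples, dim) embedding matrix.

    Computes exp(H(p)), where p is the singular-value spectrum
    normalized to sum to one (plus eps for numerical stability).
    """
    singular_values = np.linalg.svd(embeddings, compute_uv=False)
    p = singular_values / singular_values.sum() + eps
    entropy = -np.sum(p * np.log(p))
    return float(np.exp(entropy))


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Collapsed representation: every embedding lies along one direction.
    direction = rng.normal(size=(1, 16))
    collapsed = rng.normal(size=(256, 1)) @ direction
    print(f"collapsed: {rankme(collapsed):.2f}")

    # Well-spread representation: independent Gaussian features.
    spread = rng.normal(size=(256, 16))
    print(f"spread:    {rankme(spread):.2f}")
```

The score ranges from about 1 (fully collapsed) up to the embedding dimension. Because it depends only on the singular-value spectrum, a loss term that directly flattens the spectrum can raise the score without the features becoming more useful, which is the concern the post raises.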

Quoting Armin Ronacher

Simon Willison · 2d ago · 6 · prompt engineering workflow

Armin Ronacher discusses a growing problem in open-source development where AI-generated issue reports obscure actual user observations with confident but often inaccurate interpretations, making debugging harder. The post highlights practical friction when LLMs are used to process and reword user problems without preserving the original observed facts.

Vision-capable LLMs vs. OCR for long-document (including charts, images, tables, etc.) QA [D]

r/MachineLearning · 3d ago · 8 · benchmark rag inference workflow

Comprehensive benchmark comparing vision-capable LLMs (native PDF) against OCR-based RAG pipelines on long document processing, showing OCR approaches achieve higher accuracy (59.6% vs 52.0%) and lower cost ($0.19 vs $0.25/query) despite the 'vision makes OCR obsolete' narrative. Key findings: vision LLMs struggle with tables/charts, have a 7% failure rate on large PDFs that survives retries, while premium OCR + layout extraction proves more robust for document-heavy workloads.

pipeline is really slow - consulting [D]

r/MachineLearning · 4d ago · 6 · workflow inference benchmark

A software engineer is debugging significant training bottlenecks in a robotics imitation learning pipeline (ResNet18 + DiT policy, 50M params) that runs at 10 iterations/sec with low GPU utilization despite high CPU usage. The profiler data suggests dataloader and optimizer operations consume 62%+ of the time, pointing to CPU-GPU synchronization issues, an inefficient data pipeline, or framework overhead rather than a compute-bound workload.

Is personalized AI memory actually a problem worth solving or am I just coping [D]

r/MachineLearning · 4d ago · 6 · rag workflow prompt engineering

Reddit discussion proposing a personalized cognitive profiling system that tracks not just facts but learning patterns, struggling points, and effective explanation styles to improve LLM context retrieval over time. The idea combines dynamic profiling with RAG-like personalization to create an evolving understanding of how individual users think, rather than basic chat memory.

Spice: We built an open-sourced decision layer that sits above your AI agents (controls agent actions before execution) [P]

r/MachineLearning · 4d ago · 7 · open source agent tool workflow

Spice is an open-source decision layer framework that sits above execution agents, providing context-aware task routing and decision-making through a perception → simulation → decision → execution → reflection loop. Rather than replacing agents like Claude or Codex, it adds orchestration capabilities including state modeling, option simulation, and outcome reflection to coordinate multi-agent workflows.

Tested chunking + embeddings data from 3 production websites. [P]

r/MachineLearning · 4d ago · 7 · rag workflow benchmark

This post demonstrates practical RAG optimization techniques including tiered retrieval scoring, corpus-quality awareness metrics, and empirical results across three real-world datasets with varying content density. The author introduces a 'yield score' metric to predict generation quality and notes that semantic relevance still performs reasonably well even on thin, positioning-heavy corpora—a pattern RAG benchmarks typically don't account for.

[AINews] All Model Labs are now Agent Labs

Latent Space · 4d ago · 6 · agent workflow api update

Describes an industry shift from models as the primary product to agents as integrated systems that combine models, harnesses, UI, and workflows. Major players (OpenAI, AI21, DeepSeek) are building dedicated agent teams and reducing their focus on standalone models; concrete shipping examples such as OpenAI's Codex updates and Claude's auto-mode expansion show product differentiation moving beyond model quality alone.

Custom image encoder [P]

r/MachineLearning · 4d ago · 6 · inference deployment workflow

Discussion of whether to build a custom lightweight image encoder for video frame classification instead of using foundation models like CLIP/DINO, with focus on CPU inference speed and deployment constraints. The poster describes a practical pipeline processing video streams through embeddings into a small transformer, seeking guidance on whether custom training on domain-specific data (few million images, 4-5 labels) would improve both speed and accuracy versus established encoders.

One thing that's been bothering me lately: benchmark performance often tells me almost nothing about whether a workflow will survive production usage. [D]

r/MachineLearning · 5d ago · 7 · benchmark workflow agent

Community discussion identifying gaps between standard benchmarks and real-world AI system robustness, particularly around ambiguous intent, context handling, and multi-turn sessions. Highlights the disconnect between optimizing for clean evaluation metrics versus building production-resilient systems.