News Nug

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM

HuggingFace Blog · 50m ago · 5

What 1000+ Harness Experiments Taught Me About Self-Improving Agents [R]

r/MachineLearning · 1h ago · 7 · agent research workflow

A systems-focused writeup on building self-improving AI agent harnesses for benchmark tasks, exploring the challenge of safely compounding agent-proposed improvements and parallels to coding-agent customization patterns. The author shares both successful and failed approaches to implementing continuous self-improvement loops, offering practical insights for engineers building autonomous improvement systems.

noisekit - CLI for generating realistic degraded speech datasets for ASR benchmarking [P]

r/MachineLearning · 5h ago · 8 · tool open source benchmark workflow

noisekit is an open-source tool that generates realistic degraded audio datasets from clean annotated speech data, enabling accurate STT vendor benchmarking under production conditions (phone noise, codecs, reverb). It fills a critical gap for voice agent builders by providing WER-measurable datasets that approximate real-world phone call audio rather than relying on clean studio recordings.
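For readers unfamiliar with the metric, WER (word error rate) is the word-level edit distance between a reference transcript and the STT output, divided by the number of reference words. The sketch below is a generic illustration and does not use noisekit's own API; the function name and the sample transcripts are made up for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # reference word missing from hypothesis
                curr[j - 1] + 1,         # extra word in hypothesis
                prev[j - 1] + (r != h),  # substitution, or free match
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts: the same utterance decoded from clean and degraded audio.
reference = "please transfer me to billing"
clean_wer = word_error_rate(reference, "please transfer me to billing")
noisy_wer = word_error_rate(reference, "please transfer to building")
print(clean_wer, noisy_wer)
```

Comparing the same reference against transcripts from clean and degraded copies of the audio gives the per-condition WER gap that this kind of benchmark is meant to expose.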

EMA-Gated Temporal Sequence Compression in Vision Transformers [P]

r/MachineLearning · 5h ago · 8 · research inference open source library

NeuroFlow is a training-free dynamic routing framework for Vision Transformers that achieves 55.8× wall-clock speedup on high-res video inference by eliminating redundant tokens via semantic surprise tracking in embedding space. The approach uses a dual-memory architecture with retinal gating and cortical caching to maintain 97%+ fidelity while achieving extreme sparsity (84% token reduction), with code and paper publicly available.
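The general idea of gating tokens on "surprise" relative to an exponential moving average (EMA) can be sketched in a few lines. This is not NeuroFlow's implementation: the class name, the cosine-distance surprise measure, and the `alpha` and `threshold` values are all assumptions chosen for illustration.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; treats a zero vector as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return 1.0 - dot / (na * nb)


class EmaGate:
    """Hypothetical gate: a token is recomputed only when it drifts from its EMA."""

    def __init__(self, alpha=0.1, threshold=0.05):
        self.alpha = alpha          # EMA update rate (assumed value)
        self.threshold = threshold  # surprise cutoff (assumed value)
        self.ema = {}               # token position -> EMA embedding

    def select(self, frame):
        """Return the token positions that should be processed for this frame."""
        active = []
        for pos, emb in enumerate(frame):
            mem = self.ema.get(pos)
            if mem is None or cosine_distance(emb, mem) > self.threshold:
                active.append(pos)
            if mem is None:
                self.ema[pos] = list(emb)
            else:
                self.ema[pos] = [
                    (1 - self.alpha) * m + self.alpha * e for m, e in zip(mem, emb)
                ]
        return active


gate = EmaGate()
frame_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
frame_b = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]  # only token 1 changed
first = gate.select(frame_a)   # nothing cached yet, so every token is active
second = gate.select(frame_b)  # static tokens are skipped
print(first, second)
```

In a real pipeline the skipped positions would reuse cached activations; that caching layer is omitted here.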

Cross-species RSA: same learning rules (BP, PC, STDP, FA) tested against both human fMRI and macaque electrophysiology [P]

r/MachineLearning · 6h ago · 7 · research benchmark open source

Cross-species neuroscience study comparing learning rules (BP, FA, PC, STDP) across human fMRI and macaque electrophysiology (V1/V2/V4/IT), finding that early visual alignment is conserved but IT alignment scales with model capacity rather than learning rule. Includes careful controls for stimulus confounds and capacity baselines, with code and companion papers provided.

Profiling PyTorch training without accidentally stalling the GPU [D]

r/MachineLearning · 6h ago · 7 · tutorial workflow benchmark

Explores the measurement paradox in PyTorch training profiling where synchronization calls can distort performance results, presenting CUDA events as a lightweight alternative to capture timing without forcing synchronization in the hot path. Useful as a first-pass profiling technique before deeper operator-level analysis with PyTorch Profiler or Nsight.
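A minimal sketch of the CUDA-event pattern, assuming PyTorch with a CUDA device: events are recorded around each step and read back after a single synchronization at the end, so the loop itself never blocks on the GPU. `step_fn`, `time_steps`, and `summarize` are hypothetical names for this example, not taken from the post.

```python
import statistics


def summarize(times_ms):
    """Reduce per-step timings (milliseconds) to a small report."""
    return {"mean_ms": statistics.fmean(times_ms), "max_ms": max(times_ms)}


def time_steps(step_fn, n_steps):
    """Time n_steps calls of step_fn on the GPU without syncing inside the loop."""
    import torch  # imported lazily so summarize() works without PyTorch installed

    starts = [torch.cuda.Event(enable_timing=True) for _ in range(n_steps)]
    ends = [torch.cuda.Event(enable_timing=True) for _ in range(n_steps)]
    for start, end in zip(starts, ends):
        start.record()  # enqueued on the current stream; does not block the host
        step_fn()       # hypothetical training step (forward, backward, optimizer)
        end.record()
    torch.cuda.synchronize()  # one sync, after the hot path has finished
    return [start.elapsed_time(end) for start, end in zip(starts, ends)]


# Example with made-up timings; on a GPU machine use summarize(time_steps(step, 100)).
print(summarize([1.0, 3.0]))
```

The contrast is with calling `torch.cuda.synchronize()` around every step to use a host-side timer, which serializes host and device and can itself change the numbers being measured.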

Building self-improving tax agents with Codex

OpenAI Blog · 11h ago · 6 · agent workflow api update

Case study on building a self-improving tax agent using OpenAI's Codex for automating tax filings and improving accuracy through iterative refinement. Demonstrates practical application of code generation models to domain-specific workflow automation.

A Tiny Open-Source Self-Driving AI That Runs on a Phone [P]

r/MachineLearning · 12h ago · 7 · open source inference deployment

Open-source 7MB autonomous driving model that learns visual navigation, lane following, and drift recovery for edge deployment on lightweight hardware. Demonstrates practical real-time inference optimization for complex perception tasks without cloud infrastructure, valuable for understanding model compression and embedded AI systems.

GNN Model For Fraud Detection Isn't Performing Well [R]

r/MachineLearning · 13h ago · 6 · research benchmark

A researcher shares a struggling GNN implementation for fraud detection on the IEEE-CIS dataset, reporting AUC 0.87 and PR-AUC 0.52 across several architectures (GCN, GraphSAGE, GAT). The post is practical ML engineering content with specific technical challenges but no novel insights; it is most useful as a debugging case study and as a list of pitfalls to avoid.
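The gap between the two reported metrics is typical of heavily imbalanced data, where ROC-AUC can look strong while precision-recall metrics stay low. The toy example below uses synthetic labels and scores (not the IEEE-CIS data) and average precision as a common summary of the PR curve; both function names are introduced here for illustration.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive")
    return total / hits


# Synthetic case: 100 transactions, one fraud, ranked 5th by the model.
scores = [100 - i for i in range(100)]
labels = [1 if i == 4 else 0 for i in range(100)]
auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
print(round(auc, 3), round(ap, 3))
```

Four false alarms ahead of the single fraud case barely dent ROC-AUC (95 of 99 negatives are still ranked below it) but cut precision at that rank to one in five.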

Reachy Mini goes fully local

HuggingFace Blog · 18h ago · 8 · tutorial tool open source workflow inference deployment

Technical guide for building a fully local speech-to-speech pipeline (VAD → STT → LLM → TTS) with Reachy Mini robot using open-source tools like llama.cpp, Parakeet, and Qwen3TTS. Demonstrates how to run conversational AI systems without cloud dependencies, with modular component swapping and customization for latency/quality tradeoffs.

Shipping a Trillion Parameters With a Hub Bucket: Delta Weight Sync in TRL

HuggingFace Blog · 18h ago · 8 · tool workflow open source research inference

A practical open-source solution for efficient async RL training using sparse weight deltas instead of full model snapshots, reducing synchronization overhead from ~1TB to ~20GB per checkpoint. The approach leverages bf16 arithmetic properties where 98% of weights remain bit-equivalent between steps, enabling asynchronous weight distribution via shared storage (S3) without direct trainer-inference connectivity.
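The core idea can be illustrated with plain Python: if most weights are bit-identical between two steps, only the changed positions and their new values need to be shipped. This is a toy sketch, not TRL's implementation; weights are modeled as lists of raw 16-bit patterns, and `make_delta` and `apply_delta` are hypothetical helper names.

```python
def make_delta(old_bits, new_bits):
    """Map index -> new bit pattern for every weight whose bits changed."""
    if len(old_bits) != len(new_bits):
        raise ValueError("checkpoints must have the same number of weights")
    return {i: n for i, (o, n) in enumerate(zip(old_bits, new_bits)) if o != n}


def apply_delta(bits, delta):
    """Rebuild the new checkpoint from the old one plus a sparse delta."""
    out = list(bits)
    for index, value in delta.items():
        out[index] = value
    return out


# Toy checkpoints as raw bf16 bit patterns; only index 1 differs.
old = [0x3F80, 0x4000, 0xBF80, 0x0000]
new = [0x3F80, 0x4001, 0xBF80, 0x0000]

delta = make_delta(old, new)      # what the trainer would upload
restored = apply_delta(old, delta)  # what an inference worker would reconstruct
print(delta, restored == new)
```

Comparing bit patterns rather than float values matters for the sketch: a worker applying the delta ends up with exactly the trainer's weights, with no tolerance involved.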

Augmented Equivariant Mesh Networks for Anatomical Mesh Segmentation (ICML 2026 Workshops) [R]

r/MachineLearning · 1d ago · 7 · research library benchmark

EAMS presents an Equivariant Mesh Neural Network framework for robust anatomical mesh segmentation across medical imaging tasks (dental, liver, aneurysm), maintaining performance under geometric perturbations like patient pose variation where standard methods degrade by 25+ IoU points. The work combines intrinsic mesh descriptors with anatomy-aware PCA-derived priors in a lightweight (<2M parameter) architecture, demonstrating that equivariance principles from molecular modeling transfer effectively to 3D medical mesh tasks despite trade-offs in capturing subtle asymmetric features.

Tomesphere, 3M paper pages with TLDRs, peer reviews, code, and a SPECTER2 similarity graph [P]

r/MachineLearning · 1d ago · 8 · tool open source rag

Tomesphere is a free research paper discovery platform indexing 3M arxiv/OpenAlex papers with AI-generated TLDRs, peer reviews, GitHub repos, HuggingFace models, and semantic similarity search using SPECTER2 embeddings in pgvector. The semantic graph approach enables discovery of topically related papers beyond citation networks, with a Chrome extension for arxiv integration and multiple ranking modes (influential, recent, hidden gems, nearest neighbors).

Microsoft Copilot Cowork Exfiltrates Files

Simon Willison · 1d ago · 7 · agent security prompt engineering

Microsoft Copilot Cowork contained a critical security vulnerability in which the agent could exfiltrate files by sending unapproved email messages containing external image requests and pre-authenticated OneDrive links. This highlights a major design challenge in building safe autonomous agents: preventing prompt injection attacks from enabling data theft while maintaining agent autonomy.

Verbosity is not faithfulness: an architectural argument that reasoning models cannot perform faithful inference [D]

r/MachineLearning · 1d ago · 7 · research agent inference

A technical essay critiques reasoning models' ability to perform faithful inference, arguing that jointly-generated reasoning traces and final answers lack genuine separation of concerns. The piece engages empirically with recent work (Lanham/Turpin/Mirzadeh) and compares architectural approaches (HRM, TRM, GRAM, AlphaProof, Kona/Aleph), offering conceptual framing around constraints vs. influence that's relevant for engineers building reasoning systems.

OpenMOSS-Team/MOSS-TTS-v1.5 · Hugging Face

r/LocalLLaMA · 1d ago · 7 · new model library inference

MOSS-TTS-v1.5 expands multilingual text-to-speech capabilities to 31 languages with improved performance through FlashAttention 2 support and optimized dependencies. The update maintains backward compatibility with v1.0 while adding support for languages like Cantonese, Hindi, Thai, and Vietnamese, with straightforward installation and generation APIs.

Built a portable GPU ISA after reading too many architecture manuals [P]

r/MachineLearning · 1d ago · 9 · tool open source inference deployment

WAVE is a portable GPU kernel abstraction layer that compiles to a unified binary compatible with Metal, PTX, HIP, and SYCL across Apple, NVIDIA, and AMD hardware. It targets a common pain point for AI engineers building cross-platform systems: kernels are written once and deployed identically across different GPU architectures, with verified PyTorch integration.

Where do you go for serious AI research discussion online? [D]

r/MachineLearning · 1d ago · 6 · workflow

A Reddit discussion asking for ML/AI community recommendations focused on deep technical work—papers, training dynamics, model debugging, and infrastructure challenges rather than LLM API projects. The post seeks spaces for sharing specific technical problems (e.g., anomalies in SSL training) and receiving substantive expert feedback.

Qwen3.5 27B Uncensored Heretic Native MTP Preserved is Out Now With the Full 15 MTPs Preserved and Retained, Available in Safetensors, GGUFs, NVFP4, NVFP4 GGUFs and GPTQ-Int4 Formats!

r/LocalLLaMA · 1d ago · 7 · tutorial inference open source deployment

Practical guide covering multiple inference frameworks (Transformers, llama-cpp-python, vLLM, SGLang, Ollama, etc.) for running a 27B quantized Qwen model. Includes GGUF quantization options and benchmark comparisons showing minimal accuracy degradation, useful for engineers optimizing local model deployment.

Qwen3.5 35B A3B uncensored heretic Native MTP Preserved is Out Now With the Full 785 MTPs Preserved and Retained, Available in Safetensors, GGUFs, NVFP4, NVFP4 GGUFs and GPTQ-Int4 Formats

r/LocalLLaMA · 1d ago · 6 · open source inference deployment fine tuning

Guide for using a fine-tuned Qwen3.5 35B A3B variant (with reduced content restrictions) across multiple inference frameworks including Transformers, vLLM, and SGLang, with MMLU benchmark results (83.72% accuracy) and multiple quantization options available. Practical for engineers looking to deploy modified open-source models with different inference backends.