News Nug

My agent diagnosed a bug in its own system and routed around it unprompted [P]

r/MachineLearning · 54d ago · 8 · agent open source workflow research

Springdrift is a persistent runtime architecture for LLM agents featuring append-only memory, OTP supervision, and passive sensorium (injected self-state context) instead of tool-call-based introspection. The post demonstrates practical advantages through a real example where the agent autonomously diagnosed a missing writer agent without diagnostic tool calls and routed around the error. This workflow design enables LLM agents to serve as collaborative pair programmers on their own systems.
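A minimal sketch of the "passive sensorium" idea described above: self-state is injected into every prompt, so the agent can see a missing component without making a diagnostic tool call. Every name here (`AppendOnlyMemory`, `build_context`, the state keys) is hypothetical and chosen for illustration; none of it is Springdrift's actual API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class AppendOnlyMemory:
    """Hypothetical append-only log: entries can be added, never edited."""
    _entries: List[str] = field(default_factory=list)

    def append(self, entry: str) -> None:
        self._entries.append(entry)

    def entries(self) -> Tuple[str, ...]:
        # Return an immutable snapshot so callers cannot rewrite history.
        return tuple(self._entries)


def build_context(self_state: Dict[str, str],
                  memory: AppendOnlyMemory,
                  user_message: str) -> str:
    """Inject self-state into the prompt instead of exposing it via a tool."""
    state_lines = "\n".join(f"{k}: {v}" for k, v in sorted(self_state.items()))
    memory_lines = "\n".join(memory.entries())
    return (f"[self-state]\n{state_lines}\n"
            f"[memory]\n{memory_lines}\n"
            f"[user]\n{user_message}")


memory = AppendOnlyMemory()
memory.append("delegated report to writer agent")
# Assumed state snapshot: the writer agent is absent from the supervision tree.
state = {"planner_agent": "running", "writer_agent": "missing"}
print(build_context(state, memory, "Please write up the findings."))
```

Because the state block is always present, the model can notice `writer_agent: missing` and plan around it in the same turn.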

Low accuracy (~50%) with SSL (BYOL/MAE/VICReg) on hyperspectral crop stress data — what am I missing? [R]

r/MachineLearning · 54d ago · 7 · research tutorial fine tuning

A practitioner shares a real hyperspectral classification problem with SSL pretraining stuck at ~45-50% accuracy on nitrogen stress detection in crops. The post discusses SSL method choices (BYOL, MAE, VICReg), data augmentation strategies, and model architectures (ViT vs CNN), providing practical debugging insights for domain-specific computer vision tasks.

Looking for help from people who built multi-agent systems [P]

r/MachineLearning · 54d ago · 6 · agent tool benchmark

Engineer shares a chaos engineering framework they built for testing multi-agent systems in production, designed to prevent customer-facing failures. They're seeking collaboration to develop it further and establish benchmarking capabilities for agent reliability.

datasette 1.0a28

Simon Willison · 54d ago · 6 · api update tool workflow

Datasette 1.0a28 fixes breaking changes introduced in the previous alpha release (1.0a27), with development accelerated using Claude Code and the new Claude Opus 4.7 model. While the tool update is niche, the mention of Claude Opus 4.7 and an AI-assisted development workflow shows a practical application of new model capabilities.

browser-harness — Browser Harness | Self-healing harness that enables LLMs to complete any task.

GitHub Trending AI · 54d ago · 7 · tool open source agent workflow

Browser Harness is an open-source framework that connects LLMs directly to Chrome via CDP for autonomous browser automation tasks. It features a self-healing harness that improves through execution, community-contributed domain skills for specific sites, and a free cloud tier with concurrent browser support, so agents can handle complex web interactions without intermediaries.

[AINews] Anthropic Claude Opus 4.7 - literally one step better than 4.6 in every dimension

Latent Space · 54d ago · 9 · new model api update inference benchmark

Claude Opus 4.7 launched with significant improvements: a new tokenizer enabling up to 35% higher token efficiency despite a 50% reduction in overall token usage, vision input expanded to 2,576px (3.75MP) for pixel-perfect multimodal work, and a new 'xhigh' reasoning effort level with an 11-point SWE-Bench Pro improvement on code tasks. Pricing is unchanged at $5/$25 per million tokens, making this a critical update for AI engineers working on coding, computer-use agents, and vision-dependent workflows.

Introducing Claude Opus 4.7 (Product, Apr 16, 2026): Our latest Opus model brings stronger performance across coding, agents, vision, and multi-step tasks, with greater thoroughness and consistency on the work that matters most.

Anthropic Blog · 54d ago · 10 · new model api update benchmark deployment

Claude Opus 4.7 is now generally available, with significant improvements in software engineering, complex multi-step reasoning, and vision; it can autonomously handle coding work that previously needed supervision. The model is accessible via the Claude API (claude-opus-4-7) and all major cloud platforms, keeps Opus 4.6 pricing ($5/$25 per million tokens), and ships with intentionally reduced cybersecurity capabilities and new safeguards for responsible deployment.
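To make the quoted pricing concrete, here is a small cost estimator. It assumes the "$5/$25" figure means $5 per million input tokens and $25 per million output tokens, and it ignores any discounts or surcharges the summary does not mention. The workload numbers are hypothetical.

```python
# Assumed reading of "$5/$25 per million tokens": input price / output price.
INPUT_USD_PER_MTOK = 5.0
OUTPUT_USD_PER_MTOK = 25.0


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost of one request in US dollars."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000


# Hypothetical workload: 200k input tokens and 20k output tokens.
print(f"${estimate_cost_usd(200_000, 20_000):.2f}")  # prints $1.50
```

Output tokens cost five times as much as input tokens under this reading, so long generations dominate the bill well before long prompts do.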

What should happen when you feed impossible moves into a chess-playing language model? [D]

r/MachineLearning · 54d ago · 5

ResBM: a new transformer-based architecture for low-bandwidth pipeline-parallel training, achieving 128× activation compression [R]

r/MachineLearning · 55d ago · 7 · research training inference

ResBM introduces a residual bottleneck architecture for efficient pipeline-parallel training that achieves 128× activation compression while maintaining convergence, directly addressing bandwidth constraints in distributed AI model training. The work combines encoder-decoder bottlenecks with low-rank identity paths and demonstrates practical results using Muon optimization, relevant for engineers optimizing large-scale model training infrastructure.
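A back-of-the-envelope calculation shows why a 128× activation compression matters for pipeline parallelism. The tensor shape and dtype below are hypothetical, and the sketch assumes the ratio applies to the whole activation tensor passed between pipeline stages; it does not model the ResBM architecture itself.

```python
COMPRESSION_RATIO = 128  # ratio quoted in the summary above


def activation_bytes(batch: int, seq_len: int, hidden: int,
                     bytes_per_elem: int = 2) -> int:
    """Size of one inter-stage activation tensor (default: 2-byte bf16)."""
    return batch * seq_len * hidden * bytes_per_elem


# Hypothetical micro-batch: 8 sequences of 2048 tokens, hidden size 4096.
full = activation_bytes(batch=8, seq_len=2048, hidden=4096)
compressed = full // COMPRESSION_RATIO

print(f"{full / 2**20:.0f} MiB -> {compressed / 2**20:.0f} MiB per transfer")
# prints 128 MiB -> 1 MiB per transfer
```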

Anyone else get more excited for new open source models than new flagship ones?

r/LocalLLaMA · 55d ago · 5

Can frontier AI models actually read a painting? [R]

r/MachineLearning · 55d ago · 6 · research benchmark

An experiment testing frontier multimodal models' ability to appraise fine art from vision alone, revealing a gap between visual recognition and commitment to vision-based decisions. The analysis compares image-only vs. image+metadata approaches across GPT-4o, Claude 3.5 Sonnet, Gemini 3.1 Pro, and others, with implications for understanding multimodal model behavior and visual grounding.

Codex for (almost) everything

OpenAI Blog · 55d ago · 7 · tool agent workflow

Codex app gains computer use capability, in-app browsing, image generation, and memory features that enable more autonomous agent behaviors for developers. The plugin system and memory persistence could streamline repetitive coding workflows and integrate with existing development tools.

[AINews] RIP Pull Requests (2005-2026)

Latent Space · 55d ago · 6 · agent workflow deployment api update

Article discusses how AI is changing software development workflows, particularly the potential decline of pull requests and code reviews in favor of prompt-based contributions and agent-oriented development. Covers OpenAI's new Agents SDK with sandbox integrations (Modal, Cloudflare, e2b, Vercel) enabling stateless orchestration + stateful execution patterns, plus Cloudflare's agent tools—relevant for understanding emerging AI agent deployment architectures.

HY-World 2.0 released

r/LocalLLaMA · 55d ago · 8 · new model open source research

HY-World 2.0 is an open-source multimodal world model that generates editable 3D assets (meshes/Gaussian Splatting) from text, images, or videos—a paradigm shift from video-only world models. The framework includes WorldMirror 2.0 for 3D reconstruction and supports interactive exploration, with all model weights and code being released for reproducibility.

Why dynamically routing multi-timescale advantages in PPO causes policy collapse (and a simple decoupled fix) [R]

r/MachineLearning · 55d ago · 8 · research open source fine tuning

An undergraduate researcher identifies and solves a critical optimization pathology in multi-timescale Actor-Critic architectures where temporal attention mechanisms exploit policy gradients ('Surrogate Objective Hacking') or collapse to short-horizon policies. The proposed solution decouples the Actor from multi-timescale Critic representations, forcing robust auxiliary learning while isolating policy updates to long-term advantages, demonstrated via a minimal PyTorch implementation.
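The decoupling described above can be sketched in plain Python: advantages are computed for every timescale so that every critic head gets a learning signal, but the actor is only handed the longest-horizon one. This is an illustrative assumption, not the author's PyTorch implementation; it uses Monte Carlo returns for simplicity, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence


def discounted_returns(rewards: Sequence[float], gamma: float) -> List[float]:
    """Monte Carlo return G_t = r_t + gamma * G_{t+1}, computed backwards."""
    returns: List[float] = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def multi_timescale_advantages(
    rewards: Sequence[float],
    values_by_gamma: Dict[float, Sequence[float]],
) -> Dict[float, List[float]]:
    """One advantage series per discount factor (one per critic head)."""
    return {
        gamma: [g - v for g, v in zip(discounted_returns(rewards, gamma), values)]
        for gamma, values in values_by_gamma.items()
    }


def actor_advantages(advantages_by_gamma: Dict[float, List[float]]) -> List[float]:
    """Decoupled choice: the policy update only sees the longest horizon."""
    return advantages_by_gamma[max(advantages_by_gamma)]


rewards = [0.0, 0.0, 1.0]
# Hypothetical critic predictions (all zero) for a short and a long timescale.
values = {0.5: [0.0, 0.0, 0.0], 0.99: [0.0, 0.0, 0.0]}
advantages = multi_timescale_advantages(rewards, values)
print(actor_advantages(advantages))
```

Because the actor never receives a learned mixture of timescales, there are no mixing weights for the policy gradient to exploit; the short-horizon heads still train as auxiliary targets.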

Built a political benchmark for LLMs. KIMI K2 can't answer about Taiwan (Obviously). GPT-5.3 refuses 100% of questions when given an opt-out. [P]

r/MachineLearning · 55d ago · 8 · benchmark open source research api update

An engineer built an open-source benchmark that scores frontier LLMs (GPT-5.3, Claude Opus 4.6, KIMI K2) on political stance, placing each on a 2D compass using 98 structured questions. Three findings stand out: refusal behavior is itself a measurable political signal; offering an opt-out dramatically changes outputs (Claude flipped quadrants when given permission to decline); and censorship patterns differ by model (KIMI blocks geopolitical content via API errors, while GPT opts out universally when allowed). The repo can be run against any model with an API.

AI for Materials Science starter kit [D]

r/MachineLearning · 55d ago · 5 · tutorial research

A Reddit post asking for learning resources in AI for Materials Science and computational chemistry, mentioning a UChicago course on applied AI. While potentially useful for engineers exploring domain-specific AI applications, it's primarily a community question rather than technical content or a concrete tool/resource announcement.

Introducing GPT-Rosalind for life sciences research

OpenAI Blog · 55d ago · 7 · new model research inference

OpenAI released GPT-Rosalind, a specialized reasoning model optimized for scientific domains like drug discovery, genomics, and protein analysis. While domain-specific, it represents a new model variant worth understanding for engineers building AI applications in biotech and scientific research contexts.

datasette.io news preview

Simon Willison · 55d ago · 7 · tool workflow tutorial

Simon Willison built a custom preview UI using Claude Artifacts to validate YAML news files and catch markdown/YAML errors before deployment. This demonstrates a practical workflow for using Claude's code generation capabilities to reduce friction in content management tasks, leveraging Claude's ability to analyze GitHub repositories directly in conversation.

Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers

HuggingFace Blog · 55d ago · 8 · tutorial fine tuning library rag

Practical tutorial on finetuning Qwen3-VL-Embedding-2B for Visual Document Retrieval (VDR) tasks using Sentence Transformers, demonstrating significant performance gains (NDCG@10: 0.947 vs 0.888 baseline) through domain-specific adaptation. Covers the multimodal training pipeline, dataset construction, and implementation details for engineers building with vision-language models.
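For readers unfamiliar with the metric quoted above, here is a minimal NDCG@k implementation. It uses the linear-gain form of DCG (some libraries use an exponential gain instead) and assumes the relevance list covers every relevant document for the query. It is a reference sketch, not the evaluator used in the tutorial.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: relevance at rank r is divided by log2(r + 1)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given ranking divided by DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical query: the single relevant document is ranked second.
print(round(ndcg_at_k([0, 1, 0, 0]), 4))  # prints 0.6309

# Relative gain implied by the scores quoted in the summary.
baseline, finetuned = 0.888, 0.947
print(f"{(finetuned - baseline) / baseline:.1%}")  # prints 6.6%
```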