News Nug

Latent Space · 14h ago · 7 · api update model benchmark agent

OpenAI released GPT-5.6 with explicit model stratification (Luna/Terra/Sol) and multiple effort levels, creating 36+ configuration variants that confused users and caused faster-than-expected API usage burn. Initial benchmarks show GPT-5.6 excels at agentic coding and presentation tasks while remaining competitive on cost, though OpenAI quickly course-corrected UX regressions and clarified routing defaults after community feedback.

On Adversarial RL [R]

r/MachineLearning · 22h ago · 6 · research agent

A Reddit discussion questions empirical findings from Zhang et al.'s SA-MDP adversarial attack framework when applied to multi-agent PPO policies. The poster reports results that contradict the original paper's claims about critic vs. actor network attacks, specifically when testing IPPO and GPPO variants in VMAS environments with KL-divergence-based PGD attacks.

[AINews] OpenAI launches GPT 5.6 Sol/Terra/Luna, Codex becomes ChatGPT superapp

Latent Space · 1d ago · 9 · new model api update agent benchmark

OpenAI released GPT-5.6 in three sizes (Sol, Terra, Luna) with a new 'ultra' effort level that coordinates four agents in parallel for complex tasks. Terra and Luna achieve better performance than previous flagship models at 1/3 the latency, 1/2 the tokens, and 1/4 the cost, with state-of-the-art results on engineering benchmarks. The release includes expanded API pricing tiers and new capabilities in computer use and long-horizon coding tasks.

Jul 9, 2026 · Case Study · UST is bringing Claude to physical AI

Anthropic Blog · 1d ago · 6 · api update deployment agent

Anthropic is partnering with UST to integrate Claude into hardware validation and chip manufacturing workflows, using Claude Code to automatically generate and run regression tests on hardware designs and validate silicon against digital twins. The partnership targets 20,000 engineers across semiconductor and manufacturing companies, aiming to reduce validation cycle times from 4 days to 48 hours through automated test generation and fault detection.

The new GPT-5.6 family: Luna, Terra, Sol

Simon Willison · 1d ago · 9 · new model api update agent benchmark

OpenAI released the GPT-5.6 family (Luna, Terra, Sol) with significant improvements on agentic performance benchmarks and new API features for reasoning token control. The models offer better cost-efficiency than Claude Fable 5 for agent workflows, though coding performance remains competitive rather than definitively superior.

Introducing Muse Spark 1.1

Simon Willison · 2d ago · 9 · new model api update tool agent

Meta released Muse Spark 1.1 with a new API and claimed improvements in agentic tool calling and computer use capabilities. The post includes a new LLM CLI plugin (llm-meta-ai) for programmatic access to the model, making it immediately useful for engineers building with AI.

ChatGPT is now a partner for your most ambitious work

OpenAI Blog · 2d ago · 7 · agent workflow api update

ChatGPT Work introduces agentic capabilities enabling multi-step task automation across integrated applications and files with extended context persistence. This represents a meaningful evolution in AI agent design for practical workflow automation, though specific technical implementation details and API access patterns would be needed for actionable integration.

[AINews] SpaceXAI launches Grok 4.5, first Opus-class model post Cursor acquisition

Latent Space · 2d ago · 7 · new model tool api update agent inference

Grok 4.5, a new frontier model from xAI trained specifically for coding and agents, launched with a Cursor partnership, offering Opus-class performance with better speed, cost efficiency, and token efficiency. The model is positioned for practical engineering workflows rather than benchmark supremacy, with immediate availability across Cursor, Grok API, OpenRouter, and agent frameworks like Hermes.

Rewriting Bun in Rust

Simon Willison · 2d ago · 8 · agent workflow benchmark

Jarred Sumner describes how AI agents (Claude/Fable) enabled a major rewrite of Bun from Zig to Rust by automating code translation guided by a TypeScript test suite, demonstrating practical agentic engineering patterns like dynamic workflows, adversarial review, and automated loop correction. This case study highlights how frontier LLMs are changing software engineering workflows by making large-scale rewrites feasible through automated code generation validated by conformance testing.

GPT-5.6 Thursday ⭐️, Claude Cowork mobile 📱, Gemini API agents 🤖

TLDR AI · 2d ago · 8 · agent api update tool workflow

MiniMax Code offers a practical AI platform for building multi-step agents with a 1M-token context window, native vision capabilities, and competitive pricing ($500/year for 5.1B tokens). It enables developers to create reasoning agents, visual document processing, and codebase analysis workflows without external vision models.

Why AI Infrastructure must evolve for Agent Experience — Akshat Bubna, Modal CTO

Latent Space · 2d ago · 7 · deployment inference tool agent

Modal raised $355M Series C and is positioning itself as an "agent cloud" platform optimized for AI workloads rather than traditional web applications, with features like serverless functions, elastic inference, GPU snapshotting, sandboxes, and multi-node training. The podcast episode with Modal's CTO covers why traditional cloud infrastructure (Kubernetes) wasn't designed for bursty AI compute, why agents need tighter infrastructure abstractions, and Modal's technical stack including speculative decoding, Auto Endpoints, and capacity pooling across 17 cloud providers.

Agentic safety triggers aren't textual safety triggers — MCP attacks that beat SOTA guardrails more than half the time (code + dataset) [R]

r/MachineLearning · 2d ago · 8 · research agent fine tuning tool prompt engineering

Research demonstrates a critical gap in LLM safety alignment: current text-classification-based guardrails fail against adversarial prompts that encode attacks in tool-call sequences rather than linguistic markers. The study evaluates multiple safety approaches (DPO, SafeDPO, training-free methods) against CVE-based attacks on MCP-enabled agents, showing current SOTA methods only block ~48% of attacks while training-free approaches achieve 3x baseline refusal rates without fine-tuning.

Data for Agents

HuggingFace Blog · 3d ago · 7 · tool open source agent benchmark

NVIDIA released Nemotron Post-Training v3 Prompt Atlas, an interactive tool for exploring millions of synthetic post-training samples across diverse domains. The resource addresses a critical gap for AI engineers building agents: understanding the data composition and training recipes behind model behavior, with emphasis on how synthetic data enables agent reproducibility and inspectability without exposing proprietary datasets.

[AINews] Lilian Weng summarizes 35 papers on Harness Engineering for RSI

Latent Space · 3d ago · 7 · agent research workflow tutorial

Lilian Weng's research recap on harness engineering and its relationship to recursive self-improvement (RSI) provides practical design patterns and an overview of the optimization literature for building agent systems. Multiple platforms (Anthropic, Google, LangChain) are converging around harness-centric agent architectures as the proven approach for long-running workflows, moving away from direct model weight modification.

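Vibe-Research — Vibe-Research: Your Personal Trading Research Agent · A personal investment research agent for A-shares, US stocks, and Hong Kong stocks: daily review, news radar, individual stock data, sector hub, my holdings, and research notes. Vibe-Research supplies the data and features; your own AI drives the investment research.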

GitHub Trending AI · 6d ago · 6 · open source agent api update tool

Vibe-Research is an open-source AI-powered investment research dashboard for A-share, US, and Hong Kong stocks that integrates market data, financial reports, and news feeds with pluggable AI models (Claude, DeepSeek, Qwen, etc.) via API or MCP server. Software engineers building AI applications can use it as a reference architecture for data aggregation, multi-source integration, and AI agent interfaces, though the trading domain may have limited direct applicability.

agent-chief — Attention is your scarcest resource. Chief is the local-first layer that guards it — turning every agent, alert, and feed into one honest call: interrupt, or not.

GitHub Trending AI · 7d ago · 7 · agent open source workflow inference

Chief is an open-source agent orchestration tool that sits between your systems and AI agents to filter, batch, and prioritize notifications/events with deterministic rules and LLM-based judgment. It demonstrates practical patterns for reducing LLM token spend (~$0.10 per 1k events with prompt caching) and preventing alert fatigue through learned per-topic routing policies trained on user feedback signals.
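As a rough illustration of the pattern described above (deterministic rules first, model judgment only for the ambiguous remainder), here is a minimal sketch. It is not Chief's actual code or API: the names `Event`, `triage`, and `llm_judge_stub`, the severity scale, and the thresholds are all hypothetical, and the LLM call is replaced by a keyword stub. The cost helper simply applies the ~$0.10 per 1k events figure quoted in the summary.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Event:
    topic: str
    severity: int  # hypothetical scale: 0 (noise) to 10 (critical)
    text: str


def llm_judge_stub(event: Event) -> bool:
    """Stand-in for an LLM call; here it only looks for an urgent keyword."""
    return "outage" in event.text.lower()


def triage(event: Event, judge: Callable[[Event], bool] = llm_judge_stub) -> str:
    """Return 'interrupt', 'batch', or 'drop' for a single event."""
    # Deterministic rules settle the clear cases without spending any tokens.
    if event.severity >= 8:
        return "interrupt"
    if event.severity <= 2:
        return "drop"
    # Only ambiguous events reach the more expensive model-based judgment.
    return "interrupt" if judge(event) else "batch"


def estimated_cost(num_events: int, usd_per_1k: float = 0.10) -> float:
    """Linear cost estimate from the per-1k-events figure in the summary."""
    return num_events / 1000 * usd_per_1k
```

Putting the rules ahead of the model call is what keeps token spend low in this sketch: only mid-severity events ever invoke the judge.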

[AINews] not much happened today

Latent Space · 36d ago · 7 · new model benchmark agent inference open source

NVIDIA released Nemotron 3 Ultra (550B MoE with 55B active params, 1M context) optimized for agentic workloads with strong benchmarks (47.7 Intelligence Index, 400+ tok/s throughput) and day-0 ecosystem support across vLLM, Modal, Together, and others. Anthropic published research on recursive self-improvement trends showing Claude now authors 80%+ of merged code internally and achieves 76% success on open-ended engineering tasks, with an accompanying framework for measuring AI-coding velocity.

Reality: The Final Eval — Lukas Petersson and Axel Backlund of Andon Labs

Latent Space · 36d ago · 7 · benchmark agent research eval

Andon Labs discusses real-world AI agent evaluation through Vending-Bench, a novel benchmark that tests frontier models operating actual businesses with inventory, finances, and customers rather than traditional exam-style metrics. The article covers practical insights from long-horizon autonomous agents including emergent behaviors like price fixing, deception, and unexpected failure modes that traditional benchmarks miss.

Nemotron 3 Ultra. 550 billion parameters, 55B active. 1 million context

r/LocalLLaMA · 37d ago · 9 · new model open source inference agent deployment

NVIDIA releases Nemotron-3-Ultra-550B, a frontier-scale open-weight LLM with 55B active parameters optimized for agentic reasoning and long-context tasks, available for immediate use via Transformers, vLLM, and SGLang with deployment guides included. The model features a hybrid Latent Mixture-of-Experts architecture combining Mamba-2, MoE, and Attention layers with Multi-Token Prediction for efficient inference.

Faithful uncertainty in LLM agents: calibration vs utility tradeoff in practice [D]

r/MachineLearning · 37d ago · 8 · agent research workflow inference

Deep technical discussion on calibration vs. accuracy in LLM-based agents, drawing from Google research on hallucination reduction. The author shares practical patterns for reducing hallucinated tool calls (from 25% to 5%) using a planning-verification pipeline with confidence-based human review routing, while analyzing the latency-safety tradeoff and the gap between current agent frameworks and confidence-aware control surfaces.
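One way to picture the confidence-based routing step mentioned above is the minimal sketch below. It is an assumption-laden illustration, not the poster's pipeline: `PlannedCall`, `KNOWN_TOOLS`, `route`, and the 0.8 threshold are hypothetical, and the confidence score is assumed to come from some upstream verifier.

```python
from dataclasses import dataclass, field


@dataclass
class PlannedCall:
    tool: str
    confidence: float  # hypothetical verifier score in [0, 1]
    args: dict = field(default_factory=dict)


# Hypothetical registry of tools the agent is actually allowed to call.
KNOWN_TOOLS = {"search", "read_file"}


def route(call: PlannedCall, threshold: float = 0.8) -> str:
    """Decide what happens to a planned tool call before it runs."""
    # Verification: a call to a tool that does not exist is rejected outright.
    if call.tool not in KNOWN_TOOLS:
        return "reject"
    # Low-confidence calls go to a human instead of executing automatically.
    if call.confidence < threshold:
        return "human_review"
    return "execute"
```

Raising the threshold sends more calls to review, which is the latency-safety tradeoff the post analyzes: fewer unreviewed mistakes at the cost of slower runs.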