News Nug

Simplex rethinks software development with Codex

OpenAI Blog · 34d ago · 5 · workflow api update

Simplex integrates ChatGPT Enterprise and Codex to accelerate software development cycles across design, build, and testing phases. The focus is on organizational workflow scaling rather than new technical capabilities or tools developers can directly adopt.

Weights & Biases New Master Service Agreement Questions [D]

r/MachineLearning · 35d ago · 6 · api update deployment

Weights & Biases updated their Master Service Agreement with concerning changes to data ownership and usage rights—specifically removing explicit customer ownership statements and adding provisions allowing W&B to use customer data (including ML models and training logs) for product development and AI feature training without clear opt-out mechanisms. For engineers shipping with wandb for experiment tracking and model management, this represents a potential IP and data governance issue worth understanding before the May 11th effective date.

Higher usage limits for Claude and a compute deal with SpaceX (Announcements, May 6, 2026)

Anthropic Blog · 35d ago · 7 · api update deployment inference

Anthropic has doubled Claude Code rate limits, removed peak-hour restrictions for Pro/Max users, and significantly increased Claude API rate limits for Opus models. The changes are backed by new compute capacity, including a SpaceX partnership providing 220,000+ NVIDIA GPUs. Engineers using the Claude API and Claude Code should review the updated rate limits table to understand the new quotas for their applications.

ZAYA1-8B: Frontier intelligence density, trained on AMD

r/LocalLLaMA · 35d ago · 8 · new model open source benchmark inference

Zyphra released ZAYA1-8B, a new 8B parameter MoE model trained on AMD hardware that achieves strong performance on reasoning, math, and coding tasks while using <1B active parameters. The model features novel architectural innovations (Compressed Convolutional Attention, MLP-based routing, learned residual scaling) and a Markovian-RSA test-time compute methodology, available as a serverless endpoint on Zyphra Cloud.
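Below is a generic top-k Mixture-of-Experts routing sketch with a small MLP router. It shows how an MoE can hold many parameters while each token activates only a few experts. It is not Zyphra's architecture: the ReLU router, the linear "experts", and the function names `mlp_router_topk` and `moe_forward` are all illustrative assumptions.

```python
import numpy as np

def mlp_router_topk(x, w1, w2, k):
    """Pick the top-k experts per token with a 2-layer MLP router (hypothetical design).

    x: (tokens, d_model); w1: (d_model, d_hidden); w2: (d_hidden, n_experts).
    Returns expert indices (tokens, k) and mixing weights (tokens, k) that sum to 1.
    """
    h = np.maximum(x @ w1, 0.0)                   # ReLU hidden layer of the router
    logits = h @ w2                               # (tokens, n_experts) routing scores
    topk = np.argsort(-logits, axis=-1)[:, :k]    # indices of the k highest scores
    sel = np.take_along_axis(logits, topk, axis=-1)
    sel = np.exp(sel - sel.max(axis=-1, keepdims=True))
    weights = sel / sel.sum(axis=-1, keepdims=True)  # softmax over the selected experts only
    return topk, weights

def moe_forward(x, experts, w1, w2, k):
    """experts: (n_experts, d_model, d_model), toy linear stand-ins for expert MLPs."""
    topk, weights = mlp_router_topk(x, w1, w2, k)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(k):                        # only k experts run for this token
            out[t] += weights[t, j] * (x[t] @ experts[topk[t, j]])
    return out
```

With `k=2` of 8 experts, each token touches only a quarter of the expert weights. This is the general mechanism behind a small active-parameter count relative to total size.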

vLLM V0 to V1: Correctness Before Corrections in RL

HuggingFace Blog · 35d ago · 7 · inference research workflow

Technical deep-dive on migrating vLLM from V0 to V1 for online RL workloads (GSPO/PPO), covering critical fixes for logprob processing, runtime defaults, weight updates, and prefix caching behavior that affected training convergence. The post provides practical debugging methodology for inference engine parity testing in RL systems.
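A common form of inference-engine parity check works as follows: score the same sampled tokens in both the inference engine and the trainer, then compare the logprobs. The sketch below is a hypothetical helper (`logprob_parity`, with an assumed `atol` threshold) and is not the post's code. The KL figure uses the widely used "k3" estimator (r − 1) − log r, where r = π_trainer / π_sampler per token.

```python
import math

def logprob_parity(sampler_logprobs, trainer_logprobs, atol=1e-3):
    """Compare logprobs two engines assign to the *same* sampled tokens.

    sampler_logprobs: per-token logprobs reported by the inference engine.
    trainer_logprobs: logprobs recomputed by the training-side model for those tokens.
    """
    if len(sampler_logprobs) != len(trainer_logprobs) or not sampler_logprobs:
        raise ValueError("need two non-empty, token-aligned sequences")
    diffs = [abs(s - t) for s, t in zip(sampler_logprobs, trainer_logprobs)]
    # r = pi_trainer / pi_sampler per token; r == 1 everywhere under perfect parity.
    ratios = [math.exp(t - s) for s, t in zip(sampler_logprobs, trainer_logprobs)]
    # k3 estimator of KL(sampler || trainer): each term is >= 0, zero only when r == 1.
    kl_k3 = sum((r - 1) - math.log(r) for r in ratios) / len(ratios)
    return {
        "max_abs_diff": max(diffs),
        "mean_abs_diff": sum(diffs) / len(diffs),
        "kl_k3": kl_k3,
        "frac_over_atol": sum(d > atol for d in diffs) / len(diffs),
    }
```

Tracking these numbers before and after a change to defaults, weight syncing, or prefix caching makes silent mismatches visible. Left undetected, those mismatches can quietly hurt RL convergence.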

opensquilla — OpenSquilla — Token-Efficient AI Agent with same budget, higher intelligence density

GitHub Trending AI · 35d ago · 7 · tool agent open source deployment

OpenSquilla is a token-efficient AI agent framework that supports multi-provider LLM routing (OpenAI, Anthropic, Ollama, etc.) with built-in features like persistent memory, web search, and local embeddings. It offers flexible deployment options including Web UI, CLI, and chat integrations, making it practical for engineers building AI applications with cost optimization in mind.

Vibe coding and agentic engineering are getting closer than I'd like

Simon Willison · 35d ago · 7 · workflow prompt engineering agent

Simon Willison discusses the blurring line between 'vibe coding' (AI-assisted coding, typically by non-programmers, with little concern for code quality) and 'agentic engineering' (professional developers using AI tools while maintaining code standards). He notes that as AI coding agents become more reliable, even experienced engineers are reviewing less of the generated code while still maintaining production quality. The key insight is that modern AI coding tools let engineers take on a significantly larger scope of problems while maintaining or improving code quality, fundamentally changing the engineering paradigm.

tokenspeed — TokenSpeed is a speed-of-light LLM inference engine.

GitHub Trending AI · 35d ago · 8 · tool inference open source

TokenSpeed is a new high-performance LLM inference engine optimized for agentic workloads, combining TensorRT-LLM-level performance with vLLM-level usability. Currently in preview release, it demonstrates competitive results on modern hardware (B200) but is not yet production-ready, making it worth tracking for its runtime design innovations.

AlphaEvolve: How our Gemini-powered coding agent is scaling impact across fields

DeepMind Blog · 35d ago · 7 · agent research benchmark

AlphaEvolve, Google's Gemini-powered coding agent for algorithm design, has demonstrated significant real-world impact across domains including genomics (30% error reduction in DNA sequencing), power grid optimization (88% improvement in feasibility), and quantum computing (10x error reduction). The system represents a practical advancement in AI-assisted algorithm optimization that engineers building with LLMs should understand as a reference implementation of agentic problem-solving.

[AINews] Silicon Valley gets Serious about Services

Latent Space · 35d ago · 6 · agent deployment workflow

Anthropic and OpenAI are launching services companies (funded with $1.5B and $4B, respectively) to handle enterprise deployment and system integration of AI agents. Both labs recognize that model capability alone isn't sufficient: organizations need help with context management, workflow modernization, and adoption. This signals a shift toward "last-mile" services businesses, as AI labs see opportunities in the operational work required to integrate agents into real business processes.

Bleeding Llama: Critical Unauthenticated Memory Leak in Ollama

r/LocalLLaMA · 35d ago · 8 · open source security inference

Critical vulnerability (CVE-2026-7482, CVSS 9.1) discovered in Ollama enabling unauthenticated memory leaks from the Ollama process, potentially exposing user prompts, system prompts, and environment variables across 300,000+ servers. The article provides technical details on Ollama's API architecture and how the vulnerability works through the /api/create and /api/blobs endpoints.

Transformers with Selective Access to Early Representations [R]

r/MachineLearning · 35d ago · 7 · research architecture inference

SATFormer introduces a more efficient alternative to recent Transformer variants by replacing static cross-layer pathways with per-token, per-head gating that selectively reuses first-layer representations. The method achieves better efficiency-performance tradeoffs (1.75-1.82× higher throughput than competitors) while improving validation loss at 130M-1.3B scale and showing strong results on retrieval-intensive tasks.
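The contrast between a static cross-layer pathway and per-token, per-head gating can be sketched as follows. The exact SATFormer formulation (which first-layer tensor is reused and how the gate is parameterized) is not given in this summary. The additive form, the sigmoid gate, and the names `gated_first_layer_reuse`, `static_first_layer_reuse`, and `w_gate` are therefore assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def static_first_layer_reuse(h_cur, h_first, alpha):
    """Static pathway: one fixed mixing weight shared by every token and head."""
    return h_cur + alpha * h_first

def gated_first_layer_reuse(h_cur, h_first, x, w_gate):
    """Selective reuse: a separate gate per token and per head (hypothetical form).

    h_cur, h_first: (tokens, heads, d_head) current-layer and first-layer per-head features.
    x: (tokens, d_model) input used to compute the gates.
    w_gate: (d_model, heads) gate projection.
    """
    g = sigmoid(x @ w_gate)                 # (tokens, heads), each gate in (0, 1)
    return h_cur + g[..., None] * h_first   # gate near 0 means "ignore the early representation"
```

The gated version can learn to pull in early representations only for the tokens and heads that benefit. The static pathway applies the same mix everywhere.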

Adding Benchmaxxer Repellant to the Open ASR Leaderboard

HuggingFace Blog · 35d ago · 6 · benchmark research open source

The Open ASR Leaderboard now includes private, high-quality English speech datasets from Appen and DataoceanAI to discourage benchmark gaming while keeping evaluation metrics standardized. The leaderboard has reached 710K visits since launch and emphasizes standardization and openness in benchmarking, with optional toggles that show how the private datasets affect each model's scores.
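To make "standardized evaluation metrics" concrete, here is a minimal word error rate (WER) implementation. WER is the standard ASR accuracy metric: word-level edit distance divided by the number of reference words. Real leaderboards apply a fuller text normalizer before scoring; the lowercasing here is a simplifying assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    # Minimal normalization only; production evaluation uses a fuller text normalizer.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        cur = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            substitute = prev[j - 1] + (ref[i - 1] != hyp[j - 1])
            cur[j] = min(prev[j] + 1,      # deletion: reference word missing from hypothesis
                         cur[j - 1] + 1,   # insertion: extra word in hypothesis
                         substitute)       # match (cost 0) or substitution (cost 1)
        prev = cur
    return prev[-1] / len(ref)
```

For example, `word_error_rate("the cat sat", "the bat sat")` is one substitution over three reference words. Because insertions count, WER can exceed 1.0.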

MTP on strix halo with llama.cpp (PR #22673)

r/LocalLLaMA · 36d ago · 5

Competition - League of Robot Runners 2026: Multi-robot coordination under uncertainty [N]

r/MachineLearning · 36d ago · 6 · agent benchmark

League of Robot Runners (LoRR) 2026 is a research competition focused on large-scale multi-robot coordination using ML/RL methods for task scheduling and path planning under uncertainty. The competition provides starter kits in C++/Python, automated evaluation with live leaderboards, and welcomes diverse technical approaches including RL, search, optimization, and hybrid techniques.

🔬Doing Vibe Physics — Alex Lupsasca, OpenAI

Latent Space · 36d ago · 6 · prompt engineering workflow research

The article explores the 'Jagged Frontier' concept, where modern LLMs like GPT-5 show dramatic capability improvements at research and science frontiers while appearing incremental for everyday tasks. It features physicist Alex Lupsasca using AI (o3/GPT-5) to accelerate theoretical physics research, reproducing complex papers in minutes through prompt engineering techniques like 'priming' with textbook problems.

Agents for financial services (Announcements, May 5, 2026)

Anthropic Blog · 36d ago · 8 · agent tool api update deployment

Anthropic released 10 pre-built agent templates for financial services workflows (pitchbooks, KYC screening, month-end closing), deployable as Claude plugins or managed agents. The release also adds native integrations with Microsoft 365 apps and an expanded MCP/connector ecosystem for real-time data access. The templates package skills, data connectors, and subagents as reference architectures that teams can adapt and deploy in days; Claude Opus 4.7 scores 64.37% on Vals AI's Finance Agent benchmark.

Dense Model Shoot-Off: Gemma 4 31B vs Qwen3.6/5 27B... Result is Slower is Faster.

r/LocalLLaMA · 36d ago · 7 · benchmark inference research

Comprehensive benchmark comparison of Qwen3.6 27B, Qwen3.5 27B, and Gemma 4 31B across accuracy, latency, and token-efficiency metrics, with extended analysis of thinking-enabled modes. Results show Qwen3.6 excels on math and knowledge tasks but underperforms on instruction-following and some reasoning benchmarks. These task-specific trade-offs matter for practitioners choosing between the models.

Production AI very different from the demos [D]

r/MachineLearning · 36d ago · 7 · deployment workflow

A software engineer shares the challenges of managing LLM API costs in production, specifically the difficulty of tracking token usage and spend per feature when moving from prototypes to scaled deployments. The core issue is a lack of cost-attribution granularity: OpenAI dashboards show total spend but no per-feature breakdown. That forces manual reconciliation, which doesn't scale and yields numbers the team can't fully trust.
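One common workaround (not taken from the post) is to tag every API call with the feature that made it. The token counts the provider returns with each response can then be recorded against that tag. In the sketch below, the `Usage` record, the `CostLedger` class, and the prices in `PRICES_PER_MTOK` are all hypothetical; substitute your provider's current per-token rates.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical per-million-token prices; replace with your provider's current rates.
PRICES_PER_MTOK = {"model-a": {"input": 1.00, "output": 4.00}}

@dataclass
class Usage:
    feature: str        # product feature that made the call, e.g. "search-summary"
    model: str
    input_tokens: int   # token counts as reported alongside each API response
    output_tokens: int

class CostLedger:
    """Accumulates spend per feature instead of one account-wide total."""

    def __init__(self, prices):
        self.prices = prices
        self.totals = defaultdict(float)

    def record(self, usage: Usage) -> float:
        p = self.prices[usage.model]
        cost = (usage.input_tokens * p["input"] + usage.output_tokens * p["output"]) / 1_000_000
        self.totals[usage.feature] += cost
        return cost

    def by_feature(self) -> dict:
        # Most expensive features first.
        return dict(sorted(self.totals.items(), key=lambda kv: kv[1], reverse=True))
```

Calling `ledger.record(...)` right after each request keeps attribution automatic. In production, the same records would usually go to a metrics store rather than an in-memory dict.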

TritonSigmoid: A fast, padding-aware sigmoid attention kernel for GPUs [R]

r/MachineLearning · 36d ago · 8 · library open source inference benchmark

TritonSigmoid is an open-source GPU kernel implementing sigmoid attention with native padding awareness, achieving 515 TFLOPS on H100 and outperforming softmax/FlashAttention on variable-length sequences. Designed for single-cell biology models where multi-token attention is semantically required, it demonstrates both computational efficiency and empirical improvements in loss and representation quality across benchmarks.
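The NumPy reference below shows the math a padding-aware sigmoid attention kernel has to get right; it is not the TritonSigmoid kernel. Softmax handles padding by setting masked scores to −inf before normalizing. Sigmoid weights are independent per key, with no normalization across keys, so padded keys must be zeroed explicitly or they add directly to the output. Independent weights are also why a query can attend strongly to several tokens at once. The function name and the optional `bias` term are assumptions.

```python
import numpy as np

def sigmoid_attention(q, k, v, key_mask, bias=0.0):
    """Reference sigmoid attention with a key padding mask.

    q: (Lq, d); k, v: (Lk, d); key_mask: (Lk,) bool, True for real tokens, False for padding.
    bias: optional scalar added to every score (an assumption; some work uses a length-based bias).
    """
    scores = q @ k.T / np.sqrt(q.shape[-1]) + bias
    weights = 1.0 / (1.0 + np.exp(-scores))               # elementwise sigmoid; keys do not compete
    weights = np.where(key_mask[None, :], weights, 0.0)   # padded keys contribute exactly zero
    return weights @ v
```

A useful property to check: padding a sequence and masking the padding leaves the outputs for the real tokens unchanged.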