Artificial Analysis Launches AA-Briefcase, Nemotron 3 Ultra Tops Open Models

June 26, 2026 at 16:19 EDT

Topics: Foundation Models, AI Agents, Research & Papers

In June 2026, Artificial Analysis released AA-Briefcase, a new benchmark that simulates multi-week knowledge-work projects. NVIDIA's open-weight model Nemotron 3 Ultra ranked among the top open models on the benchmark's long-running agentic tasks.

June 18 · Artificial Analysis

AA-Briefcase puts AI to work for weeks — and most models barely cope

A new benchmark grades models on realistic multi-week knowledge work across 91 tasks and thousands of messy files. Claude Fable 5 leads overall; NVIDIA's open-weight Nemotron 3 Ultra ranks among the top open models.

The hard truth

Even the best models fully solved only a sliver of real knowledge-work tasks.

~3% of tasks fully solved by top models

91 tasks across 4 expert-built scenarios

token context for Nemotron 3 Ultra

Leaderboard · AA-Briefcase Elo

Claude Fable 5 (closed): 1587
Claude Opus 4.8 (closed): 1356
GLM-5.2 (open-weight leader): 1266
GPT-5.5 (closed): 1159
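
To give a feel for what these Elo gaps could mean, the sketch below converts a rating difference into an expected head-to-head score. It assumes the conventional Elo logistic curve with a 400-point scale; the article does not say whether Artificial Analysis uses exactly that formula, so treat the results as illustrative. The pairing of scores to models follows the order in which they appear in the chart, and the function name `elo_expected_score` is made up for this example.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the conventional Elo model.

    Assumption: standard logistic curve with a 400-point scale. The
    benchmark's actual rating method is not described in the article.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Elo values as reported on the AA-Briefcase leaderboard.
leaderboard = {
    "Claude Fable 5": 1587,
    "Claude Opus 4.8": 1356,
    "GLM-5.2": 1266,
    "GPT-5.5": 1159,
}

leader = "Claude Fable 5"
for name, rating in leaderboard.items():
    if name == leader:
        continue
    p = elo_expected_score(leaderboard[leader], rating)
    print(f"{leader} vs {name}: expected score {p:.2f}")
```

Under that assumption, the 231-point gap between the top two models corresponds to an expected score of roughly 0.79 for the leader in a pairwise comparison.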

Nemotron 3 Ultra · throughput edge

Hybrid Transformer-Mamba design scales linearly at long context. Agentic-workflow throughput vs other open models (NVIDIA settings).

5.9× vs GLM-5.1
4.8× vs Kimi K2.6
1× baseline

Strengths

Fast and cheap in long-running agent loops
Little slowdown as context accumulates
Holds up across hundreds of tool calls
Open weights, data & recipes published

Limitations

550B MoE demands heavy inference resources
Roughly 1.1 TB of VRAM for weights alone at FP16 (550B parameters × 2 bytes)
Large memory even when quantized
Not top-tier on short tasks
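
The memory limitations above follow from simple arithmetic on the parameter count. The sketch below estimates weight storage only, using the 550B total parameter figure from the article; the precision list and the helper name `weight_memory_gb` are illustrative choices, and real deployments also need room for the KV cache, activations, and runtime overhead. With a mixture-of-experts model, all experts normally have to be resident in memory, so the total count matters here rather than the 55B active per token.

```python
TOTAL_PARAMS = 550e9  # total parameters, per the article

# Bytes per parameter at common precisions (illustrative list).
BYTES_PER_PARAM = {
    "FP16": 2.0,
    "FP8": 1.0,
    "4-bit": 0.5,
}


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params * bytes_per_param / 1e9


for precision, nbytes in BYTES_PER_PARAM.items():
    gb = weight_memory_gb(TOTAL_PARAMS, nbytes)
    print(f"{precision}: about {gb:,.0f} GB for weights")
```

By this estimate the weights take about 1,100 GB at FP16 and still about 275 GB at 4-bit, which is why the model stays demanding even when quantized.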

Pricing · DeepInfra API

$0.50 per 1M input tokens · $2.20 per 1M output tokens — 550B total / 55B active params
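
For a rough sense of what these rates mean in a long-running agent loop, the sketch below prices a hypothetical run. Only the two per-token prices come from the article; the workload (200 tool calls, each re-sending about 60,000 tokens of context and producing about 800 output tokens) is an invented example, and it assumes no prompt caching or other discounts.

```python
INPUT_USD_PER_M = 0.50   # USD per 1M input tokens (DeepInfra, per the article)
OUTPUT_USD_PER_M = 2.20  # USD per 1M output tokens (DeepInfra, per the article)


def run_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of a run at the listed rates, with no caching or discounts."""
    return (input_tokens / 1e6) * INPUT_USD_PER_M + (output_tokens / 1e6) * OUTPUT_USD_PER_M


# Hypothetical workload, not taken from the benchmark.
tool_calls = 200
context_tokens_per_call = 60_000
output_tokens_per_call = 800

total_in = tool_calls * context_tokens_per_call
total_out = tool_calls * output_tokens_per_call
print(f"input tokens:  {total_in:,}")
print(f"output tokens: {total_out:,}")
print(f"estimated cost: ${run_cost_usd(total_in, total_out):.2f}")
```

In this example nearly all of the roughly $6.35 total comes from re-sent input context rather than generated output, which is typical of loops that accumulate history across many tool calls.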
