In June 2026, Artificial Analysis released AA-Briefcase, a new benchmark that simulates multi-week knowledge-work projects. NVIDIA's open-weight model Nemotron 3 Ultra ranked among the top open models on long-running agentic tasks.
June 18 · Artificial Analysis
AA-Briefcase puts AI to work for weeks — and most models barely cope
A new benchmark grades models on realistic multi-week knowledge work across 91 tasks and thousands of messy files. Claude Fable 5 leads overall; NVIDIA's open-weight Nemotron 3 Ultra ranks among the top open models.
The hard truth
Even the best models fully solved only a sliver of real knowledge-work tasks.
~3% of tasks fully solved by top models
91 tasks across 4 expert-built scenarios
1M-token context for Nemotron 3 Ultra
Leaderboard · AA-Briefcase Elo
Column height proportional to Elo · closed in orange, open-weight in green
Claude Fable 5 · Closed
Claude Opus 4.8 · Closed
GLM-5.2 · Open leader
GPT-5.5 · Closed
Nemotron 3 Ultra · throughput edge
Hybrid Transformer-Mamba design scales linearly at long context. Chart: agentic-workflow throughput vs. other open models (NVIDIA settings).
Strengths
Fast and cheap in long-running agent loops
Little slowdown as context accumulates
Holds up across hundreds of tool calls
Open weights, data & recipes published
Limitations
550B MoE demands heavy inference resources
About 1.1 TB of VRAM for the weights alone at FP16 (550B parameters × 2 bytes)
Large memory even when quantized
Not top-tier on short tasks
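
The memory limitation can be checked with back-of-the-envelope arithmetic. The sketch below is a weights-only estimate, using the 550B total parameter count quoted in the article. The bytes-per-parameter figures are the standard sizes for each precision, not numbers from the benchmark, and the function name is hypothetical. It ignores KV cache, activations, and runtime overhead, so real deployments would need more.

```python
# Weights-only memory estimate for a 550B-parameter model.
# Ignores KV cache, activations, and runtime overhead (assumption).
TOTAL_PARAMS = 550e9  # total parameter count quoted in the article

# Standard storage sizes per parameter; not figures from the article.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params: float, precision: str) -> float:
    """Return weights-only memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_memory_gb(TOTAL_PARAMS, precision):,.0f} GB")
```

Because all experts of a mixture-of-experts model must normally be resident in memory, the estimate uses total rather than active parameters. Even 4-bit quantization leaves roughly 275 GB of weights.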
Pricing · DeepInfra API
$0.50 per 1M input tokens · $2.20 per 1M output tokens — 550B total / 55B active params
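
To get a feel for what these rates mean in a long-running agent loop, here is a minimal cost sketch. The per-token prices are the ones listed above. The workload (200 tool calls, each re-sending about 50k tokens of context and generating about 1k tokens) is purely hypothetical, as is the function name, and the estimate assumes no prompt-caching discount.

```python
# Listed DeepInfra rates from the article, in USD per 1M tokens.
INPUT_USD_PER_M = 0.50
OUTPUT_USD_PER_M = 2.20


def run_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of a run at the listed rates, with no caching discount (assumption)."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


# Hypothetical agent run: 200 tool calls, ~50k input and ~1k output tokens each.
calls = 200
cost = run_cost_usd(calls * 50_000, calls * 1_000)
print(f"${cost:.2f}")  # prints $5.44
```

In this made-up workload, input tokens account for $5.00 of the $5.44 total, which suggests the input rate matters most when context is re-sent on every call.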