New AA-Briefcase Benchmark Exposes How Badly AI Struggles With Real Knowledge Work
There is a new test for AI called AA-Briefcase. It is a benchmark, which is a standard test used to score and compare different AI models. (An AI model is a computer program that learns from lots of data and can answer questions or do tasks.) This new test shows that even the best AI models are very bad at real office work. A firm named Artificial Analysis built the test and shared the results on June 19, 2026. The big finding is simple: the best model fully finished only 3 percent of the tasks, or about 3 out of every 100.
Knowledge work means the thinking jobs people do at a desk. This includes research, planning, and writing reports. The test was made to feel like that real work.
What the AA-Briefcase benchmark tests
AA-Briefcase does not ask quick, easy questions. Instead, it gives an AI model big projects that span many weeks of work.
The facts are hidden in messy, real-world places. This includes Slack chats, emails, notes from meetings, and big files full of data. (Slack is a chat app many offices use to talk.) The model must find the facts in all of them and join them together.
This is hard because no single place has the full answer. The model must link clues across many files. That is exactly what people do at work.
The results: almost nobody passes
The test has 91 tasks in all. A pass rate is the share of a task that a model gets right, so a 50 percent pass rate means getting half of it right. On 31 of the 91 tasks, no model reached even a 50 percent pass rate. That is about one task in three, which shows how hard the test is.
The best model was Claude Fable 5, made by a company called Anthropic. Even so, it fully finished only 3 percent of all the tasks. That is a shocking number for a top model.
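To see these counts side by side, here is a small Python sketch of the arithmetic. It uses only the reported figures (91 tasks, 31 tasks, 3 percent). The variable names are made up for this example. It also assumes the 3 percent is measured over all 91 tasks, which the report implies but does not spell out.

```python
# Reported figures from the article.
total_tasks = 91
tasks_no_model_half = 31      # tasks where no model reached a 50% pass rate
top_full_solve_rate = 0.03    # best model's share of fully finished tasks

# Share of the benchmark where every model stayed under 50%.
share_unpassed = tasks_no_model_half / total_tasks

# Assumption: the 3% is measured over all 91 tasks.
approx_full_solves = top_full_solve_rate * total_tasks

print(f"No model reached 50% on {share_unpassed:.0%} of tasks")
print(f"3% of {total_tasks} tasks is about {approx_full_solves:.1f} tasks")
```

If that assumption holds, the top model fully finished roughly 3 of the 91 tasks.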
Benchmarks & specs
| Metric (as reported) | Figure |
|---|---|
| Benchmark name | AA-Briefcase |
| Creator | Artificial Analysis |
| Total tasks | 91 |
| Tasks where no model reached a 50% pass rate | 31 |
| Top model | Claude Fable 5 (Anthropic) |
| Top full-solve rate | 3% |
| Lowest-cost model | DeepSeek V4 Flash |
| Cost range per task | ~$0.04 to $31+ (about 800x gap) |
| Results published | June 19, 2026 |
What it means: Even the best AI model fully finishes only a tiny part of real office projects. And the price to run a task can differ by about 800 times from one model to the next.
Why models fail in different ways
The test found that weak and strong models fail for different reasons.
Weak models often fail at the simple steps. They cannot even do the easy parts of the job.
Strong models do the easy parts well. But they miss small details. Those details need facts joined from many places at once. Pulling facts from many sources and tying them together is called synthesis. That is the real wall they hit.
The huge cost gap
Models do not all cost the same to run. The price per task ranged from about $0.04 to more than $31. That is a gap of about 800 times.
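The 800 times figure is a rounded number. A quick Python check with the two reported prices shows where it comes from. The variable names are made up for this example, and the prices are written in cents so the division is exact.

```python
# Reported price range per task, written in cents.
low_cents = 4        # about $0.04
high_cents = 3100    # "more than $31", so the real top price is higher

ratio = high_cents / low_cents
print(f"Gap from the reported prices: about {ratio:.0f}x")
```

The two reported prices give about 775 times. Since the top price was more than $31, the real gap is somewhat bigger, which fits the rounded 800 times figure.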
DeepSeek V4 Flash was the cheapest one in the test. A cheap model is not always far behind on these very hard tasks. That is because almost every model did badly.
Why it matters (especially for India and founders)
Many founders want to hand whole jobs to AI agents. An AI agent is an AI that tries to do a full task on its own. This test is a clear wake-up call. On real work that takes weeks, even the best model finishes almost nothing fully.
Indian startups watch their costs closely. So the 800 times price gap is a big deal. A cheap model like DeepSeek V4 Flash may give similar results on hard tasks for far less money. It is smart to test models before paying for the most costly one.
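One simple way to test models on your own work is to compare the cost per fully solved task. The Python sketch below shows the idea. Everything in it is hypothetical: the function name, the model labels, and all the numbers are placeholders for illustration and are not results from AA-Briefcase.

```python
def cost_per_solved_task(tasks_run, tasks_fully_solved, cost_per_task):
    """Total spend divided by fully solved tasks (infinity if none were solved)."""
    if tasks_fully_solved == 0:
        return float("inf")
    return tasks_run * cost_per_task / tasks_fully_solved

# Made-up trial numbers for illustration only; these are not benchmark results.
trials = {
    "cheap_model": {"tasks_run": 20, "tasks_fully_solved": 4, "cost_per_task": 0.05},
    "costly_model": {"tasks_run": 20, "tasks_fully_solved": 5, "cost_per_task": 30.00},
}

# Score each model, then pick the one with the lowest cost per solved task.
scores = {name: cost_per_solved_task(**t) for name, t in trials.items()}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: ${score:.2f} per solved task")
print("Best value:", best)
```

With these placeholder numbers, the costly model solves one more task but costs far more for each task it solves. Your own trial may come out differently, which is the reason to run it.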
The best plan is to use AI for parts of a job, not the whole job. And always keep a person in the loop to check the work.
FAQ
What is the AA-Briefcase benchmark?
It is a test from Artificial Analysis. It checks how well AI models handle real work that takes weeks. The work uses messy sources like emails and Slack chats.
Which model did best?
Claude Fable 5 from Anthropic did best. But it still fully finished only 3 percent of the tasks.
How big was the cost difference?
The price per task ranged from about $0.04 to more than $31. That is a gap of about 800 times. DeepSeek V4 Flash was the cheapest.
Why is this work so hard for AI?
The answers are spread across many sources. Even strong models miss details. Those details need facts joined from several files at once.
Takeaway
AA-Briefcase shows that AI cannot fully run office work yet. The best model fully finishes only 3 percent of the test's tasks. And the cost per task can differ by about 800 times between models. For now, AI is a strong helper, not a full replacement. Smart teams will test cheaper models before paying top dollar.
Source: The Decoder