Sakana AI Fugu: A Multi-LLM System That Matches Anthropic’s Frontier Models

A Japanese company called Sakana AI has made a new AI tool named Fugu. It works in an unusual way. Most companies build one giant AI model. Sakana did the opposite. Sakana AI Fugu takes many AI models that already exist and makes them work together as a team.

The company says this team beats or matches the best AI tools you can buy. That includes the top models made by a company called Anthropic. These top models are called “frontier models.” A frontier model just means one of the most advanced AI models that exists right now.

The idea is simple and clever. Why build one huge brain when you can join many smaller brains? Each one can help cover the others’ weak spots. Below we explain what Fugu is, how it works, what its test scores show, and what the downsides are.

What is Sakana AI Fugu?

First, two quick words. An LLM (a large language model) is the kind of AI that runs chatbots and coding helpers. It reads words and writes words back. A benchmark is a standard test that scores and compares these models. Think of it like a school exam.

Fugu is one system that runs many AI models at the same time. But to you, it looks and feels like a single model. You send one request through one API. (An API is just a way for one piece of software to talk to another.) Then Fugu decides what to do on its own, behind the scenes.

The group of models it uses is “swappable.” That means Sakana can add, remove, or change which models it uses. It can do this without breaking the service.

There are two versions. There is a normal Fugu and a stronger one called Fugu Ultra. The report says Fugu launched on June 22, 2026. It started with about 500 beta users. (Beta users are early testers who try a product before everyone else can.)

How does multi-LLM orchestration work?

Orchestration means guiding many parts so they work together well. Picture a conductor leading a music band. An ensemble is a close idea. It means joining several models so the group does better than any one model alone. Fugu uses both ideas.

Here is the simple version of how it works. When a job comes in, Fugu runs the whole process inside itself:

  • Selection: it picks which model or models should do the job.
  • Delegation: it can hand the work to a special model, or even to a copy of itself.
  • Checks: it looks over the answers it gets back.
  • Synthesis: it mixes those answers into one final reply. (Synthesis just means joining parts into one whole.)

All of this runs through one API that works like OpenAI’s. In plain words, developers can plug Fugu in much like they plug in other popular AI tools. One thing to note: the Anthropic models that Fugu is compared to are not part of Fugu’s own team. The report says they are not open for this kind of use.

The benchmark results

The reported scores show Fugu and Fugu Ultra going back and forth with the best single models. The tests check coding, science questions, thinking, and reading long documents. In every row, a higher number is better. The numbers below are copied exactly as the report published them. Nothing was added or guessed.

Benchmarks & specs

Benchmark / specFuguFugu UltraAnthropic Opus 4.8Gemini 3.1 ProGPT 5.5
SWE Bench Pro59.073.769.254.258.6
TerminalBench 2.180.282.174.670.378.2
LiveCodeBench92.993.287.888.585.3
LiveCodeBench Pro87.890.884.882.988.4
Humanity’s Last Exam47.250.049.844.441.4
CharXiv Reasoning85.186.684.283.384.1
GPQA-D95.595.592.094.393.6
SciCode60.158.753.558.956.1
τ³ Banking21.720.620.68.420.6
Long-Context Reasoning74.773.367.772.774.3
MRCRv286.693.687.984.994.8
ApproachMulti-LLM orchestrationMulti-LLM orchestrationSingle modelSingle modelSingle model
Context windowNot disclosedNot disclosedNot disclosedNot disclosedNot disclosed
ModalitiesNot disclosedNot disclosedNot disclosedNot disclosedNot disclosed
Availability / pricingSubscription plans plus usage-based billing (prices not disclosed)Subscription plans plus usage-based billing (prices not disclosed)Not disclosedNot disclosedNot disclosed

What it means: on most of the reported tests, Fugu (and Fugu Ultra most of all) lands at or above the best single models. This backs up Sakana’s claim that joining many LLMs can reach top-level results. But a few rows, like MRCRv2, still favor a single model.

The trade-offs

The scores look great, but the report points out real open questions. The biggest one is cost. When you ask several models to answer one question, you may use a lot more computing power than one model would. Inference is the word for running a trained model to get an answer. And every time it runs, it costs money. The report says Sakana has not explained how many tokens Fugu uses or how much it costs to run. (Tokens are the small chunks of text that AI models read and write, and they are how the cost is counted.) So we do not yet know if the team approach is really cheaper.

A few other facts are just not public yet. The context window is not shared. (The context window is how much text the system can read at one time.) The modalities are not shared either. (Modalities are the types of input it can take, like text, images, or audio.) Exact prices are not given. We only know there will be subscription plans for daily use and usage-based billing for bigger jobs. (Usage-based billing means you pay more the more you use it.)

FAQ

What is Sakana AI Fugu in one line?

It is a tool from Sakana AI that runs several existing AI models as one service. The goal is to match the best single models on tests.

How is Fugu different from a normal LLM?

A normal LLM is just one model. Fugu is like a manager. It can call many models, check their work, and join their answers. To you, it still looks like one model.

Does Fugu beat Anthropic’s models?

On many of the reported tests, Fugu and Fugu Ultra score at or above Anthropic’s Opus 4.8. But the results are mixed across tests, and the cost numbers are not shared.

Can developers use Fugu now?

Fugu launched with about 500 beta users. It offers an API that works like OpenAI’s. Subscription and usage-based pricing are planned, but the exact prices were not shared.

Why it matters (especially for India and founders)

For Indian startups and founders, the Fugu idea is good news. Building a brand-new frontier model costs a huge amount of money. It also needs giant data centers. Orchestration offers another way. You take strong models that already exist and make them work together. That makes it easier to build something that can compete. This is great for teams with smart engineers but smaller budgets.

But there is a catch to watch for. If your product calls many models for each request, your costs can climb fast. Indian founders who build on AI should test the cost per task early, not just the quality. The “team of models” path can be powerful. But it only works if the bill stays fair as you grow.

Closing takeaway

Sakana AI Fugu is a clear sign of something new. The next big race in AI may not be only about who builds the biggest single model. Joining many models, sending tasks to the right one, and mixing the results is now a real choice. The reported test scores suggest it can reach top-level results. The big open question is cost. If Fugu can prove it is also cheaper to run, then multi-LLM orchestration could become a normal tool for builders everywhere, including across India.

Reporting and benchmark figures via The Decoder.