Sakana AI Launches Sakana Fugu: A Model That Routes Tasks Across a Swappable Pool of LLMs
Japan’s Sakana AI has launched a new kind of AI system called Sakana Fugu. Instead of being one model that answers everything, Sakana Fugu acts like a smart manager. It routes each task to the best AI model in a pool, then combines the results into one answer. An LLM, or large language model, is the type of AI that powers chatbots and coding tools. Fugu can call many frontier LLMs, and even call copies of itself.
This is a fresh idea. Most AI products pick one model and stick with it. Sakana Fugu treats models like a team. It decides who should handle each job, checks their work, and merges the best parts. Here is how it works and how it scores against top rivals.
What Sakana Fugu Actually Is
Sakana Fugu is an “orchestration model.” Orchestration means coordinating many parts to work together, like a conductor leading a band. Fugu sits behind a single API endpoint. An API is a doorway that lets apps talk to the AI. Its endpoint works just like OpenAI’s, so developers can plug it in easily.
Here is the clever part. Fugu is itself a language model, but it is trained to call other models. It manages four things on its own: choosing which model to use, handing off the task, checking the answer, and combining results. It does this without fixed rules or hard-coded roles. It learns the best way to delegate.
The “swappable pool” matters too. The set of models behind Fugu can be changed. So as better models appear, Fugu can use them without a full rebuild.
Two Versions: Fugu and Fugu Ultra
Sakana offers two versions for different needs.
- Fugu — Balances good performance with low latency. Latency is the delay before you get an answer, so low latency means faster replies. It lets you opt out of specific models, which helps with privacy or compliance rules. It suits coding, code review, and chatbots.
- Fugu Ultra — Tuned for the highest quality on hard, multi-step problems. It uses a fixed pool of models with no opt-out. Its current model ID is
fugu-ultra-20260615.
Benchmarks & Specs
A benchmark is a standard test used to compare AI models. Higher scores are generally better. Sakana reports the scores below against top rivals: Anthropic’s Opus 4.8, Google’s Gemini 3.1 Pro, and OpenAI’s GPT 5.5. These figures are as reported by Sakana AI.
| Benchmark (what it tests) | Fugu | Fugu Ultra | Opus 4.8 | Gemini 3.1 Pro | GPT 5.5 |
|---|---|---|---|---|---|
| SWE Bench Pro (real coding fixes) | 59.0 | 73.7 | 69.2 | 54.2 | 58.6 |
| TerminalBench 2.1 (terminal tasks) | 80.2 | 82.1 | 74.6 | 70.3 | 78.2 |
| LiveCodeBench (live coding) | 92.9 | 93.2 | 87.8 | 88.5 | 85.3 |
| LiveCodeBench Pro | 87.8 | 90.8 | 84.8 | 82.9 | 88.4 |
| Humanity’s Last Exam (hard reasoning) | 47.2 | 50.0 | 49.8 | 44.4 | 41.4 |
| CharXiv Reasoning | 85.1 | 86.6 | 84.2 | 83.3 | 84.1 |
| GPQA-D (graduate science Q&A) | 95.5 | 95.5 | 92.0 | 94.3 | 93.6 |
| SciCode (science coding) | 60.1 | 58.7 | 53.5 | 58.9 | 56.1 |
| τ³ Banking (finance tasks) | 21.7 | 20.6 | 20.6 | 8.4 | 20.6 |
| Long Context Reasoning | 74.7 | 73.3 | 67.7 | 72.7 | 74.3 |
| MRCRv2 (long-context recall) | 86.6 | 93.6 | 87.9 | 84.9 | 94.8 |
What it means: Fugu Ultra leads on most coding and reasoning tests, while GPT 5.5 still tops one long-context recall test (MRCRv2). In short, routing tasks across many models can beat any single model on a wide range of jobs.
Real-World Test Results
Sakana also shared results from hands-on tasks. These show Fugu working on real problems, not just exam-style tests.
| Task | Result (as reported) |
|---|---|
| AutoResearch experiments | Best mean validation BPB of 0.9774 across 123 experiments in ~14 hours on one H100 GPU |
| Rubik’s Cube solver | Solved all 300 held-out cubes, averaging 19.72 moves |
| Classical Japanese kana reading | Normalized edit distance (NED) of 0.80 |
| Online trading test | +19.43% average across five 50-week runs |
The system is built on research from two ICLR 2026 papers, called “Trinity” and “Conductor,” which study how to learn good orchestration strategies. Managing how many models talk to each other is partly a memory problem; for a deeper look, see this technical guide to the types of agent memory.
How to Use It
Fugu is available through an OpenAI-compatible API at console.sakana.ai. It works with the standard Python OpenAI client. That means many teams can try it with only small changes to their existing code.
The launch drew mixed early reactions. In a manual review of 12 public posts on June 22, 2026, Sakana noted 3 were supportive, 6 were skeptical, and 3 were critical. So the idea is exciting, but some experts want to see more proof in daily use.
FAQ
What does “orchestration model” mean?
It means a model that coordinates other models. Like a conductor leading a band, Fugu decides which AI should handle each task, checks the work, and combines the results into one answer.
How is Fugu different from a normal chatbot?
A normal chatbot uses one model for everything. Fugu uses a pool of models and routes each task to the best one. It can even call copies of itself for complex jobs.
What is the difference between Fugu and Fugu Ultra?
Fugu is faster and lets you turn off certain models for privacy. Fugu Ultra aims for top quality on hard problems but uses a fixed set of models with no opt-out.
Why it matters (especially for India / founders)
For founders, Fugu points to a smarter way to build with AI. Instead of betting on one model, you can use a system that always picks the best tool for the job. As models improve, a swappable pool keeps you up to date without a costly rebuild.
The opt-out feature is useful for Indian teams that handle sensitive data. You can block certain models to meet privacy or compliance needs. And because Fugu uses an OpenAI-style API, Indian startups can test it quickly with their current code.
The takeaway
Sakana Fugu turns many AI models into one coordinated team. It routes tasks, checks answers, and combines results, and its reported benchmark scores beat top single models on many tests. The early reaction is mixed, but the idea is powerful. If it holds up in real use, “orchestration” could become a common way to build AI products. To see how the wider sector is spending, read about how the FAA is betting $875 million to cut flight delays.