Bengaluru-based AI startup Sarvam AI has launched Sarvam Vision, a 3-billion-parameter multimodal model that, on the benchmark figures reported below, outperforms much larger models such as Gemini 3 Pro and GPT-5.2 on specific document intelligence tasks.
While Gemini 3 remains a more powerful general-purpose assistant, Sarvam Vision has claimed the crown for Indic Document Intelligence, specifically in the complex task of Optical Character Recognition (OCR) for Indian languages.

1. Performance Benchmarks: The “Indic” Edge
Sarvam Vision was built specifically to solve “knowledge locked in plain sight”—the billions of scanned government records, historical archives, and handwritten notes in India that global models often struggle to parse accurately.
| Benchmark | Sarvam Vision | Gemini 3 Pro | ChatGPT (GPT-5.2) |
|---|---|---|---|
| olmOCR-Bench (Accuracy) | 84.3% | 81.2% | ~76.0% |
| OmniDocBench v1.5 | 93.28% | 89.5% | 85.1% |
| Indic OCR Support | 22 Languages | Secondary Support | Limited |
- Complex Layouts: Sarvam Vision excels at interpreting visual elements that typically break OCR tools, such as nested tables, trend lines in charts, and mathematical formulas within scanned PDFs.
- Accuracy vs. Scale: Despite being a compact 3B-parameter model, its specialized training on trillions of Indic tokens allows it to deliver higher fidelity for regional scripts (Hindi, Tamil, Telugu, etc.) than the much larger, English-centric frontier models.
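To make the size of the lead concrete, the gaps in the table above can be worked out in percentage points. The short Python sketch below only does that arithmetic on the figures as reported; the dictionary layout and the `margins` helper are illustrative names, and the GPT-5.2 olmOCR-Bench score is listed as approximate (~76.0%), so gaps against it are approximate too.

```python
# Scores copied from the benchmark table above, in percent.
# The GPT-5.2 olmOCR-Bench figure is reported as approximate (~76.0).
scores = {
    "olmOCR-Bench": {
        "Sarvam Vision": 84.3, "Gemini 3 Pro": 81.2, "GPT-5.2": 76.0,
    },
    "OmniDocBench v1.5": {
        "Sarvam Vision": 93.28, "Gemini 3 Pro": 89.5, "GPT-5.2": 85.1,
    },
}


def margins(table, baseline="Sarvam Vision"):
    """Return baseline-minus-competitor gaps in percentage points."""
    result = {}
    for bench, row in table.items():
        result[bench] = {
            model: round(row[baseline] - value, 2)
            for model, value in row.items()
            if model != baseline
        }
    return result


gaps = margins(scores)
for bench, row in gaps.items():
    for model, gap in row.items():
        print(f"{bench}: Sarvam Vision leads {model} by {gap:.2f} points")
```

On these figures the lead over Gemini 3 Pro is a few points on each benchmark, and the lead over GPT-5.2 is roughly eight points.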
2. Technical Innovation: State-Space Architecture
Unlike the standard Transformer architecture used by most competitors, Sarvam Vision utilizes a State-Space Model (SSM) backbone.
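- Inference Efficiency: A state-space model carries a fixed-size state through the sequence, so its cost grows linearly with input length rather than quadratically as with standard self-attention. Combined with its compact 3B size, this makes the model significantly faster and cheaper to run than larger models, and viable for mass-scale government and enterprise digitization projects.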
- Pixel-Level Understanding: Sarvam describes the model as a “knowledge extraction” engine rather than just a text extractor; it “attends to every pixel” to understand the visual logic holding a document together.
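For readers unfamiliar with state-space models, the core idea can be sketched in a few lines of Python. This is a generic textbook diagonal linear recurrence, not Sarvam's implementation: the article does not describe the model's internals, and the function name `ssm_scan` and the parameters `a`, `b`, `c` are assumptions chosen purely for illustration.

```python
def ssm_scan(xs, a, b, c):
    """Run a diagonal linear state-space recurrence over a sequence.

    h_t = a * h_{t-1} + b * x_t   (elementwise over the state)
    y_t = sum(c * h_t)

    The state h has a fixed size, so each step costs the same
    no matter how many inputs came before it.
    """
    h = [0.0] * len(a)
    ys = []
    for x in xs:
        # Update every state channel from its previous value and the input.
        h = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, h, b)]
        # Read out a single output value from the current state.
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys


# Hypothetical toy parameters: two state channels with different decay rates.
a = [0.5, 0.9]
b = [1.0, 1.0]
c = [1.0, 0.5]

# Feed a single impulse followed by zeros and watch the state decay.
ys = ssm_scan([1.0, 0.0, 0.0], a, b, c)
print(ys)
```

The loop touches each input once and keeps only `h` between steps, which is why this family of models scales linearly with sequence length; a self-attention layer would instead compare each new position against all earlier ones.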
3. Part of a “Sovereign AI” Blitz
Sarvam Vision is just one part of a massive February 2026 launch series by the company. Other tools released alongside it include:
- Bulbul V3: A text-to-speech model that was recently preferred over ElevenLabs in a third-party blind listening study, offering more natural-sounding Indic voices at a fraction of the cost.
- Sarvam Dub: An AI system that was used to live-dub the Union Budget 2026 into regional languages for real-time broadcast.
- Sarvam Audio: A speech-to-text model that outperforms Gemini 3 Flash and GPT-4o in transcribing “code-mixed” (Hinglish) conversations.
4. Why This Matters for 2026
The success of Sarvam Vision supports the “Sovereign AI” strategy: that smaller, hyper-localized foundation models can beat “General AI” giants in specialized domains. By providing native support for all 22 languages listed in the Eighth Schedule of India’s Constitution, Sarvam is positioning itself as the primary infrastructure layer for India’s digital public goods.
Conclusion: A New Baseline for India
While you might still use Gemini 3 for creative writing or complex coding, Sarvam Vision now sets the bar, on the benchmarks above, for anyone needing to digitize an Indian tax record or an ancient Malayalam manuscript. For the month of February 2026, the company has made the Document Intelligence APIs free to encourage developers to test this new “Indic-First” reality.


