Google officially launched Gemini Embedding 2, its first natively multimodal embedding model. Now available in Public Preview via the Gemini API and Vertex AI, this model marks a massive shift from text-only systems to a unified “multimodal” brain.
Unlike traditional pipelines that require converting images to captions or audio to transcripts before processing, Gemini Embedding 2 understands multiple formats in a single, shared embedding space.
Key Capabilities & Modal Limits
The model allows you to map various data types into a single vector space, enabling complex “cross-modal” searches (e.g., using a text query to find a specific moment in a 2-minute video).
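In practice, cross-modal search reduces to nearest-neighbor lookup in that shared space: embed the text query, then rank the stored image, video, and audio vectors by cosine similarity. A minimal sketch with hypothetical, pre-computed toy vectors (a real application would obtain these from the embedding API; the filenames and 4-dimensional vectors here are illustrative only):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical pre-computed embeddings (toy 4-d vectors; the real model
# returns much larger vectors for text, images, video, and audio alike).
index = {
    "cat_photo.png":  np.array([0.9, 0.1, 0.0, 0.1]),
    "dog_clip.mp4":   np.array([0.1, 0.9, 0.2, 0.0]),
    "meow_sound.wav": np.array([0.8, 0.2, 0.1, 0.2]),
}
query_vec = np.array([0.85, 0.15, 0.05, 0.15])  # stand-in embedding of the text "cat"

# Rank every stored item -- regardless of modality -- against the text query.
ranked = sorted(index.items(),
                key=lambda kv: cosine_similarity(query_vec, kv[1]),
                reverse=True)
print([name for name, _ in ranked])  # cat items outrank the dog clip
```

Because all modalities live in one space, the same ranking loop serves text-to-image, text-to-video, and text-to-audio queries without any per-modality conversion step.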
| Modality | Input Limits / Specs |
| --- | --- |
| Text | Up to 8,192 tokens per request. |
| Images | Processes up to 6 images (PNG/JPEG) in a single request. |
| Video | Supports clips up to 120 seconds (MP4/MOV). |
| Audio | Natively embeds audio without needing a text transcript. |
| Documents | Directly embeds PDFs up to 6 pages long. |
| Languages | Supports semantic intent across 100+ languages. |
Advanced Technical Features
- Interleaved Input: You can now pass a mix of modalities (e.g., an image + a text description) in a single request. This helps the AI understand the “nuanced relationship” between a visual and its context.
- Flexible Dimensionality (MRL): Using Matryoshka Representation Learning, the model allows you to scale the output dimensions down from the default 3,072 to 1,536 or 768.
  - Tip: This lets you use smaller vectors for fast, cheap candidate retrieval and full-size vectors only when you need maximum precision.
- RAG & Semantic Search: The model is specifically optimized for Retrieval-Augmented Generation (RAG) and data clustering, simplifying the infrastructure needed for “AI-powered knowledge bases.”
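Matryoshka embeddings can be shortened by keeping only the leading dimensions and re-normalizing, which is what makes the two-stage retrieval tip above cheap to implement. A sketch of that recipe (the 3,072/1,536/768 sizes come from the announcement; the truncate-and-renormalize step is the standard MRL technique, not an official code snippet):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` MRL dimensions and re-normalize to unit length."""
    head = vec[:dim]
    return head / np.linalg.norm(head)

# Stand-in for a full-size model output (real vectors come from the API).
full = np.random.default_rng(0).standard_normal(3072)
full /= np.linalg.norm(full)

fast = truncate_embedding(full, 768)      # small vector for cheap candidate retrieval
precise = truncate_embedding(full, 1536)  # mid-size precision/cost trade-off

print(fast.shape, precise.shape)
```

A typical pipeline stores both sizes: the 768-dimension vectors feed the first-pass nearest-neighbor search, and the full 3,072-dimension vectors re-rank the short candidate list.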
Why This Matters
Previously, developers had to manage separate “file cabinets” (vector indexes) for text and images. If you had a photo of a cat and a text document about cats, the AI might not realize they were the same concept.
Gemini Embedding 2 fixes this: it treats the word “cat,” the sound of a “meow,” and the image of a cat as the same semantic point. This enables a new generation of apps where you can search your entire company's video archives, meeting recordings, and PDF manuals using a single search bar.
How to Get Started
The model is listed in the API as `gemini-embedding-2-preview`. Major vector database partners like Qdrant, ChromaDB, and Weaviate announced day-one support for the new architecture.
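A request looks like any other embedContent call. The helper below sketches the REST request body (field names follow the existing Gemini embedContent API, but verify them against the current docs; the model name is the preview identifier above, and `outputDimensionality` is how the smaller MRL sizes are requested):

```python
def build_embed_request(model: str, text: str, output_dim: int = 3072) -> dict:
    """Build an embedContent-style JSON body (sketch; check fields against the docs)."""
    return {
        "model": f"models/{model}",
        "content": {"parts": [{"text": text}]},
        "outputDimensionality": output_dim,
    }

body = build_embed_request("gemini-embedding-2-preview", "find the cat video", 768)
```

POST this body to the model's `:embedContent` endpoint with your API key; per the interleaved-input feature described earlier, multimodal inputs (inline image, audio, or video data) would presumably travel as additional parts alongside the text part.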

