Google’s DiffusionGemma Writes Text From Noise, Not Word by Word

Google has made a new AI model called DiffusionGemma. It works in a very different way from most chatbots you know. It is a text diffusion model. That means it is an AI that builds a whole block of text at once. It does this by cleaning up random noise, instead of writing one word after another. The model came out on June 10, 2026. Anyone can download it and use it. Google built it on top of its Gemma 4 family. Nvidia, a chip company, helped make it run fast.

The big idea here is speed. On the right hardware, DiffusionGemma can be up to 4 times faster than normal AI text models. This is true when just one person is using it. That makes it useful for coding tools, autocomplete, and other jobs where waiting for words feels slow.

What does “text diffusion” actually mean?

Most AI chatbots today are “autoregressive.” That is a big word for a simple habit. They write text one token at a time, from left to right. A token is a small chunk of text, like a word or part of a word. Each new token has to wait for the one before it. That is why a chatbot often types out its answer piece by piece.

DiffusionGemma does not wait. It starts with 256 random placeholder tokens. These are basically meaningless noise. Then it goes over all of them at the same time, again and again. Each pass cleans up a bit more of the mess until real, readable text appears. Think of a blurry photo coming into focus. It is not like a sentence being typed letter by letter.
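The clean-up loop can be pictured with a small toy in Python. This is only a sketch, not how DiffusionGemma is built. The placeholder name `<noise>`, the 8 passes, and the helper functions are all made up for illustration. A real model would guess every slot in parallel and keep its most confident guesses. Here the "model" just copies from a known answer so the loop is easy to follow.

```python
import random

BLOCK_SIZE = 256   # block length reported in the article
MASK = "<noise>"   # hypothetical placeholder token (assumption)

def toy_denoise_pass(block, target, passes_left, rng):
    """Stand-in for one model pass: settle a share of the still-noisy slots.

    A real model would predict all slots at once; this toy copies from a
    known target instead, which is an assumption made for illustration.
    """
    noisy = [i for i, tok in enumerate(block) if tok == MASK]
    if not noisy:
        return
    # Ceiling division, so the final pass always clears whatever is left.
    keep = -(-len(noisy) // passes_left)
    for i in rng.sample(noisy, keep):
        block[i] = target[i]

def generate_block(target, passes=8, seed=0):
    """Start from pure noise and refine the whole block over several passes."""
    rng = random.Random(seed)
    block = [MASK] * len(target)
    history = []  # number of noisy slots left after each pass
    for step in range(passes):
        toy_denoise_pass(block, target, passes - step, rng)
        history.append(sum(tok == MASK for tok in block))
    return block, history

target = [f"tok{i}" for i in range(BLOCK_SIZE)]
block, history = generate_block(target)
print(history)  # [224, 192, 160, 128, 96, 64, 32, 0]
```

Each pass touches slots all over the block, not just the next one in line. That is the key difference from left-to-right writing.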

It works on a whole block of 256 tokens together. So every token can “see” every other token, even ones that come later. This is called bidirectional context (the model looks both ways, not just backward). It helps the model do tricky jobs. For example, it can fill a gap in the middle of code or text, not just add to the end.
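The difference between "looking backward only" and "looking both ways" can be shown as a visibility table, often called an attention mask. The sketch below is a generic illustration in Python, not DiffusionGemma's actual code. The function names are made up for this example.

```python
def causal_mask(n):
    # Position i may look at position j only if j is itself or earlier.
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    # Every position may look at every other position in the block.
    return [[True] * n for _ in range(n)]

def visible_positions(mask, i):
    # List the positions that position i is allowed to see.
    return [j for j, allowed in enumerate(mask[i]) if allowed]

n = 6
print(visible_positions(causal_mask(n), 2))         # [0, 1, 2]
print(visible_positions(bidirectional_mask(n), 2))  # [0, 1, 2, 3, 4, 5]
```

With the second mask, a token in the middle of a gap can use the text that comes after it. That is what makes gap-filling possible.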

How big is the model and how fast is it?

DiffusionGemma has 26 billion total parameters. A parameter is one of the billions of settings an AI learns while it is trained. More parameters usually means more knowledge. But the model does not use all of them at once. It uses a “mixture-of-experts” (MoE) design.

MoE means the model is split into many smaller “expert” parts. Only the useful parts turn on for each step. So the model holds 26 billion parameters, but only about 3.8 billion are working at each step. This keeps it fast and cheaper to run. When it is squeezed down to use less memory, it needs about 18 GB of VRAM. (VRAM is the fast memory on a graphics card.) So it can fit on a high-end home GPU (graphics card).
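A common way to build an MoE layer is "top-k routing": a small router scores every expert, and only the best few run. The Python sketch below shows that general idea. It is an assumption that DiffusionGemma routes this way. The article does not say how many experts it has or how many turn on, so the four toy experts and `k=2` here are made up.

```python
import math

def softmax(scores):
    # Turn raw router scores into probabilities that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_scores, k):
    """Pick the k highest-scoring experts and renormalise their weights."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}

def moe_layer(x, experts, router_scores, k=2):
    weights = route_top_k(router_scores, k)
    # Only the chosen experts run; the others are skipped entirely.
    return sum(w * experts[i](x) for i, w in weights.items())

# Four toy "experts" that just scale their input (hypothetical).
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
scores = [0.1, 2.0, 0.3, 1.5]  # made-up router scores

weights = route_top_k(scores, k=2)
output = moe_layer(1.0, experts, scores, k=2)
print(sorted(weights))  # [1, 3]

# Share of parameters working per step, from the article's figures.
active_share = 3.8e9 / 26e9
print(f"{active_share:.1%}")  # 14.6%
```

Skipping most experts is why a model can hold 26 billion parameters while only about 3.8 billion do work at each step.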

The Decoder, a tech news site, says the speed numbers are strong. On an Nvidia H100 chip, it hits about 1,000 tokens per second for one request. On a powerful DGX Station, it reaches up to 2,000 tokens per second. Even a gaming card, the GeForce RTX 5090, does over 700 tokens per second.
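To get a feel for these numbers, here is a rough Python calculation of how long one 256-token block would take at each reported speed. It is simple division and assumes the speed stays steady, which real systems may not do. The reported figures are also loose ("about", "up to", "over"), so treat the results as ballpark values.

```python
BLOCK_TOKENS = 256  # block size reported in the article

# Reported single-request speeds, in tokens per second.
reported_speed = {
    "H100": 1000,         # "about"
    "DGX Station": 2000,  # "up to"
    "RTX 5090": 700,      # "over"
}

def seconds_for(tokens, tokens_per_sec):
    # Time = amount of work divided by rate.
    return tokens / tokens_per_sec

block_time = {name: seconds_for(BLOCK_TOKENS, rate)
              for name, rate in reported_speed.items()}

for name, secs in block_time.items():
    print(f"{name}: {secs:.3f} s per 256-token block")
```

At these speeds a full block arrives in well under half a second on all three machines.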

Key facts at a glance

| Detail | Figure (as reported) |
| --- | --- |
| Release date | June 10, 2026 |
| Total parameters | 26 billion |
| Active parameters per step | 3.8 billion (mixture-of-experts) |
| VRAM needed (quantized) | ~18 GB |
| Block size generated at once | 256 tokens |
| Speed on H100 (single request) | ~1,000 tokens/sec |
| Speed on DGX Station | up to 2,000 tokens/sec |
| Speed on RTX 5090 | over 700 tokens/sec |
| Speed edge vs autoregressive | up to 4x faster (single user) |
| License | Apache 2.0 (open) |

Benchmarks and specs: DiffusionGemma vs Gemma 4

A benchmark is a standard test. It is used to score and compare AI models. The Decoder says DiffusionGemma gives up some quality to get its big speed gain. It scored lower than Gemma 4 26B on every benchmark they tested. These tests include MMLU, MMLU Pro, AIME 2026, LiveCodeBench v6, GPQA Diamond, and tau2-bench. (These are just the names of common AI tests.) The exact scores were not shared as numbers. But the result is clear: it is faster, but not as smart.

| Specs & benchmarks | DiffusionGemma | Gemma 4 26B |
| --- | --- | --- |
| Generation type | Text diffusion (block of 256) | Autoregressive (word by word) |
| Total parameters | 26 billion | 26 billion |
| Active params/step | 3.8 billion (MoE) | Not reported |
| Generation speed | 1,107 tokens/sec | 303 tokens/sec |
| Speed comparison | ~3.5x faster | Baseline |
| Benchmark quality | Lower on all tested | Higher across the board |
| License | Apache 2.0 | Open (Gemma family) |

What it means: As reported, DiffusionGemma is about 3.5 times faster (1,107 tokens per second vs 303, which works out to roughly 3.65 times). But it gives weaker answers. So it fits jobs where speed matters most, not the hardest thinking tasks.
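The speed gap is easy to check by hand. The short Python sketch below divides the two reported speeds and also shows the wait for a 1,000-token answer. The 1,000-token answer length is just an example picked for illustration.

```python
DIFFUSION_TPS = 1107      # reported tokens/sec, DiffusionGemma
AUTOREGRESSIVE_TPS = 303  # reported tokens/sec, Gemma 4 26B

# How many times faster the diffusion model is.
speedup = DIFFUSION_TPS / AUTOREGRESSIVE_TPS
print(f"Speedup: {speedup:.2f}x")  # Speedup: 3.65x

ANSWER_TOKENS = 1000  # example answer length (assumption)
wait_diffusion = ANSWER_TOKENS / DIFFUSION_TPS
wait_autoregressive = ANSWER_TOKENS / AUTOREGRESSIVE_TPS
print(f"{wait_diffusion:.1f} s vs {wait_autoregressive:.1f} s")  # 0.9 s vs 3.3 s
```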

What is it good at, and where does it struggle?

DiffusionGemma is great at non-linear work (work that is not done in a straight, start-to-end line). Because it sees the whole block at once, it is good at inserting text into the middle. It is good at filling code gaps. It also handles structured data well, like amino acid sequences (the building blocks of proteins). In one fun test, a tuned version solved Sudoku puzzles with 100% accuracy. The basic version made 31 mistakes.

The model also has clear weak spots. Its answers are lower quality than Gemma 4. It is slower in the cloud when many people send requests at the same time. Its speed trick works best for just one user. And on systems with limited memory speed, like Apple Silicon Macs, the speed gain gets much smaller.

Where can people get it?

DiffusionGemma is open under the Apache 2.0 license. (A license is the rulebook for how you may use software.) This one lets developers use it freely, even in products they sell. You can download it from Hugging Face, a website that hosts AI models. It works with popular tools like Hugging Face Transformers, vLLM, and MLX. Support for llama.cpp is planned. For fine-tuning (adjusting the model for your own task), it supports JAX’s Hackable Diffusion, Unsloth, and Nvidia’s NeMo Framework. It can also run through Google’s Gemini Enterprise Agent Platform Model Garden and Nvidia NIM.

FAQ

Is DiffusionGemma free to use?

Yes. It is released under the Apache 2.0 license. You can download it from Hugging Face. This open license lets developers use it and change it, even in their own products.

Is it better than normal AI chatbots?

Not in answer quality. It scored lower than Gemma 4 on every test. Its strength is speed for one user. It is also good at filling gaps in code or text.

What hardware do I need to run it?

When it is squeezed down to use less memory, it needs about 18 GB of VRAM. That fits on a high-end home GPU. One example is an Nvidia GeForce RTX 5090, which runs it at over 700 tokens per second.

Why does it work poorly on Apple Silicon?

Its speed gain needs high memory speed, also called memory bandwidth (how fast data moves in and out of memory). On systems with limited memory speed, like Apple Silicon Macs, that gain shrinks. So it does not feel as fast there.

Why it matters (especially for India and founders)

For Indian founders and students, an open and fast model is a real chance. The Apache 2.0 license means a startup can build coding tools, autocomplete features, or developer products without paying license fees. That lowers the cost of trying new ideas.

Speed also helps the user. Faster text means apps feel snappier. It can also mean lower computing bills when you serve one user at a time. The model is good at filling gaps in code too. That is useful for the kind of developer tools many Indian software teams build. But the trade-off is clear. For hard thinking tasks, a stronger model like Gemma 4 may still be the safer pick.

There is a bigger point too. DiffusionGemma shows that text diffusion is moving out of research labs. It is becoming a real, usable open model. If diffusion models keep getting better, they could change how many AI text products are built.

The takeaway: DiffusionGemma is a fast, open AI model. It builds text from noise instead of one word at a time. It is not the smartest model around. But for speed-first and gap-filling jobs, it gives founders and developers a fresh, free option they can test today.

Source: The Decoder
