MoonMath AI Open-Sources a HIP Attention Kernel for AMD MI300X That Beats AITER v3

A small AI software team called MoonMath AI has released a free, open-source HIP attention kernel for the AMD MI300X chip. They say it runs faster than AMD’s own AITER v3 on every test shape and every rounding mode they tried. The news was reported by MarkTechPost on June 22, 2026. In plain words, MoonMath built a tiny, very fast program that speeds up the core math step inside modern AI models on AMD’s flagship AI chip.

Let us unpack the jargon first, because the headline is full of it.

What the terms actually mean

A kernel here is a small, highly optimized program that runs directly on a GPU (the chip that does the heavy math for AI). Attention is the central math step in modern AI models like chatbots and image generators; it decides which parts of the input the model should focus on. The AMD MI300X is an AI accelerator chip, meaning a special processor built to train and run AI models. It is AMD’s rival to Nvidia’s GPUs.

HIP is AMD’s C++-based programming interface for writing GPU code (it plays the role that CUDA plays for Nvidia). Open-source means the code is published freely so anyone can read it, use it, or change it. MoonMath released theirs under the MIT license, one of the most permissive open-source licenses, so companies can use it even in commercial products.

AITER v3 is AMD’s own existing library of fast kernels. So MoonMath is claiming its free code beats AMD at AMD’s own game on AMD’s own chip.

What MoonMath actually built

According to the report by MarkTechPost, MoonMath open-sourced a forward attention kernel written in HIP. Notably, it is written in normal HIP code, not hand-tuned assembly (the hardest, lowest-level way to program a chip). That makes it easier for others to read and learn from.

The kernel performs the standard attention math, often written as softmax(QKᵀ/√d)·V, where Q, K and V are the query, key and value matrices and d is the head dimension. The whole computation is fused into a single kernel, so intermediate results do not make extra round trips to memory. It uses bf16 (bfloat16), a compact 16-bit number format that AI models use to save memory and run faster. It targets the MI300X’s gfx942 instruction set and was tested on bare-metal hardware provided by a cloud provider called Hot Aisle.
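
To make the formula concrete, here is a minimal, unoptimized reference sketch of softmax(QKᵀ/√d)·V for a single head in plain Python. It is not MoonMath's code and shares nothing with the fused HIP kernel beyond the math; the function names are illustrative, and everything runs in ordinary Python floats rather than bf16.

```python
import math

def softmax(row):
    # Subtract the row maximum first so exp() cannot overflow.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Single-head attention.

    q: list of Sq query rows, each of length d
    k: list of Sk key rows, each of length d
    v: list of Sk value rows
    Returns Sq output rows. Sq and Sk may differ (cross-attention).
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q_row in q:
        # Scaled dot product of this query against every key.
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                    for j in range(len(v[0]))])
    return out

if __name__ == "__main__":
    q = [[1.0, 0.0]]
    k = [[1.0, 0.0], [1.0, 0.0]]
    v = [[1.0, 2.0], [3.0, 4.0]]
    print(attention(q, k, v))
```

A fused kernel computes the same result without ever storing the full score matrix, which is where most of the memory traffic would otherwise go.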

What it supports, and what it does not

The kernel fixes the head dimension at 128 and accepts two common data layouts, BSHD (batch, sequence, heads, dimension) and BHSD (batch, heads, sequence, dimension), without needing to rearrange the data. It handles any sequence length, including cross-attention, where the query sequence and the key/value sequence can differ in length. But it has clear limits for now: no causal mask, no GQA (grouped-query attention), and no variable-length batching. In short, it is fast but focused, not a do-everything tool yet.
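
The two layouts hold the same numbers in a different memory order. The sketch below shows where one element lands in a flat, contiguous buffer under each layout. The helper names are hypothetical, and the report does not describe how the kernel takes layout or stride arguments, so this only illustrates the indexing.

```python
def offset_bshd(b, s, h, d, S, H, D):
    # Layout order: batch, sequence, head, dimension (dimension varies fastest).
    return ((b * S + s) * H + h) * D + d

def offset_bhsd(b, s, h, d, S, H, D):
    # Layout order: batch, head, sequence, dimension (dimension varies fastest).
    return ((b * H + h) * S + s) * D + d

if __name__ == "__main__":
    S, H, D = 4, 2, 8  # tiny illustrative sizes, not a benchmarked shape
    # Stepping to the next head moves D elements in BSHD but S * D in BHSD.
    print(offset_bshd(0, 0, 1, 0, S, H, D) - offset_bshd(0, 0, 0, 0, S, H, D))
    print(offset_bhsd(0, 0, 1, 0, S, H, D) - offset_bhsd(0, 0, 0, 0, S, H, D))
```

A kernel that understands both orderings can read either buffer in place, which avoids a separate transpose pass before attention.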

The speed numbers

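This is where MoonMath’s claim lives or dies. A rounding mode is the rule used when a result has more digits than the output format can hold, here when a higher-precision value is squeezed into bf16, and different rounding modes can change speed. MoonMath tested three of them: RTNE (round to nearest, ties to even), RTNA (round to nearest, ties away from zero), and RTZ (round toward zero, that is, truncation). A speedup of 1.23× means the baseline takes 1.23 times as long as MoonMath’s kernel: throughput is about 23 percent higher, and the run time is about 19 percent shorter (1 − 1/1.23 ≈ 0.19).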
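
As an illustration of what the three modes do, the sketch below converts a float32 value to a 16-bit bf16 bit pattern by keeping the top 16 bits and deciding, per mode, whether to round the kept part up. It is a simplified model written for this article, not MoonMath's or AITER's conversion code, and it deliberately ignores NaN and Inf inputs.

```python
import struct

def f32_bits(x):
    # Reinterpret a Python float as its float32 bit pattern.
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bf16_to_float(bits16):
    # bf16 is the top half of a float32, so pad with 16 zero bits.
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]

def to_bf16(x, mode):
    """Return the bf16 bit pattern for finite x under the given mode."""
    bits = f32_bits(x)
    upper = bits >> 16      # the 16 bits bf16 keeps
    lower = bits & 0xFFFF   # the 16 bits that get dropped
    half = 0x8000           # exactly halfway between two bf16 values
    if mode == "RTZ":
        round_up = False                      # truncate toward zero
    elif mode == "RTNA":
        round_up = lower >= half              # ties go away from zero
    elif mode == "RTNE":
        # Ties go to the neighbour whose last kept bit is 0 (even).
        round_up = lower > half or (lower == half and (upper & 1) == 1)
    else:
        raise ValueError(f"unknown rounding mode: {mode}")
    # Sign-magnitude encoding: +1 on the pattern moves away from zero.
    return (upper + 1 if round_up else upper) & 0xFFFF

if __name__ == "__main__":
    x = 1.0 + 2.0 ** -8  # exactly halfway between two bf16 values
    for mode in ("RTNE", "RTNA", "RTZ"):
        b = to_bf16(x, mode)
        print(mode, hex(b), bf16_to_float(b))
```

The three modes only disagree on values that fall between two bf16 numbers, which is why a tie case is used in the example.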

Here are the per-shape results reported by MarkTechPost. The shape is written as (Batch, Heads, Sequence length, Dimension), and times are in milliseconds (ms), so lower is better.

Shape (B,H,S,D)	Rounding	MoonMath (ms)	AITER v3 (ms)	Speedup
(2,24,8192,128)	RTNE	3.083	3.792	1.23×
(2,24,16384,128)	RTNE	11.670	14.691	1.26×
(4,16,16384,128)	RTZ	15.055	16.183	1.07×
(2,24,32768,128)	RTNA	44.440	52.363	1.18×
(1,16,131072,128)	RTNE	232.517	269.278	1.16×
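
The speedup column can be checked directly from the two timing columns: it is the AITER v3 time divided by the MoonMath time. The short script below recomputes it for the five rows above and also shows the matching reduction in run time. The timings are copied from the table; the helper names are only illustrative.

```python
# (shape, rounding mode, MoonMath ms, AITER v3 ms, reported speedup)
rows = [
    ("(2,24,8192,128)", "RTNE", 3.083, 3.792, 1.23),
    ("(2,24,16384,128)", "RTNE", 11.670, 14.691, 1.26),
    ("(4,16,16384,128)", "RTZ", 15.055, 16.183, 1.07),
    ("(2,24,32768,128)", "RTNA", 44.440, 52.363, 1.18),
    ("(1,16,131072,128)", "RTNE", 232.517, 269.278, 1.16),
]

def speedup(baseline_ms, kernel_ms):
    # How many times longer the baseline takes than the kernel.
    return baseline_ms / kernel_ms

def time_saved(baseline_ms, kernel_ms):
    # Fraction of the baseline run time that is eliminated.
    return 1.0 - kernel_ms / baseline_ms

if __name__ == "__main__":
    for shape, mode, kernel_ms, aiter_ms, reported in rows:
        s = speedup(aiter_ms, kernel_ms)
        saved = 100 * time_saved(aiter_ms, kernel_ms)
        print(f"{shape:>20} {mode}: {s:.2f}x (reported {reported:.2f}x), "
              f"{saved:.1f}% less time")
```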

Across many shapes, the geometric mean speedups over AITER v3 (the geometric mean is the usual way to average ratios) were 1.18× in RTNE mode, 1.15× in RTNA mode, and 1.08× in RTZ mode. Against another rival, Modular’s MAX, MoonMath reported a bigger gap of 1.44× to 1.49× on average, and up to 1.59× on a single shape.
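
For readers unfamiliar with it, a geometric mean multiplies the n ratios together and takes the n-th root, so a 2× win and a 0.5× loss average to exactly 1×. The sketch below applies it to the three RTNE rows shown in the table. Because the full list of benchmarked shapes is not given in the report, the result is not expected to reproduce the 1.18× figure.

```python
import math

def geometric_mean(ratios):
    # Average in log space: exp(mean(log r)) equals the n-th root of the product.
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be a non-empty list of positive numbers")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

if __name__ == "__main__":
    # AITER v3 time / MoonMath time for the three RTNE rows in the table.
    rtne_speedups = [3.792 / 3.083, 14.691 / 11.670, 269.278 / 232.517]
    print(f"geometric mean of the listed RTNE rows: {geometric_mean(rtne_speedups):.3f}x")
```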

Why faster is not enough: the accuracy promise

Speed means little if the answers drift. MoonMath says its kernel is not just fast but also numerically careful. All three rounding modes match AITER’s own per-mode rules. Every finite output lands within 1 bf16 ULP of AITER. A ULP (unit in the last place) is the gap between two adjacent representable numbers, so two results that are 1 ULP apart differ by at most a single step in the last bit. It even handles NaN and Inf (special “not a number” and “infinity” values) bit-identically, and the results are deterministic, meaning you get the same answer every time.
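
One common way to measure a ULP gap is to map each bit pattern to an integer that increases with the value it encodes and then subtract. The sketch below does this for finite bf16 bit patterns. It is an illustration of the metric only; the report does not describe MoonMath's actual test harness, and the function names are hypothetical.

```python
def ordered(bits16):
    # bf16 is sign-magnitude: for finite values, the low 15 bits grow with
    # magnitude. Negate them for negative numbers so the integers sort like
    # the floats they encode. +0 and -0 both map to 0.
    magnitude = bits16 & 0x7FFF
    return -magnitude if bits16 & 0x8000 else magnitude

def ulp_distance(a_bits, b_bits):
    """Number of representable bf16 steps between two finite bit patterns."""
    return abs(ordered(a_bits) - ordered(b_bits))

def within_one_ulp(a_bits, b_bits):
    return ulp_distance(a_bits, b_bits) <= 1

if __name__ == "__main__":
    one = 0x3F80        # bf16 bit pattern for 1.0
    next_up = 0x3F81    # the next representable value, 1.0078125
    print(ulp_distance(one, next_up), within_one_ulp(one, next_up))
```

A check like this would be run over every finite element of the two outputs, with NaN and Inf compared separately by exact bit pattern.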

The team also tested it in a real workload: a video-generation pipeline (SGLang diffusion running the Wan2.1-T2V-1.3B text-to-video model). End to end, they reported a 1.23× speedup on the MI300X with no visible quality drop.

How developers use it

The kernel installs through pip (Python’s package installer) as moonmath_attention. It launches on the caller’s GPU stream (the work queue the application is already using), so it can overlap with other work in a pipeline, and it accepts simple settings for layout and rounding mode.

Why it matters (especially for India and founders)

Most AI work today runs on Nvidia chips, which are costly and often hard to buy. AMD’s MI300X is one of the few serious alternatives. But hardware alone is not enough; chips need fast, reliable software to be useful. Every fast kernel makes AMD a more practical choice and slowly chips away at one company’s grip on the market.

For Indian startups and founders, this is good news on cost. Running AI models is expensive, and a speedup in the reported 1.07× to 1.26× range on the same hardware can directly cut compute bills, although the saving on a full workload depends on how much of its time is spent in attention. Because the code is MIT-licensed and open, a small Indian team can plug it in for free, with no permission and no royalty. It also shows that a tiny team can outperform a chip giant’s own library, which is an encouraging signal for lean engineering teams everywhere.

Key facts

Item	Detail
Who	MoonMath AI
What	Open-source bf16 forward attention kernel in HIP
Target chip	AMD MI300X (gfx942)
License	MIT (free, commercial-friendly)
Beats	AITER v3 on every tested shape and rounding mode
Avg speedup vs AITER v3	1.18× (RTNE), 1.15× (RTNA), 1.08× (RTZ)
Vs Modular MAX	1.44×–1.49× avg, up to 1.59×
Accuracy	Within 1 bf16 ULP of AITER; deterministic
Reported	MarkTechPost, June 22, 2026

FAQ

What is an attention kernel in simple terms?

It is a small, fast program that runs the attention step, the core math that lets an AI model decide what to focus on. A faster kernel means the whole model runs quicker and cheaper.

Is the MoonMath kernel free to use?

Yes. It is open-source under the MIT license, so anyone can use, change, and ship it, even in paid products. It installs via pip as moonmath_attention.

Does the speed come at the cost of accuracy?

MoonMath says no. Its outputs stay within 1 bf16 ULP of AITER’s, match the same rounding rules, handle NaN and Inf identically, and are deterministic. In a real video-generation test it gave a 1.23× end-to-end speedup with no visible quality loss.

What are the kernel’s main limits?

For now it has no causal mask, no GQA, and no variable-length batching. The head dimension is fixed at 128. So it is fast and accurate, but focused on specific use cases rather than every scenario.

The takeaway

MoonMath AI’s open-source HIP attention kernel shows that AMD’s MI300X can run a key AI math step faster than AMD’s own AITER v3, while staying within one bf16 ULP of AITER’s output. It is free, MIT-licensed, and already tested on a real workload. For anyone building AI on a budget, including India’s growing pool of founders, more strong open software on AMD hardware means more choice and lower bills. The push to chip away at one company’s dominance, much like the broader race in AI models and memory chips, keeps gathering pace.

Sources

MarkTechPost — MoonMath AI open-sources a HIP attention kernel for AMD MI300X that beats AITER v3 on every shape and rounding mode
