Inception Labs Launches Mercury 2 Diffusion LLM: 1,000 Tokens/Second, 90% on AIME
Inception Labs introduced Mercury 2 on Thursday, claiming it is the world's fastest reasoning language model. It generates approximately 1,000 tokens per second, far outpacing Anthropic's Claude Haiku 4.5 Reasoning (89 tokens/s) and OpenAI's GPT-5 Mini (71 tokens/s).

That speed invites comparison with Google's DiffusionGemma. Both models use diffusion techniques: they fill a block of text with random placeholder tokens and iteratively remove noise, unlike autoregressive models that write one token at a time. Mercury 2, however, outperforms DiffusionGemma on key benchmarks. On AIME 2026 it scored 90% against DiffusionGemma's 69.1%; on GPQA the gap is narrower (77% vs. 73.2%). Google's own documentation notes that DiffusionGemma trails the standard Gemma 4.

Augment Code reported an 82% latency reduction and a 90% cost cut after switching to Mercury 2. Inception Labs was founded by Stanford professor Stefano Ermon, a pioneer in score-based diffusion techniques. The startup raised $50 million from Nvidia's venture arm, Andrew Ng, and Andrej Karpathy. While promising for speed-sensitive tasks, Mercury 2 is a closed-weight API model, and the ecosystem for diffusion LLMs is still maturing.
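The diffusion idea described above (start from a block of random placeholder tokens, then refine the whole block over a few passes) can be illustrated with a toy sketch. This is not how Mercury 2 or DiffusionGemma is implemented: the function name `toy_denoise`, the tiny vocabulary, and the "oracle" that copies the correct token into place are all hypothetical stand-ins for a trained model's predictions. The point is only that the number of passes is fixed by `steps`, not by the length of the text.

```python
import random

# Hypothetical toy vocabulary used only to fill the block with noise.
NOISE_VOCAB = ["zz", "qq", "xx", "kk"]


def toy_denoise(target, steps=4, seed=0):
    """Toy illustration of block-wise iterative refinement.

    Starts from a block of random placeholder tokens and, on each pass,
    fixes a share of the still-noisy positions. A real diffusion LLM would
    predict those tokens with a neural network; here we simply copy them
    from `target` (an assumption made to keep the sketch self-contained).
    Returns the block after every pass, starting with the pure-noise block.
    """
    rng = random.Random(seed)
    block = [rng.choice(NOISE_VOCAB) for _ in target]
    noisy = list(range(len(target)))  # positions not yet refined
    rng.shuffle(noisy)

    history = [list(block)]
    for step in range(steps):
        remaining = steps - step
        # Ceiling division: spread the remaining positions over the remaining passes.
        k = -(-len(noisy) // remaining)
        for _ in range(k):
            pos = noisy.pop()
            block[pos] = target[pos]  # stand-in for the model's prediction
        history.append(list(block))
    return history


if __name__ == "__main__":
    sentence = "diffusion models refine a whole block at once".split()
    for i, snapshot in enumerate(toy_denoise(sentence, steps=4, seed=1)):
        print(i, " ".join(snapshot))
```

In this sketch an eight-token block is finished in four passes, whereas a token-by-token generator would need eight sequential steps for the same text. Real systems differ in how they choose which positions to update and how many passes they run.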
Key facts
- Mercury 2 generates ~1,000 tokens/s, far faster than Claude Haiku 4.5 Reasoning (89 tokens/s) and GPT-5 Mini (71 tokens/s).
- On AIME 2026, Mercury 2 scores 90% vs DiffusionGemma's 69.1%.
- Augment Code reports 82% latency drop and 90% cost cut using Mercury 2.
- Mercury 2 is a closed-weight API model; DiffusionGemma is open-weight on Hugging Face.
- Inception Labs raised $50M from Nvidia, Andrew Ng, and Andrej Karpathy.
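To put the throughput figures above in perspective, the short calculation below converts tokens per second into relative speedups and into generation time for a response of a given length. The 2,000-token response length is a hypothetical example, and the estimate ignores time to first token, network latency, and any variation in throughput, so it is a rough sketch rather than a benchmark.

```python
# Throughput figures as reported in the article (tokens per second).
RATES = {
    "Mercury 2": 1000,
    "Claude Haiku 4.5 Reasoning": 89,
    "GPT-5 Mini": 71,
}


def speedup(fast_rate, slow_rate):
    """How many times faster `fast_rate` is than `slow_rate`."""
    return fast_rate / slow_rate


def generation_seconds(num_tokens, rate):
    """Seconds to emit `num_tokens` at a constant `rate` (tokens/s)."""
    return num_tokens / rate


if __name__ == "__main__":
    response_tokens = 2000  # hypothetical response length
    base = RATES["Mercury 2"]
    for name, rate in RATES.items():
        secs = generation_seconds(response_tokens, rate)
        print(f"{name}: {secs:.1f} s for {response_tokens} tokens, "
              f"Mercury 2 is {speedup(base, rate):.1f}x as fast")
```

On these numbers Mercury 2 is roughly 11 times as fast as Claude Haiku 4.5 Reasoning and roughly 14 times as fast as GPT-5 Mini.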