Google Releases DiffusionGemma: Open-Weight Model Hits 1,000 Tokens/Sec but Faces Runtime Gaps
Google has released DiffusionGemma, a free open-weight language model that generates 256 tokens in parallel using text diffusion, achieving over 1,000 tokens per second on an NVIDIA H100, four times faster than standard autoregressive models. The model starts with random placeholder tokens and iteratively refines them into coherent text. This approach allows bidirectional attention, which helps with tasks such as code infilling and structured output. Google fine-tuned a version to solve Sudoku with 80% accuracy, compared to 0% for the base model.
However, running DiffusionGemma locally is challenging. It requires a custom drafter module for speculative decoding, which isn't available in public runtimes like mlx-lm or LM Studio. Additionally, while the model supports up to 256K tokens of context, NVIDIA NIM defaulted to 8,192 tokens, causing failures with agentic frameworks like Hermes Agent that require at least 64,000 tokens.
The model is aimed at developers with high-end GPUs (RTX 4090/5090) building real-time tools, and at researchers exploring bidirectional generation. Community toolchain updates are expected in the coming days.
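The refinement process described above, starting from placeholders and filling in tokens over several passes, can be illustrated with a toy sketch. This is not DiffusionGemma's actual algorithm or API: `toy_denoiser` is a hypothetical stand-in for the model that simply knows the target text and attaches a random confidence to each guess, and the "commit the most confident positions each pass" rule is an assumption borrowed from masked-diffusion decoders in general.

```python
import random

MASK = "<mask>"  # placeholder token; the name is illustrative only


def toy_denoiser(tokens, target, rng):
    """Hypothetical stand-in for a model: for every masked position,
    propose a token and a confidence score. A real model would predict
    both from the whole sequence, attending in both directions."""
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(tokens)
        if tok == MASK
    }


def diffusion_decode(target, steps=4, seed=0):
    """Start fully masked and commit the most confident proposals each pass."""
    rng = random.Random(seed)
    tokens = [MASK] * len(target)
    per_step = -(-len(target) // steps)  # ceiling division
    history = []
    for _ in range(steps):
        proposals = toy_denoiser(tokens, target, rng)
        if not proposals:
            break  # nothing left to refine
        # Positions can be filled in any order, not strictly left to right.
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in best[:per_step]:
            tokens[i] = proposals[i][0]
        history.append(list(tokens))
    return tokens, history


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog".split()
    final, passes = diffusion_decode(words, steps=4)
    for n, snapshot in enumerate(passes, start=1):
        print(n, " ".join(snapshot))
```

Each printed pass shows more positions resolved, in no particular order, which is the property that makes infilling a natural fit: text on both sides of a gap can already be present while the gap itself is still being refined.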
Key facts
- DiffusionGemma generates 256 tokens in parallel, hitting over 1,000 tokens/sec on NVIDIA H100.
- It uses text diffusion with bidirectional attention, improving code infilling and structured outputs.
- Google fine-tuned a version for Sudoku, achieving 80% accuracy vs 0% for the base model.
- The model requires a custom drafter module that is not available in public runtimes such as mlx-lm or LM Studio.
- NVIDIA NIM defaults to an 8,192-token context, blocking agentic frameworks like Hermes Agent that need at least 64,000 tokens.
- Target audience: developers with RTX 4090/5090 GPUs and researchers in bidirectional generation.
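The context-window mismatch noted above is easy to catch before launching an agent. The following is a minimal sketch of such a preflight check; the function name and constants are hypothetical and only encode the figures reported here (an 8,192-token default and a 64,000-token minimum), not any real NIM or Hermes Agent setting.

```python
# Figures taken from the report above; the names are illustrative.
SERVER_DEFAULT_CONTEXT = 8_192
AGENT_MIN_CONTEXT = 64_000


def check_context_window(configured_tokens: int, required_tokens: int) -> None:
    """Raise early if the serving context is too small for the agent."""
    if configured_tokens < required_tokens:
        shortfall = required_tokens - configured_tokens
        raise ValueError(
            f"context window of {configured_tokens:,} tokens is "
            f"{shortfall:,} short of the required {required_tokens:,}"
        )


if __name__ == "__main__":
    try:
        check_context_window(SERVER_DEFAULT_CONTEXT, AGENT_MIN_CONTEXT)
    except ValueError as err:
        print(f"preflight failed: {err}")
```

Failing fast with an explicit message is preferable to letting the agent framework surface an opaque truncation or request error mid-run.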