Google releases multi-token prediction drafters for Gemma 4, boosting inference speed up to 3x
Google has released Multi-Token Prediction (MTP) drafters for its Gemma 4 family of open models, enabling up to a 3x inference speedup without degrading output quality. The drafters power speculative decoding: a lightweight drafter model predicts several tokens ahead, and the main model verifies them in a single parallel pass. This bypasses the one-token-at-a-time bottleneck that limits inference speed on consumer hardware. The drafters share the target model's KV cache, avoiding redundant computation, and are available under the Apache 2.0 license on Hugging Face, Kaggle, and Ollama.

The practical upside is significant: a Gemma 4 26B model running on an Nvidia RTX Pro 6000 desktop GPU roughly doubles its tokens per second with MTP enabled, and on Apple Silicon, batch sizes of 4-8 requests reach around 2.2x speedups. That makes local AI tasks such as coding assistants, voice interfaces, and agentic workflows feel responsive on hardware users already own. The approach contrasts with diffusion-based language models, which have struggled to match traditional transformer quality.

For wallet and key holders, the development matters because it makes local AI inference more practical, potentially reducing reliance on cloud services. Running models locally can improve privacy and security for sensitive tasks such as transaction analysis or key management, since data never leaves the machine. The same speedups could also let malicious actors run more sophisticated attacks locally, so the broader security implications are neutral and context-dependent. The key takeaway is that hardware efficiency gains continue to expand what is feasible on consumer devices.
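The draft-then-verify loop behind speculative decoding can be sketched with toy deterministic models. This is a minimal illustration under stated assumptions, not Google's implementation: the `target`/`drafter` callables, the greedy acceptance rule, and all parameter names are hypothetical, and real systems verify all draft positions in one batched forward pass over logits rather than looping.

```python
def speculative_decode(target, drafter, prompt, k=4, max_new=12):
    """Toy greedy speculative decoding.

    target, drafter: callables mapping a token list to the next token.
    The drafter speculates k tokens cheaply; the target then checks them,
    keeping the verified prefix and substituting its own token at the
    first mismatch.
    """
    ctx = list(prompt)
    start = len(ctx)
    while len(ctx) - start < max_new:
        # 1. Drafter speculates k tokens autoregressively (the cheap step).
        draft = []
        for _ in range(k):
            draft.append(drafter(ctx + draft))
        # 2. Target verifies the draft. In a real engine all k positions
        #    are scored in one parallel pass; here we simulate per position.
        accepted = []
        for i in range(k):
            t = target(ctx + draft[:i])
            if t == draft[i]:
                accepted.append(t)   # draft token confirmed by the target
            else:
                accepted.append(t)   # take the target's token instead...
                break                # ...and discard the remaining draft
        ctx.extend(accepted)
    return ctx[start:start + max_new]
```

With greedy verification the output is token-for-token identical to decoding with the target alone; a better drafter only increases how many tokens each target pass confirms at once, which is where the speedup comes from.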
Key facts
- Google released MTP drafters for Gemma 4, delivering up to 3x inference speedup.
- Speculative decoding uses a drafter model to predict multiple tokens in parallel.
- Drafters share the target model's KV cache, avoiding redundant computations.
- Available on Hugging Face, Kaggle, and Ollama under the Apache 2.0 license.
- Real-world speedups: roughly 2x for Gemma 4 26B on an Nvidia RTX Pro 6000, around 2.2x on Apple Silicon at batch sizes of 4-8.