KeyAudit

Xiaomi Hits 1,000 Tokens/Second on Trillion-Parameter Model Using Commodity GPUs

Xiaomi, working with inference partner TileRT, has achieved more than 1,000 tokens per second (peaking near 1,200) on MiMo-V2.5-Pro-UltraSpeed, its trillion-parameter flagship model. The milestone was reached on a standard 8-GPU commodity node rather than custom chips, which sets it apart from competitors such as Cerebras and Groq that rely on specialized hardware. Two techniques drive the speed: FP4 quantization applied only to the expert layers, which shrinks the memory footprint with near-zero quality loss, and DFlash speculative decoding, which proposes a full block of tokens in one pass instead of one token at a time. The TileRT inference engine keeps the compute pipeline continuously resident on the GPU, avoiding overhead.

In coding benchmarks, MiMo-V2.5-Pro matches Claude Opus, while UltraSpeed delivers roughly 10x the generation speed at 3x the standard MiMo rate. A limited API trial runs June 9-23, with priority given to enterprise developers, and the FP4-DFlash checkpoint is open-sourced on Hugging Face. The result shifts the economics of deploying large-scale AI inference on commodity hardware.
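To make the expert-only FP4 idea concrete, here is a minimal sketch of block-scaled 4-bit quantization. It is an illustration under stated assumptions, not Xiaomi's or TileRT's implementation: the article only says "FP4", so the E2M1 value grid, the block size of 32, the nearest-value rounding, and the layer names (`expert.0.up_proj`, `attention.q_proj`) are all hypothetical choices made for the example.

```python
# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
# Assumption: the article only says "FP4"; E2M1 is one common 4-bit layout.
FP4_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(weights):
    """Map one block of floats to 4-bit codes plus a single shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Scale so the largest magnitude in the block lands on 6.0, the FP4 maximum.
    scale = max_abs / FP4_MAGNITUDES[-1] if max_abs > 0 else 1.0
    codes = []
    for w in weights:
        scaled = abs(w) / scale
        # Nearest representable magnitude; ties go to the smaller one (simplified rounding).
        idx = min(range(8), key=lambda i: abs(FP4_MAGNITUDES[i] - scaled))
        codes.append((8 if w < 0 else 0) | idx)  # bit 3 = sign, bits 0-2 = magnitude index
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate floats from 4-bit codes and the block scale."""
    return [(-1.0 if c & 8 else 1.0) * FP4_MAGNITUDES[c & 7] * scale for c in codes]


def quantize_experts_only(layers, block_size=32):
    """Quantize layers whose (hypothetical) name contains 'expert'; pass the rest through."""
    out = {}
    for name, weights in layers.items():
        if "expert" in name:
            out[name] = [quantize_block(weights[i:i + block_size])
                         for i in range(0, len(weights), block_size)]
        else:
            out[name] = weights  # non-expert layers keep their original precision
    return out


# Toy usage: one non-expert layer, one expert layer.
layers = {"attention.q_proj": [0.1, -0.2], "expert.0.up_proj": [0.0, 0.3, -0.6, 1.2]}
packed = quantize_experts_only(layers)
codes, scale = packed["expert.0.up_proj"][0]
restored = dequantize_block(codes, scale)
```

Each quantized expert weight is stored in 4 bits plus its share of the per-block scale, compared with 16 bits per weight in a 16-bit format. Since the expert layers are the only ones quantized here, the remaining layers keep their original values untouched.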

Key facts

  • Xiaomi's MiMo-V2.5-Pro-UltraSpeed exceeds 1,000 tokens/sec on a trillion-parameter model running on a standard 8-GPU node.
  • FP4 quantization of the expert layers reduces the memory footprint with near-zero quality loss.
  • DFlash speculative decoding proposes an entire block of tokens in a single pass; on average, 6.3 of every 8 proposed tokens are accepted.
  • The TileRT inference engine keeps the compute pipeline continuously resident on the GPU, avoiding overhead.
  • A limited API trial runs June 9-23; the FP4-DFlash checkpoint is open-sourced on Hugging Face.
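The block-proposal idea behind speculative decoding can be sketched in a few lines. This is generic greedy speculative decoding, not DFlash itself: the article does not describe DFlash's acceptance rule, so the exact-match check and the extra token appended by the target model are assumptions, and `toy_target` and `toy_draft` are hypothetical stand-ins for real models. In a real system the target model scores every position of the block in one parallel forward pass; the Python loop below only simulates that.

```python
def verify_block(target_next_token, context, draft_block):
    """Accept the longest draft prefix the target agrees with, then add one target token."""
    accepted = []
    for tok in draft_block:
        expected = target_next_token(context + accepted)
        if tok != expected:
            accepted.append(expected)  # the target's own token replaces the first mismatch
            return accepted
        accepted.append(tok)
    # Whole block accepted: the target contributes one further token.
    accepted.append(target_next_token(context + accepted))
    return accepted


def generate(target_next_token, draft_block_fn, prompt, max_new_tokens, block_size=8):
    """Generate tokens block by block; returns (tokens, number of verification passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new_tokens:
        block = draft_block_fn(out, block_size)
        out.extend(verify_block(target_next_token, out, block))
        passes += 1
    return out[:len(prompt) + max_new_tokens], passes


def toy_target(context):
    """Hypothetical target model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


def toy_draft(context, block_size):
    """Hypothetical drafter: same rule, but wrong whenever the next token should be 0."""
    block, last = [], context[-1]
    for _ in range(block_size):
        nxt = (last + 1) % 10
        if nxt == 0:
            nxt = 5  # deliberate drafting error
        block.append(nxt)
        last = nxt
    return block


tokens, passes = generate(toy_target, toy_draft, prompt=[0], max_new_tokens=19)
```

In this toy run the 19 new tokens are identical to what the target model would produce alone, but they take 3 verification passes instead of 19. As a rough reading of the reported figures, if the 6.3-token average covers everything emitted per verification pass (the source does not say whether an extra target token is counted), 1,000 tokens per second on a single stream corresponds to about 159 target-model passes per second.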
