Apple's Mirror Speculative Decoding: Parallel LLM Inference via Heterogeneous Accelerators
About this listen
In December 2025, Apple researchers introduced Mirror Speculative Decoding (Mirror-SD), an inference algorithm designed to accelerate large language models by overcoming the sequential bottleneck of standard decoding. Traditional speculative decoding is limited by the time a small draft model takes to propose tokens before the larger target model can verify them. Mirror-SD breaks this serial barrier by running the draft and target models in parallel on heterogeneous hardware, specifically both GPUs and NPUs. This lets the target model begin verification while the draft model simultaneously predicts multiple future paths. By employing speculative streaming and early-exit signals, the framework effectively hides the latency of draft generation. Experiments show wall-time speedups of up to 5.8x across a range of tasks without compromising the accuracy of the original model.
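To make the draft-then-verify idea concrete, here is a minimal sketch of the baseline speculative decoding loop that Mirror-SD builds on. Both "models" are hypothetical toy stand-ins (simple arithmetic sequences), and `speculative_decode`, `draft_model`, and `target_model` are illustrative names, not Apple's API; Mirror-SD's contribution is running the two roles concurrently on separate accelerators rather than serially as shown here.

```python
# Toy sketch of classic (serial) speculative decoding.
# The draft proposes k tokens cheaply; the target verifies them and
# accepts the longest prefix it agrees with, then emits one token itself.

def draft_model(context, k):
    # Hypothetical cheap model: guesses the next k tokens.
    return [(context[-1] + i + 1) % 100 for i in range(k)]

def target_model(context):
    # Hypothetical expensive model: the "ground truth" next token.
    return (context[-1] + 1) % 100

def speculative_decode(context, steps, k=4):
    context = list(context)
    for _ in range(steps):
        proposal = draft_model(context, k)
        accepted = []
        for tok in proposal:
            if target_model(context + accepted) == tok:
                accepted.append(tok)  # draft agreed with target: keep it
            else:
                break  # first disagreement invalidates the rest
        # Always append the target's own next token, so the output is
        # identical to decoding with the target model alone.
        accepted.append(target_model(context + accepted))
        context.extend(accepted)
    return context

print(speculative_decode([0], steps=1, k=4))  # → [0, 1, 2, 3, 4, 5]
```

In this toy the draft always agrees with the target, so each step yields k+1 tokens for one round of verification; the serial bottleneck Mirror-SD attacks is that the target must wait for `draft_model` to finish before verification can start.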
Source:
Mirror Speculative Decoding: Breaking the Serial Barrier in LLM Inference. Nikhil Bhendawade, Kumari Nishu, Arnav Kundu, Chris Bartels, Minsik Cho, Irina Belousova. Apple, December 2025. https://arxiv.org/pdf/2510.13161