Episodes

  • Parallel Token Prediction: From ProphetNet to Dependent Multi-Token Generation
    Mar 4 2026
    This episode examines the fundamental latency bottleneck in autoregressive language models: sequential token generation requires one full transformer forward pass per output token, leaving GPU parallelism idle during single-user inference. The episode centers on Draxler et al. (5 co-authors, UC Irvine and Chan Zuckerberg Initiative), whose paper on Parallel Token Prediction landed Christmas Eve 2025 and argues that the independence assumption baked into all prior multi-token schemes is not an acceptable approximation but the actual limiting factor. The paper asks whether multiple tokens can be jointly predicted in a single pass, modeling dependencies among them, without sacrificing the expressiveness that makes autoregressive generation reliable. First author Felix Draxler previously led the Free-form Flows work in 2024, and the normalizing-flow machinery he developed there is central to how the paper solves the dependency problem.

    The episode traces the historical arc carefully. Qi et al. (7 co-authors, Microsoft Research Asia) published ProphetNet in January 2020 — predating GPT-3 by four months — in the encoder-decoder world of seq2seq tasks. Their critique was precise: standard one-step-ahead teacher forcing gives models no incentive to plan ahead, letting local bigram correlations dominate at the expense of long-range coherence. Their answer was n-gram prediction, training the decoder to simultaneously predict tokens at t+1, t+2, and t+3 using parallel heads that did not condition on each other. The independence assumption was already present. When Brown et al. (OpenAI, May 2020) demonstrated that scale and in-context conditioning make the encoder optional, the field shifted to decoder-only architectures, but ProphetNet's core insight migrated cleanly. Gloeckle et al. (FAIR, Meta, April 2024) rebuilt multi-token prediction for decoder-only models using independent output heads, DeepSeek adopted the same approach, and NVIDIA incorporated it into Nemotron 3. The independence assumption migrated with the insight, and Draxler et al. argue that limitation has been compounding ever since.

    The episode situates Parallel Token Prediction against the two main camps attacking inference latency. Speculative decoding — covered across twenty prior episodes — keeps the model's output distribution unchanged by using a small draft model whose proposals a large verifier checks in one batched pass; the latency gain comes entirely from accepted tokens per step. Multi-token prediction is the other camp: train the model itself to emit several tokens at once, collapsing multiple forward passes into one, at the cost of changed model behavior during training. Draxler et al.'s contribution is showing that jointly predicting dependent tokens, using normalizing flows to capture the conditional structure across the prediction horizon, preserves the modeling power that independent-head approaches discard. The episode works through both the architectural mechanics and the theoretical argument, making the case that Parallel Token Prediction resolves the tension that has run from ProphetNet through every independent-head scheme in between.

    Sources:
    1. https://arxiv.org/pdf/2512.21323
    2. https://arxiv.org/pdf/2404.19737v1
    3. https://arxiv.org/pdf/2412.19437
    4. https://arxiv.org/pdf/2512.20856
    5. Better & Faster Large Language Models via Multi-Token Prediction — Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve (Meta), 2024 https://scholar.google.com/scholar?q=Better+&+Faster+Large+Language+Models+via+Multi-Token+Prediction
    6. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads — Tianle Cai, Yuhong Li, Zhengmian Hu, et al., 2024 https://scholar.google.com/scholar?q=Medusa:+Simple+LLM+Inference+Acceleration+Framework+with+Multiple+Decoding+Heads
    7. Lookahead Decoding: Break the Sequential Dependency of LLM Inference — Yichao Fu, Peter Bailis, Ion Stoica, Hao Zhang, 2024 https://scholar.google.com/scholar?q=Lookahead+Decoding:+Break+the+Sequential+Dependency+of+LLM+Inference
    8. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty — Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang, 2024 https://scholar.google.com/scholar?q=EAGLE:+Speculative+Sampling+Requires+Rethinking+Feature+Uncertainty
    9. Fast Inference from Transformers via Speculative Decoding — Yaniv Leviathan, Matan Kalman, Yossi Matias (Google), 2023 https://scholar.google.com/scholar?q=Fast+Inference+from+Transformers+via+Speculative+Decoding
    10. Accelerating Large Language Model Decoding with Speculative Sampling — Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, John Jumper (DeepMind), 2023 https://scholar.google.com/scholar?q=Accelerating+Large+Language+Model+Decoding+with+Speculative+Sampling
    11. Spec-Bench: A Benchmark for Evaluating Speculative Decoding Approaches — Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, ...
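    The draft-and-verify loop described above has a precise acceptance rule that keeps the output distribution exactly the target model's. A minimal sketch of that rule (from Leviathan et al., source 9), using a toy four-token vocabulary whose distributions are illustrative rather than taken from any cited paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_accept(p, q, drafted, rng):
    """One verification step of speculative sampling.

    p: target-model distribution over the vocabulary at this position
    q: draft-model distribution the token was sampled from
    drafted: token id proposed by the draft model
    Returns (accepted, token): `drafted` if accepted, otherwise a fresh
    sample from the residual distribution max(0, p - q), which makes the
    overall output exactly distributed according to p.
    """
    if rng.random() < min(1.0, p[drafted] / q[drafted]):
        return True, drafted
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return False, int(rng.choice(len(p), p=residual))

# Toy 4-token vocabulary: draft and target mostly agree.
q = np.array([0.7, 0.1, 0.1, 0.1])   # draft distribution
p = np.array([0.6, 0.2, 0.1, 0.1])   # target distribution
accepted, token = speculative_accept(p, q, drafted=0, rng=rng)
```

    Whatever the draft model's quality, the combined accept-or-resample step yields tokens distributed exactly as the target model would produce; a better draft only raises the acceptance rate, and with it the latency gain.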
    Less than 1 minute
  • FlashOptim: Optimizers for Memory Efficient Training
    Mar 2 2026
    In this episode, hosts Hal Turing and Dr. Ada Shannon explore the paper "FlashOptim: Optimizers for Memory Efficient Training" by researchers from Databricks AI Research. The discussion centers on innovative techniques that significantly reduce memory usage in neural network training without sacrificing model quality. Key methods such as Optimizer State Quantization, Float Splitting Techniques, and Companded Optimizer State Quantization are unpacked, highlighting their potential to lower memory requirements from 175 GiB to 113 GiB for large models like Llama-3.1-8B. Listeners interested in AI research will find this episode compelling, as it addresses the democratization of AI by making advanced models more accessible to those with limited hardware resources.

    Sources:
    1. https://arxiv.org/pdf/2602.23349
    2. Mixed Precision Training — Paulius Micikevicius et al., 2018 https://scholar.google.com/scholar?q=Mixed+Precision+Training
    3. 8-bit Optimizer States for Memory-Efficient Training — Tim Dettmers et al., 2022 https://scholar.google.com/scholar?q=8-bit+Optimizer+States+for+Memory-Efficient+Training
    4. Parameter-Efficient Transfer Learning for NLP — Xiaoqi Li and Percy Liang, 2021 https://scholar.google.com/scholar?q=Parameter-Efficient+Transfer+Learning+for+NLP
    5. Q-adam-mini: Memory-efficient 8-bit quantized optimizer for large language model training — approximate, 2023 https://scholar.google.com/scholar?q=Q-adam-mini:+Memory-efficient+8-bit+quantized+optimizer+for+large+language+model+training
    6. Memory efficient optimizers with 4-bit states — approximate, 2023 https://scholar.google.com/scholar?q=Memory+efficient+optimizers+with+4-bit+states
    7. ECO: Quantized Training without Full-Precision Master Weights — approximate, 2023 https://scholar.google.com/scholar?q=ECO:+Quantized+Training+without+Full-Precision+Master+Weights
    8. AI Post Transformers: FlashOptim: Optimizers for Memory Efficient Training — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-02_urls_1.mp3
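    The paper's specific methods are not reproduced here, but the generic idea behind optimizer-state quantization, storing int8 codes plus one scale per block instead of a float32 moment tensor (roughly a 4x saving per state), can be sketched as follows; the block size and tensor are illustrative:

```python
import numpy as np

def quantize_blockwise(x, block=64):
    """Blockwise absmax int8 quantization of an optimizer moment:
    keep one float scale per block of 64 values plus int8 codes."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) + 1e-12
    codes = np.round(x / scale * 127).astype(np.int8)
    return codes, scale

def dequantize_blockwise(codes, scale):
    """Recover an approximate float32 tensor before the update step."""
    return (codes.astype(np.float32) / 127) * scale

# A stand-in for an Adam first-moment tensor.
m = np.random.default_rng(0).normal(size=4096).astype(np.float32)
codes, scale = quantize_blockwise(m)
m_hat = dequantize_blockwise(codes, scale).reshape(-1)
```

    Blockwise scales bound the rounding error by the largest value in each 64-element block rather than in the whole tensor, which is why outliers do not wreck precision elsewhere.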
    Less than 1 minute
  • Episode: Regular Fourier Features for Nonstationary Gaussian Processes
    Mar 1 2026
    In this episode, hosts Hal Turing and Dr. Ada Shannon explore the paper "Regular Fourier Features for Nonstationary Gaussian Processes" by Arsalan Jawaid, Abdullah Karatas, and Jörg Seewig. The discussion focuses on the innovative use of regular Fourier features to model nonstationary data in Gaussian processes without relying on traditional probability assumptions. This method offers a computationally efficient way to handle nonstationarity, making it particularly relevant for fields like finance and climate modeling. The episode delves into the challenges and potential applications of this approach, highlighting its significance in providing a flexible framework for complex, real-world data scenarios.

    Sources:
    1. Regular Fourier Features for Nonstationary Gaussian Processes — Arsalan Jawaid, Abdullah Karatas, Jörg Seewig, 2026 http://arxiv.org/abs/2602.23006v1
    2. Random Features for Large-Scale Kernel Machines — Ali Rahimi, Benjamin Recht, 2007 https://scholar.google.com/scholar?q=Random+Features+for+Large-Scale+Kernel+Machines
    3. Spectral Mixture Kernels for Gaussian Processes — Andrew Gordon Wilson, Ryan Prescott Adams, 2013 https://scholar.google.com/scholar?q=Spectral+Mixture+Kernels+for+Gaussian+Processes
    4. Nonstationary Gaussian Process Regression through Latent Inputs — Mauricio A. Álvarez, David Luengo, Neil D. Lawrence, 2009 https://scholar.google.com/scholar?q=Nonstationary+Gaussian+Process+Regression+through+Latent+Inputs
    5. Gaussian Processes for Time-Series Modeling — Carl Edward Rasmussen, Christopher K. I. Williams, 2006 https://scholar.google.com/scholar?q=Gaussian+Processes+for+Time-Series+Modeling
    6. Learning the Kernel Matrix with Semi-Definite Programming — Gert R. G. Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, Michael I. Jordan, 2004 https://scholar.google.com/scholar?q=Learning+the+Kernel+Matrix+with+Semi-Definite+Programming
    7. Deep Kernel Learning — Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, Eric P. Xing, 2016 https://scholar.google.com/scholar?q=Deep+Kernel+Learning
    8. Gaussian Processes for Machine Learning — Carl Edward Rasmussen, Christopher K. I. Williams, 2006 https://scholar.google.com/scholar?q=Gaussian+Processes+for+Machine+Learning
    9. Non-stationary Gaussian Process Regression using Point Estimates of Local Smoothness — Andreas Damianou, Michalis Titsias, Neil Lawrence, 2016 https://scholar.google.com/scholar?q=Non-stationary+Gaussian+Process+Regression+using+Point+Estimates+of+Local+Smoothness
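    As background, the classic random Fourier feature construction of Rahimi & Recht (source 2) can be sketched in a few lines; the paper under discussion swaps the random frequency draws for a regular grid (with the appropriate spectral weighting, which this sketch does not implement):

```python
import numpy as np

def fourier_features(x, freqs):
    """Cosine/sine feature map whose inner products approximate a
    shift-invariant kernel: phi(x) @ phi(y) ~= k(x - y), since
    cos(wx)cos(wy) + sin(wx)sin(wy) = cos(w(x - y))."""
    proj = np.outer(x, freqs)                        # (n, m)
    feats = np.hstack([np.cos(proj), np.sin(proj)])  # (n, 2m)
    return feats / np.sqrt(len(freqs))

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 50)

# For an RBF kernel with unit lengthscale the spectral density is N(0, 1),
# so random frequencies are drawn from a standard normal.
freqs = rng.normal(size=1000)
phi = fourier_features(x, freqs)

K_approx = phi @ phi.T
K_exact = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
```

    The computational appeal for Gaussian processes is that working with phi replaces the O(n^3) exact kernel solve with linear algebra in the fixed feature dimension, which is what makes large or streaming datasets tractable.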
    Less than 1 minute
  • Cognizant - New Work, New World 2026
    Mar 1 2026
    Source URLs:
    - https://www.cognizant.com/en_us/aem-i/document/ai-and-the-future-of-work-report/new-work-new-world-2026-how-ai-is-reshaping-work_new.pdf

    References discussed:
    - The Future of Employment: How Susceptible Are Jobs to Computerization? (Carl Benedikt Frey, Michael A. Osborne, 2013)
    - Artificial Intelligence and Life in 2030 (Peter Stone et al., 2016)
    - The Economics of Artificial Intelligence: An Agenda (Ajay Agrawal, Joshua Gans, Avi Goldfarb, 2019)
    Less than 1 minute
  • MatFormer: Nested Transformer for Elastic Inference
    Feb 28 2026

    In a collaboration between Google DeepMind, the University of Texas at Austin, the University of Washington, and Harvard University, published in December 2024, researchers introduce MatFormer, a novel elastic Transformer architecture designed to improve the efficiency of large-scale foundation models. Unlike traditional models that require independent training for each size, this framework allows a single universal model to provide hundreds of smaller, accurate submodels without any additional training. This is achieved by embedding a nested "matryoshka" structure within the transformer blocks, allowing layers and attention heads to be adjusted based on available compute resources. The authors also propose a Mix’n’Match heuristic to identify the most effective submodel configurations for specific latency or hardware constraints. Their research demonstrates that MatFormer maintains high performance across various tasks, offering improved consistency between large and small models during deployment. Consequently, this approach enhances techniques like speculative decoding and image retrieval while significantly reducing the memory and cost overhead of serving AI models.
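    The nested structure can be illustrated with a toy feed-forward block; the dimensions, initialization, and NumPy framing below are illustrative, not the paper's implementation:

```python
import numpy as np

class NestedFFN:
    """Sketch of the matryoshka idea: one weight matrix whose leading
    slices form smaller, self-contained sub-FFNs. Training optimizes
    all granularities jointly; at inference a submodel is extracted by
    choosing a hidden width per layer (the Mix'n'Match idea), with no
    retraining."""

    def __init__(self, d_model=64, d_hidden=256, rng=None):
        rng = rng or np.random.default_rng(0)
        self.w_in = rng.normal(scale=0.02, size=(d_model, d_hidden))
        self.w_out = rng.normal(scale=0.02, size=(d_hidden, d_model))

    def forward(self, x, width):
        # Use only the first `width` hidden units: the nested submodel.
        h = np.maximum(x @ self.w_in[:, :width], 0.0)   # ReLU
        return h @ self.w_out[:width, :]

ffn = NestedFFN()
x = np.random.default_rng(1).normal(size=(4, 64))
y_small = ffn.forward(x, width=64)    # cheap extracted submodel
y_full = ffn.forward(x, width=256)    # full universal model
```

    Because every submodel is a prefix of the same weights, small and large variants share parameters exactly, which is what gives the consistency between model sizes that the episode highlights.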


    Source:

    2024. MatFormer: Nested Transformer for Elastic Inference. Google DeepMind, University of Texas at Austin, University of Washington, Harvard University. Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, Prateek Jain. https://arxiv.org/pdf/2310.07707

    20 mins
  • Apple's Speculative Streaming: Fast LLM Inference without Auxiliary Models
    Feb 28 2026

    Speculative Streaming is a novel inference method designed to accelerate large language model (LLM) generation without the need for traditional auxiliary "draft" models. By integrating multi-stream attention directly into the target model, the system can perform future n-gram prediction and token verification simultaneously within a single forward pass. This approach eliminates the memory and complexity overhead of managing two separate models, making it exceptionally resource-efficient for hardware with limited capacity. The architecture utilizes tree-structured drafting and parallel pruning to maximize the number of tokens accepted per cycle while maintaining generation quality. Experimental results show speedups ranging from 1.8x to 3.1x across diverse tasks like summarization and structured queries. Ultimately, the method achieves performance comparable to more complex architectures while using significantly fewer additional parameters.
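    The tree-structured drafting and verification described above can be sketched abstractly; the tree layout, the greedy stand-in for the target model, and the token values are all illustrative, not Apple's implementation:

```python
def verify_tree(draft_tree, target_next):
    """Walk a tree of drafted continuations, keeping the path the
    target model agrees with. `draft_tree` maps a token id to its
    subtree of speculated continuations; `target_next(prefix)` is a
    stand-in for the target model's greedy next token given a prefix.
    Returns the accepted tokens plus the target's own correction, so
    every verification pass yields at least one new token."""
    accepted, prefix = [], []
    node = draft_tree
    while True:
        t = target_next(tuple(prefix))
        if t in node:                 # the draft anticipated the target
            accepted.append(t)
            prefix.append(t)
            node = node[t]
        else:                         # mismatch: take the target's token
            accepted.append(t)
            return accepted

# Toy deterministic target: it always continues 7 -> 8 -> 9 -> ...
target = lambda prefix: 7 + len(prefix)

# The draft speculated two branches after 7; one matches the target.
tree = {7: {8: {11: {}}, 9: {}}}
out = verify_tree(tree, target)   # accepts 7 and 8, then corrects to 9
```

    Branching the draft tree trades extra (parallel, cheap) verification work for a higher chance that some path matches the target, which is how tokens accepted per cycle are maximized.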


    Source:

    February 2024. Speculative Streaming: Fast LLM Inference without Auxiliary Models. Apple. Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, Mahyar Najibi. https://arxiv.org/pdf/2402.11131

    17 mins
  • Apple's Mirror Speculative Decoding: Parallel LLM Inference via Heterogeneous Accelerators
    Feb 28 2026

    In December 2025, Apple researchers introduced Mirror Speculative Decoding (Mirror-SD), an advanced inference algorithm designed to accelerate large language models by overcoming the sequential bottlenecks of standard decoding. Traditional methods are often limited by the time it takes for a small draft model to suggest tokens before a larger target model can verify them. Mirror-SD breaks this barrier by running the draft and target models in parallel across heterogeneous hardware, specifically utilizing both GPUs and NPUs. This system allows the target model to begin verification while the draft model simultaneously predicts multiple future paths. By employing speculative streaming and early-exit signals, the framework effectively hides the latency of draft generation. Experimental results demonstrate that this approach achieves wall-time speedups of up to 5.8x across various tasks without compromising the accuracy of the original model.


    Source:

    December 2025. Mirror Speculative Decoding: Breaking the Serial Barrier in LLM Inference. Apple. Nikhil Bhendawade, Kumari Nishu, Arnav Kundu, Chris Bartels, Minsik Cho, Irina Belousova. https://arxiv.org/pdf/2510.13161

    20 mins
  • EAGLE: Evolution of Lossless Acceleration for LLM Inference
    Feb 28 2026

    The provided documents describe the development and evolution of EAGLE, a high-efficiency framework designed to accelerate Large Language Model (LLM) inference through speculative sampling. By performing autoregression at the feature level rather than the token level and incorporating shifted token sequences to manage sampling uncertainty, the original EAGLE achieves significant speedups while maintaining the exact output distribution of the target model. The technology has progressed into EAGLE-2, which introduces dynamic draft trees, and EAGLE-3, which further enhances performance by fusing multi-layer features and removing feature regression constraints during training. These advancements allow for speedups of up to 6.5x and a doubling of throughput, making them compatible with modern reasoning models and popular serving frameworks like vLLM and SGLang. Overall, the sources highlight a shift toward test-time scaling and more expressive draft models to overcome the inherent slow speeds of sequential text generation.
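    EAGLE's feature-level autoregression can be sketched in miniature; the single-layer draft head, all weights, and all dimensions below are illustrative placeholders rather than the paper's trained components:

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 32, 100

# Stand-ins for the target model's pieces that the draft head reuses.
lm_head = rng.normal(scale=0.1, size=(d, vocab))  # shared with target
embed = rng.normal(scale=0.1, size=(vocab, d))    # token embeddings

# The draft head autoregresses in *feature* space, conditioned on the
# previous hidden feature and, to resolve sampling uncertainty, the
# embedding of the token actually sampled (the shifted token sequence).
w_draft = rng.normal(scale=0.1, size=(2 * d, d))

def draft_step(feature, sampled_token):
    """Predict the next hidden feature, then propose a token by mapping
    it through the target's own LM head."""
    inp = np.concatenate([feature, embed[sampled_token]])
    next_feature = np.tanh(inp @ w_draft)
    logits = next_feature @ lm_head
    return next_feature, int(np.argmax(logits))

# Draft a 3-token continuation from one target-model feature.
feature = rng.normal(size=d)
token = 42
drafted = []
for _ in range(3):
    feature, token = draft_step(feature, token)
    drafted.append(token)
```

    Drafting in feature space lets the small head reuse the target's embedding and output layers, which is a large part of why EAGLE's draft model is so cheap relative to a separate draft LLM.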


    Sources:

    1) January 26, 2024

    EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. Peking University, Microsoft Research, University of Waterloo, Vector Institute. Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang. https://arxiv.org/pdf/2401.15077


    2) November 12, 2024

    EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees. Peking University, Microsoft Research, University of Waterloo, Vector Institute. Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang. https://aclanthology.org/2024.emnlp-main.422.pdf


    3) April 23, 2025

    EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test. Peking University, Microsoft Research, University of Waterloo, Vector Institute. Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang. https://arxiv.org/pdf/2503.01840


    4) September 17, 2025

    An Introduction to Speculative Decoding for Reducing Latency in AI Inference. NVIDIA. Jamie Li, Chenhan Yu, Hao Guo. https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/

    19 mins