AI Post Transformers

Written by: mcgrof

About this listen

AI-generated podcast where hosts Hal Turing and Dr. Ada Shannon discuss the latest research papers and reports in machine learning, AI systems, and optimization. Featuring honest critical analysis, proper citations, and nerdy humor.
Episodes
  • Parallel Token Prediction: From ProphetNet to Dependent Multi-Token Generation
    Mar 4 2026
    This episode examines the fundamental latency bottleneck in autoregressive language models: sequential token generation requires one full transformer forward pass per output token, leaving GPU parallelism idle during single-user inference. The episode centers on Draxler et al. (5 co-authors, UC Irvine and the Chan Zuckerberg Initiative), whose paper on Parallel Token Prediction landed on Christmas Eve 2025 and argues that the independence assumption baked into all prior multi-token schemes is not an acceptable approximation but the actual limiting factor. The paper asks whether multiple tokens can be jointly predicted in a single pass, modeling the dependencies among them, without sacrificing the expressiveness that makes autoregressive generation reliable. First author Felix Draxler previously led the Free-form Flows work in 2024, and the normalizing-flow machinery he developed there is central to how the paper solves the dependency problem.
    The episode traces the historical arc carefully. Qi et al. (7 co-authors, Microsoft Research Asia) published ProphetNet in January 2020, four months before GPT-3, in the encoder-decoder world of seq2seq tasks. Their critique was precise: standard one-step-ahead teacher forcing gives models no incentive to plan ahead, letting local bigram correlations dominate at the expense of long-range coherence. Their answer was n-gram prediction: training the decoder to simultaneously predict the tokens at t+1, t+2, and t+3 using parallel heads that did not condition on each other. The independence assumption was already present. When Brown et al. (OpenAI, May 2020) demonstrated that scale and in-context conditioning make the encoder optional, the field shifted to decoder-only architectures, but ProphetNet's core insight migrated cleanly: Gloeckle et al. (FAIR, Meta, April 2024) rebuilt multi-token prediction for decoder-only models using independent output heads, DeepSeek adopted the same approach, and NVIDIA incorporated it into Nemotron 3. The independence assumption migrated with the insight, and Draxler et al. argue that this limitation has been compounding ever since.
    The episode situates Parallel Token Prediction against the two main camps attacking inference latency. Speculative decoding, covered across twenty prior episodes, keeps the model's output distribution unchanged by using a small draft model whose proposals a large verifier checks in one batched pass; the latency gain comes entirely from the number of accepted tokens per step. Multi-token prediction is the other camp: train the model itself to emit several tokens at once, collapsing multiple forward passes into one, at the cost of changed model behavior during training. Draxler et al.'s contribution is showing that jointly predicting dependent tokens, using normalizing flows to capture the conditional structure across the prediction horizon, preserves the modeling power that independent-head approaches discard. The episode works through both the architectural mechanics and the theoretical argument, making the case that Parallel Token Prediction resolves the tension that has run from ProphetNet through every independent-head scheme in between.
    Sources:
    1. https://arxiv.org/pdf/2512.21323
    2. https://arxiv.org/pdf/2404.19737v1
    3. https://arxiv.org/pdf/2412.19437
    4. https://arxiv.org/pdf/2512.20856
    5. Better & Faster Large Language Models via Multi-Token Prediction — Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve (Meta), 2024 https://scholar.google.com/scholar?q=Better+&+Faster+Large+Language+Models+via+Multi-Token+Prediction
    6. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads — Tianle Cai, Yuhong Li, Zhengmian Hu, et al., 2024 https://scholar.google.com/scholar?q=Medusa:+Simple+LLM+Inference+Acceleration+Framework+with+Multiple+Decoding+Heads
    7. Lookahead Decoding: Break the Sequential Dependency of LLM Inference — Yichao Fu, Peter Bailis, Ion Stoica, Hao Zhang, 2024 https://scholar.google.com/scholar?q=Lookahead+Decoding:+Break+the+Sequential+Dependency+of+LLM+Inference
    8. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty — Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang, 2024 https://scholar.google.com/scholar?q=EAGLE:+Speculative+Sampling+Requires+Rethinking+Feature+Uncertainty
    9. Fast Inference from Transformers via Speculative Decoding — Yaniv Leviathan, Matan Kalman, Yossi Matias (Google), 2023 https://scholar.google.com/scholar?q=Fast+Inference+from+Transformers+via+Speculative+Decoding
    10. Accelerating Large Language Model Decoding with Speculative Sampling — Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, John Jumper (DeepMind), 2023 https://scholar.google.com/scholar?q=Accelerating+Large+Language+Model+Decoding+with+Speculative+Sampling
    11. Spec-Bench: A Benchmark for Evaluating Speculative Decoding Approaches — Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, ...
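    The speculative-decoding camp the episode describes can be made concrete with a small sketch. The block below is an illustrative implementation of the standard accept/resample rule from Leviathan et al. (2023), not code from any paper discussed here; the 4-token vocabulary and the draft/target distributions are toy assumptions.

```python
import numpy as np

# Sketch of the speculative-decoding acceptance rule (Leviathan et al., 2023).
# A draft model proposes token x ~ q; the verifier accepts it with probability
# min(1, p[x] / q[x]) and otherwise resamples from the residual distribution
# norm(max(p - q, 0)). This keeps the emitted tokens distributed exactly as
# the target p, so the speedup comes only from how many drafts get accepted.

def speculative_accept(p, q, x, u, rng):
    """Emit a token given target probs p, draft probs q, proposed token x,
    and a uniform draw u in [0, 1)."""
    if u < min(1.0, p[x] / q[x]):
        return x                           # draft token accepted as-is
    residual = np.maximum(p - q, 0.0)      # rejected: sample the residual
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual))

p = np.array([0.5, 0.3, 0.1, 0.1])        # verifier (target) distribution
q = np.array([0.25, 0.25, 0.25, 0.25])    # uniform draft distribution
rng = np.random.default_rng(0)

accepted = speculative_accept(p, q, x=0, u=0.99, rng=rng)   # p/q = 2.0: always kept
resampled = speculative_accept(p, q, x=2, u=0.9, rng=rng)   # p/q = 0.4: rejected here
```

    As the episode notes, the latency gain of this camp comes entirely from accepted tokens per verifier pass; the output distribution is provably unchanged.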
  • FlashOptim: Optimizers for Memory Efficient Training
    Mar 2 2026
    In this episode, hosts Hal Turing and Dr. Ada Shannon explore the paper "FlashOptim: Optimizers for Memory Efficient Training" by researchers from Databricks AI Research. The discussion centers on techniques that significantly reduce memory usage in neural-network training without sacrificing model quality. Key methods such as Optimizer State Quantization, Float Splitting Techniques, and Companded Optimizer State Quantization are unpacked, highlighting their potential to lower memory requirements from 175 GiB to 113 GiB for large models like Llama-3.1-8B. Listeners interested in AI research will find this episode compelling, as it addresses the democratization of AI by making advanced models more accessible to those with limited hardware resources.
    Sources:
    1. https://arxiv.org/pdf/2602.23349
    2. Mixed Precision Training — Paulius Micikevicius et al., 2018 https://scholar.google.com/scholar?q=Mixed+Precision+Training
    3. 8-bit Optimizer States for Memory-Efficient Training — Tim Dettmers et al., 2022 https://scholar.google.com/scholar?q=8-bit+Optimizer+States+for+Memory-Efficient+Training
    4. Parameter-Efficient Transfer Learning for NLP — Xiaoqi Li and Percy Liang, 2021 https://scholar.google.com/scholar?q=Parameter-Efficient+Transfer+Learning+for+NLP
    5. Q-adam-mini: Memory-efficient 8-bit quantized optimizer for large language model training — approximate, 2023 https://scholar.google.com/scholar?q=Q-adam-mini:+Memory-efficient+8-bit+quantized+optimizer+for+large+language+model+training
    6. Memory efficient optimizers with 4-bit states — approximate, 2023 https://scholar.google.com/scholar?q=Memory+efficient+optimizers+with+4-bit+states
    7. ECO: Quantized Training without Full-Precision Master Weights — approximate, 2023 https://scholar.google.com/scholar?q=ECO:+Quantized+Training+without+Full-Precision+Master+Weights
    8. AI Post Transformers: FlashOptim: Optimizers for Memory Efficient Training — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-02_urls_1.mp3
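    Optimizer-state quantization, one of the techniques named above, can be sketched in a few lines. The block below shows generic per-block absmax int8 quantization of a toy Adam moment buffer, in the spirit of Dettmers et al.'s 8-bit optimizers; it is not FlashOptim's actual method, and the buffer size and block size are illustrative assumptions.

```python
import numpy as np

# Sketch of blockwise absmax quantization of an optimizer state buffer:
# each block of 4 fp32 values is stored as int8 codes plus one fp scale,
# cutting the state's own footprint by 4x (plus a small scale overhead).

def quantize_blockwise(state, block=4):
    """Quantize an fp32 state vector to int8 codes with per-block scales."""
    blocks = state.reshape(-1, block)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)      # avoid divide-by-zero blocks
    codes = np.round(blocks / scale).astype(np.int8)
    return codes, scale

def dequantize_blockwise(codes, scale):
    """Recover an approximate fp state from int8 codes and per-block scales."""
    return (codes.astype(np.float32) * scale).ravel()

rng = np.random.default_rng(0)
m = rng.normal(size=64).astype(np.float32)        # toy Adam first-moment buffer
codes, scale = quantize_blockwise(m)
m_hat = dequantize_blockwise(codes, scale)
max_err = float(np.abs(m - m_hat).max())          # bounded by block absmax / 254
```

    The companding the episode mentions would replace the uniform rounding step with a nonlinear code spacing that spends more levels near zero, where moment values cluster; the uniform version above is the simplest baseline.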
  • Episode: Regular Fourier Features for Nonstationary Gaussian Processes
    Mar 1 2026
    In this episode, hosts Hal Turing and Dr. Ada Shannon explore the paper "Regular Fourier Features for Nonstationary Gaussian Processes" by Arsalan Jawaid, Abdullah Karatas, and Jörg Seewig. The discussion focuses on the innovative use of regular Fourier features to model nonstationary data in Gaussian processes without relying on traditional probability assumptions. This method offers a computationally efficient way to handle nonstationarity, making it particularly relevant for fields like finance and climate modeling. The episode delves into the challenges and potential applications of this approach, highlighting its significance in providing a flexible framework for complex, real-world data scenarios.
    Sources:
    1. Regular Fourier Features for Nonstationary Gaussian Processes — Arsalan Jawaid, Abdullah Karatas, Jörg Seewig, 2026 http://arxiv.org/abs/2602.23006v1
    2. Random Features for Large-Scale Kernel Machines — Ali Rahimi, Benjamin Recht, 2007 https://scholar.google.com/scholar?q=Random+Features+for+Large-Scale+Kernel+Machines
    3. Spectral Mixture Kernels for Gaussian Processes — Andrew Gordon Wilson, Ryan Prescott Adams, 2013 https://scholar.google.com/scholar?q=Spectral+Mixture+Kernels+for+Gaussian+Processes
    4. Nonstationary Gaussian Process Regression through Latent Inputs — Mauricio A. Álvarez, David Luengo, Neil D. Lawrence, 2009 https://scholar.google.com/scholar?q=Nonstationary+Gaussian+Process+Regression+through+Latent+Inputs
    5. Gaussian Processes for Time-Series Modeling — Carl Edward Rasmussen, Christopher K. I. Williams, 2006 https://scholar.google.com/scholar?q=Gaussian+Processes+for+Time-Series+Modeling
    6. Learning the Kernel Matrix with Semi-Definite Programming — Gert R. G. Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, Michael I. Jordan, 2004 https://scholar.google.com/scholar?q=Learning+the+Kernel+Matrix+with+Semi-Definite+Programming
    7. Deep Kernel Learning — Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, Eric P. Xing, 2016 https://scholar.google.com/scholar?q=Deep+Kernel+Learning
    8. Gaussian Processes for Machine Learning — Carl Edward Rasmussen, Christopher K. I. Williams, 2006 https://scholar.google.com/scholar?q=Gaussian+Processes+for+Machine+Learning
    9. Non-stationary Gaussian Process Regression using Point Estimates of Local Smoothness — Andreas Damianou, Michalis Titsias, Neil Lawrence, 2016 https://scholar.google.com/scholar?q=Non-stationary+Gaussian+Process+Regression+using+Point+Estimates+of+Local+Smoothness
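    The random-feature baseline this paper builds on (Rahimi & Recht, 2007, source 2 above) approximates a stationary RBF kernel with cosine features. The sketch below shows that classic stationary construction, not the paper's regular, nonstationary variant; the input dimension, lengthscale, and feature count are illustrative assumptions.

```python
import numpy as np

# Classic random Fourier features (Rahimi & Recht, 2007): for the RBF kernel
# k(x, y) = exp(-||x - y||^2 / (2 * ls^2)), sample frequencies W from the
# kernel's spectral density N(0, I / ls^2) and phases b ~ U[0, 2*pi]; then
# z(x) = sqrt(2/D) * cos(W x + b) satisfies z(x) @ z(y) ~= k(x, y).

def rff(X, W, b):
    """Map inputs (n, d) to random Fourier features (n, D)."""
    D = W.shape[0]
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

rng = np.random.default_rng(0)
d, D, ls = 2, 5000, 1.0                       # input dim, feature count, lengthscale
W = rng.normal(scale=1.0 / ls, size=(D, d))   # spectral samples for the RBF kernel
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

X = rng.normal(size=(5, d))
Z = rff(X, W, b)
K_approx = Z @ Z.T                            # Monte Carlo kernel estimate
K_exact = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / ls**2)
err = float(np.abs(K_approx - K_exact).max()) # shrinks as O(1/sqrt(D))
```

    The paper's "regular" features replace the random frequency samples with a deterministic grid, and its nonstationary extension drops the assumption that the kernel depends only on x - y; this sketch is just the common starting point both modifications depart from.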