Episodes

  • Scaling laws: long context length and in context learning
    Jan 17 2026

    Recent advances in Long Context Language Models (LCLMs) indicate that In-Context Learning (ICL) performance follows predictable power-law scaling relationships: it improves monotonically with context length up to roughly 10 million tokens and is governed by model depth, width, and training data volume. Gemini 1.5, for example, exhibits near-perfect recall and continued log-loss improvement at these extreme context lengths, while theoretical frameworks argue that ICL functions mechanistically as implicit gradient descent, effectively applying low-rank weight updates to the model's MLP layers during inference. Furthermore, as context capacity expands, the need for sophisticated example-selection strategies diminishes; simple random selection combined with data augmentation to fill the context window often yields optimal results, marking a shift from selection optimization to capacity utilization.
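
    The "implicit gradient descent" view has a concrete linear-algebra core: the activation shift that the context contributes at a layer can be rewritten as a rank-1 update to that layer's weight matrix. Below is a minimal NumPy check of that identity; the shapes and variable names are illustrative, not the paper's notation.

    ```python
    import numpy as np

    # Rank-1 "implicit weight update" identity: for a linear map W applied to an
    # activation a(x), the shift delta_a contributed by the in-context examples can be
    # absorbed into delta_W = outer(W @ delta_a, a(x)) / ||a(x)||^2, so that
    # W (a(x) + delta_a) == (W + delta_W) a(x).
    rng = np.random.default_rng(0)
    d_in, d_out = 16, 8

    W = rng.normal(size=(d_out, d_in))      # weight matrix of the layer after attention
    a_x = rng.normal(size=d_in)             # activation for the query without context
    delta_a = rng.normal(size=d_in)         # shift induced by the in-context examples

    delta_W = np.outer(W @ delta_a, a_x) / (a_x @ a_x)   # rank-1 implicit update

    with_context = W @ (a_x + delta_a)        # layer applied to the context-shifted activation
    updated_weights = (W + delta_W) @ a_x     # updated layer applied to the context-free activation

    print(np.allclose(with_context, updated_weights))  # True: the context acts like a weight update
    ```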


    Sources:


    1. **Gemini Team, Google** (2024)

    *Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context*

    https://arxiv.org/pdf/2403.05530


    2. **Jinheon Baek, Sun Jae Lee, Prakhar Gupta, Geunseob (GS) Oh, Siddharth Dalmia, Prateek Kolhar** (2024)

    *Revisiting In-Context Learning with Long Context Language Models*

    https://arxiv.org/pdf/2412.16926


    3. **Jiaheng Liu, Dawei Zhu, Zhiqi Bai, Yancheng He, et al.** (2025)

    *A Comprehensive Survey on Long Context Language Modeling*

    https://arxiv.org/pdf/2503.17407


    4. **Benoit Dherin, Michael Munn, Hanna Mazzawi, Michael Wunder, Javier Gonzalvo** (2025)

    *Learning without training: The implicit dynamics of in-context learning*

    https://arxiv.org/pdf/2507.16003


    5. **Sushant Mehta, Ishan Gupta** (2025)

    *Scaling Laws and In-Context Learning: A Unified Theoretical Framework*

    https://arxiv.org/pdf/2511.06232

    13 mins
  • DeepSeek Engram: Scaling Large Language Models via Conditional Memory Lookup
    Jan 14 2026

    On January 12, 2026, DeepSeek released its paper on **Engram**, a novel AI architecture that incorporates **conditional memory** to optimize how large language models handle information. By utilizing a **lookup mechanism for static patterns**, the architecture separates the model's logical reasoning from its factual knowledge base. This structural shift allows massive models to run on **cheaper hardware** by offloading memory requirements to standard host RAM without sacrificing speed. Research indicates that this approach effectively **increases model depth**, freeing up the system's core processing power for more complex reasoning and long-context tasks. Ultimately, the **Engram** module enables superior performance across coding, math, and general logic compared to traditional architectures. This innovation suggests a future where AI is significantly **more efficient and accessible** through the strategic decoupling of memory and computation.
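
    To make the "conditional memory lookup" idea concrete, here is a hypothetical sketch, not DeepSeek's actual Engram implementation: the table, pattern keys, and gating rule are all assumptions made for illustration. The point is only the structure: a static lookup path held in host RAM, consulted conditionally alongside the usual compute path.

    ```python
    import numpy as np

    # Hypothetical sketch of a conditional memory lookup (NOT DeepSeek's Engram code):
    # static patterns resolve through a hash table kept in ordinary host RAM, and the
    # retrieved vector is gated into the hidden state only when a pattern matches.
    d_model = 64
    rng = np.random.default_rng(1)

    # Large static memory table living in host RAM rather than accelerator memory.
    memory_table = {pattern: rng.normal(size=d_model)
                    for pattern in ["the capital of france", "boiling point of water"]}

    def conditional_memory_lookup(hidden: np.ndarray, recent_tokens: str) -> np.ndarray:
        """Blend a retrieved memory vector into the hidden state if the pattern is known."""
        retrieved = memory_table.get(recent_tokens.lower())
        if retrieved is None:                     # no static pattern matched: pure compute path
            return hidden
        gate = 1.0 / (1.0 + np.exp(-(hidden @ retrieved) / np.sqrt(d_model)))  # scalar gate
        return hidden + gate * retrieved          # conditional residual update from memory

    h = rng.normal(size=d_model)
    print(conditional_memory_lookup(h, "The capital of France").shape)  # (64,)
    ```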


    Source:

    https://github.com/deepseek-ai/Engram/blob/main/Engram_paper.pdf

    14 mins
  • PageANN: Scalable Disk ANNS with Page-Aligned Graphs
    Dec 7 2025

    The research paper presents PageANN, a novel framework engineered to overcome the severe latency and scalability limitations facing existing **disk-based Approximate Nearest Neighbor Search (ANNS)** methods used in vector databases. Current systems suffer from inefficient search paths and a crucial misalignment between logical graph node size and the **physical I/O granularity of Solid-State Drives (SSDs)**. PageANN introduces a core innovation: a **page-node graph structure** that directly maps logical graph nodes to physical SSD pages, significantly shortening I/O traversal paths and maximizing data utility during retrieval. This is supported by a co-designed **disk data layout** that embeds compressed neighbor vectors within each page and a dynamic **memory management strategy** utilizing lightweight indexing for fast query routing. According to experimental results, PageANN consistently **outperforms state-of-the-art techniques**, achieving substantial gains in throughput and latency across diverse datasets and memory constraints while maintaining comparable recall accuracy.
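
    The central design point is aligning one logical graph node with one physical SSD page. The sketch below illustrates that layout idea under assumed sizes (4 KiB pages, float32 node vectors, int8-compressed neighbor vectors); it is not PageANN's actual on-disk format.

    ```python
    import numpy as np

    # Illustrative page-node layout: one logical graph node is packed, together with
    # compressed versions of its neighbors' vectors, into a single SSD-page-sized
    # record so that one 4 KiB read serves an entire hop of the graph traversal.
    PAGE_BYTES = 4096
    DIM = 128

    def pack_page(node_id: int, vector: np.ndarray, neighbor_ids: list[int],
                  neighbor_vecs_int8: np.ndarray) -> bytes:
        """Serialize a node plus its compressed neighbor vectors into one page."""
        header = np.array([node_id, len(neighbor_ids)], dtype=np.int32).tobytes()
        body = (vector.astype(np.float32).tobytes()
                + np.array(neighbor_ids, dtype=np.int32).tobytes()
                + neighbor_vecs_int8.astype(np.int8).tobytes())
        page = header + body
        assert len(page) <= PAGE_BYTES, "node + neighbors must fit in one SSD page"
        return page.ljust(PAGE_BYTES, b"\x00")   # pad to the physical I/O granularity

    vec = np.random.rand(DIM)
    neigh = np.random.randint(-128, 127, size=(8, DIM))
    page = pack_page(0, vec, list(range(1, 9)), neigh)
    print(len(page))  # 4096: one read fetches the node and its candidate neighbors
    ```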


    Source:

    https://arxiv.org/pdf/2509.25487

    14 mins
  • NeurIPS 2025: Homogeneous Keys, Heterogeneous Values
    Dec 4 2025

    This research presents a novel method for efficient long-context modeling in Large Language Models (LLMs) by tackling the quadratic complexity of attention mechanisms through KV cache compression. The core discovery is a fundamental **local KV cache asymmetry**, which reveals that adjacent attention keys exhibit high structural homogeneity, while their associated value vectors possess distinct, heterogeneous distributions. To capitalize on this finding, the authors propose **AsymKV**, a training-free compression framework that shifts information loss from heterogeneous values to homogeneous keys. AsymKV operates by applying **homogeneity-based merging to keys** using a mathematically derived optimal vector, paired with a **lossless value representation scheme** utilizing cardinality-aware normalization to preserve vital information. Extensive empirical results on benchmarks like LongBench, across diverse models such as LLaMA3.1-8B, confirm that **AsymKV consistently surpasses state-of-the-art long-context methods** in terms of accuracy and information retention, offering improved performance with practical inference efficiency.
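
    To see why shifting information loss onto keys is cheap, the toy below merges each window of adjacent keys into a single representative key (here simply their mean, not the paper's derived optimal vector) while leaving the values untouched; when adjacent keys are nearly homogeneous, the attention output barely changes.

    ```python
    import numpy as np

    def softmax(x):
        z = np.exp(x - x.max())
        return z / z.sum()

    def attention_with_merged_keys(q, K, V, window=4):
        """Toy key-merging: one mean key per window of adjacent keys; values stay intact."""
        seq, d = K.shape
        K_rep = K.reshape(-1, window, d).mean(axis=1)          # representative key per window
        scores = np.repeat((K_rep @ q) / np.sqrt(d), window)   # shared score for every position in the window
        return softmax(scores) @ V                             # heterogeneous values are kept losslessly

    rng = np.random.default_rng(0)
    seq, d, window = 128, 64, 4
    q = rng.normal(size=d)
    # Adjacent keys are "homogeneous": one base key per window plus small noise.
    K = np.repeat(rng.normal(size=(seq // window, d)), window, axis=0) + 0.05 * rng.normal(size=(seq, d))
    V = rng.normal(size=(seq, d))

    exact = softmax((K @ q) / np.sqrt(d)) @ V
    approx = attention_with_merged_keys(q, K, V, window)
    print(np.linalg.norm(exact - approx) / np.linalg.norm(exact))  # small relative error
    ```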


    Source:

    https://arxiv.org/pdf/2506.05410

    15 mins
  • NeurIPS 2025: Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
    Nov 29 2025

    The research systematically investigates the effects of integrating various gating mechanisms into the standard softmax attention layer, comparing over thirty configurations across dense and Mixture-of-Experts Large Language Models. The central finding is that applying an elementwise, head-specific sigmoid gate immediately following the Scaled Dot-Product Attention (SDPA) output consistently yields the most substantial improvement in overall performance. This gating method also provides superior training stability, allowing models to converge effectively under larger learning rates and mitigating disruptive loss spikes during optimization. The improved efficacy is attributed to two factors: introducing essential non-linearity into the low-rank attention mapping and generating input-dependent sparse gating scores. Crucially, this sparsity normalizes attention dynamics, eliminating the 'attention sink' problem where initial tokens dominate attention scores, and thereby enables notably better long-context extrapolation. These benefits led to the incorporation of this gated attention design into the Qwen3-Next models.
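
    A single-head NumPy sketch of the winning configuration as described above: an elementwise, input-dependent sigmoid gate applied to the SDPA output before the output projection. Shapes, initialization, and the choice of gate input are illustrative assumptions, not the paper's or Qwen's implementation.

    ```python
    import numpy as np

    def softmax(x, axis=-1):
        z = np.exp(x - x.max(axis=axis, keepdims=True))
        return z / z.sum(axis=axis, keepdims=True)

    def gated_attention_head(X, Wq, Wk, Wv, Wg, Wo):
        """X: [seq, d_model]; single head with an elementwise sigmoid gate after SDPA."""
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        sdpa = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V   # standard scaled dot-product attention
        gate = 1.0 / (1.0 + np.exp(-(X @ Wg)))               # input-dependent, elementwise sigmoid gate
        return (gate * sdpa) @ Wo                            # gate the SDPA output, then project out

    rng = np.random.default_rng(0)
    seq, d_model, d_head = 8, 32, 16
    X = rng.normal(size=(seq, d_model))
    Wq, Wk, Wv, Wg = (rng.normal(size=(d_model, d_head)) * 0.1 for _ in range(4))
    Wo = rng.normal(size=(d_head, d_model)) * 0.1
    print(gated_attention_head(X, Wq, Wk, Wv, Wg, Wo).shape)  # (8, 32)
    ```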


    Source:

    https://openreview.net/pdf?id=1b7whO4SfY

    15 mins
  • NeurIPS 2025: Large Language Diffusion Models
    Nov 29 2025

    This research paper introduces LLaDA, an 8-billion parameter language model based on the masked diffusion model (MDM) architecture, specifically developed to challenge the assumption that core Large Language Model (LLM) capabilities are exclusive to autoregressive models (ARMs). Unlike ARMs that predict the next token sequentially, LLaDA employs a generative approach featuring a forward token-masking process and a reverse process that simultaneously predicts masked tokens using a Transformer network. Trained and evaluated from scratch, LLaDA demonstrates strong scalability and achieves performance comparable to advanced ARM baselines like LLaMA 3 8B across various benchmarks covering general knowledge, math, and code generation. Crucially, the non-autoregressive nature enables bidirectional modeling, which allows LLaDA to effectively address the reversal curse and outperform contemporary models, including GPT-4o, on complex reversal reasoning tasks. These findings confirm that fundamental generative modeling principles, rather than dependence on sequential ARMs, underpin essential LLM capabilities. The work concludes that diffusion models offer a promising new paradigm for building robust, large-scale language models.
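
    A toy rendering of the masked-diffusion training step described above, with a stand-in predictor in place of the real bidirectional Transformer: sample a masking ratio t, mask tokens independently with probability t, and score the model only on the masked positions with a 1/t weight.

    ```python
    import numpy as np

    MASK_ID = 0  # reserved mask token id (assumption for this toy)

    def masked_diffusion_step(tokens, predict_logits, rng):
        """One training step: mask tokens with probability t, score only the masked positions."""
        t = rng.uniform(1e-3, 1.0)                          # diffusion "time" = masking ratio
        mask = rng.random(tokens.shape) < t                 # forward process: independent masking
        corrupted = np.where(mask, MASK_ID, tokens)
        logits = predict_logits(corrupted)                  # [seq, vocab], predicts all positions at once
        log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
        token_ll = log_probs[np.arange(len(tokens)), tokens]
        return -(token_ll * mask).sum() / t                 # 1/t-weighted loss on masked tokens only

    rng = np.random.default_rng(0)
    vocab, seq = 50, 12
    toy_predictor = lambda x: rng.normal(size=(len(x), vocab))  # stand-in for the Transformer
    print(masked_diffusion_step(rng.integers(1, vocab, size=seq), toy_predictor, rng))
    ```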


    Source:

    https://openreview.net/pdf?id=KnqiC0znVF

    13 mins
  • NeurIPS 2025: Reinforcement Learning for Reasoning in Large Language Models with One Training Example
    Nov 29 2025

    This research examines the data efficiency of Reinforcement Learning with Verifiable Reward (RLVR) when applied to large language models for mathematical reasoning tasks. The paper's most significant finding is the success of 1-shot RLVR, showing that comparable performance to using a large training dataset can be achieved using just a single, carefully selected example. This result suggests that RLVR is effective primarily because it activates the strong latent reasoning capabilities already present in the base model, rather than imparting new domain knowledge. An interesting phenomenon observed during training is "post-saturation generalization," where the model's test performance continues to rise long after training accuracy has saturated and the model has begun overfitting the single example. Ablation studies indicate that while policy gradient loss is the main source of improvement, entropy loss is essential for encouraging the exploration needed to realize this enhanced long-term generalization.
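
    As a deliberately tiny, bandit-style stand-in (not the paper's actual LLM training setup), the snippet below shows the two loss terms the ablations single out: a verifiable 0/1 reward drives the policy-gradient term, while an entropy bonus keeps exploration alive even after the single training example has been fit.

    ```python
    import numpy as np

    def rlvr_loss(logits, sampled, reward, entropy_coef=0.01):
        """Policy-gradient term from a verifiable 0/1 reward plus an entropy bonus."""
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        policy_gradient = -reward * np.log(probs[sampled])    # reinforce verified-correct samples
        entropy_bonus = -(probs * np.log(probs)).sum()        # keeps the policy exploring
        return policy_gradient - entropy_coef * entropy_bonus

    logits = np.array([2.0, 0.5, -1.0])              # toy policy over three candidate answers
    print(rlvr_loss(logits, sampled=0, reward=1.0))  # reward=1.0: the verifier accepted answer 0
    ```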


    Source:

    https://openreview.net/pdf?id=IBrRNLr6JA

    13 mins
  • NeurIPS 2025: Parallel Scaling Law for Language Models
    Nov 29 2025

    The research proposes Parallel Scaling (PARSCALE) as a novel, efficient strategy for enhancing Large Language Model (LLM) capacity by increasing parallel computation rather than merely growing the parameter count. The method reuses existing model parameters by feeding multiple parallel input streams (differentiated by learned prefixes) through the model and dynamically combining their outputs into a single prediction. Through extensive testing, the paper develops a new scaling law showing that scaling parallel computation by a factor of P yields performance gains roughly equivalent to increasing the parameter count from N to O(N log P). PARSCALE is particularly effective at boosting performance on reasoning-intensive tasks such as coding and mathematics. Critically, this scaling technique offers superior inference efficiency, requiring far smaller increases in memory and latency than traditional parameter scaling, which makes it well suited to low-resource edge deployment.
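
    A toy sketch of the parallel-scaling forward pass (assumed shapes and a linear backbone, not the paper's architecture): the same shared weights are run P times with different learned prefixes, and the P output distributions are combined with an input-dependent weighting.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    d_model, vocab, P = 32, 100, 4

    backbone = rng.normal(size=(d_model, vocab)) * 0.1   # shared weights, reused by every stream
    prefixes = rng.normal(size=(P, d_model)) * 0.1       # one learned prefix per parallel stream
    combiner = rng.normal(size=(d_model, P)) * 0.1       # produces input-dependent mixing weights

    def softmax(x):
        z = np.exp(x - x.max())
        return z / z.sum()

    def parscale_forward(h):
        """h: [d_model] hidden state; returns a single combined next-token distribution."""
        stream_probs = np.stack([softmax((h + p) @ backbone) for p in prefixes])  # P parallel passes
        w = softmax(h @ combiner)                    # dynamic aggregation weights, one per stream
        return w @ stream_probs                      # [vocab]

    print(parscale_forward(rng.normal(size=d_model)).shape)  # (100,)
    ```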


    Source:

    https://openreview.net/pdf?id=dEi1S731lk

    16 mins