Episodes

  • The RAG Delusion: What 9 Kubernetes Bugs Reveal About AI Coding Agents
    May 19 2026
    This episode explores the limitations of Retrieval Augmented Generation (RAG) in AI coding agents, particularly when tasked with fixing complex, real-world Kubernetes bugs. It reveals that despite access to extensive documentation, these agents struggle with synthesizing information, reasoning, and understanding the broader implications of changes in distributed systems. Listeners will learn that RAG is not the panacea many assume for intricate software challenges, highlighting a critical gap in AI's ability to interpret and apply knowledge effectively.
    12 mins
  • Debug Log: The Million-Goroutine Memory Leak and the Case for "Boring" Auth
    May 8 2026
    This episode explores the failure of a critical Kubernetes authentication gateway, caused by the accumulation of a million dormant goroutines. It details how client-side context cancellations were not propagated to the upstream proxying goroutines, leaving these lightweight concurrency units holding onto resources indefinitely. Listeners will learn why meticulous context propagation matters in Go's concurrency model, especially in I/O-bound networked services, to prevent similar resource leaks and system instability.
    11 mins
  • Chasing the Cart: Why Pinterest Ripped Out Its Sequential Ad Architecture
    May 8 2026
    This episode explores the challenges of traditional multi-stage ad serving architectures, where optimizing for intermediate metrics like clicks can inadvertently sabotage ultimate conversion goals by prematurely filtering out valuable ads. Listeners will learn how integrating sophisticated conversion prediction intelligence much earlier in the pipeline, through a dedicated "Conversion Candidate Generation" component, can overcome these limitations and lead to more effective ad delivery.
    11 mins
  • The Blast Radius of Agentic AI: Why "Five Nines" is a Relic
    May 1 2026
    This episode explores why the traditional "five nines" reliability metric is fundamentally unsuitable for agentic AI systems. It explains that unlike traditional systems, agentic AI can be "up" but still cause catastrophic failures through incorrect autonomous actions, leading to a significantly wider "blast radius" of damage. Listeners will learn about the unique failure modes of these self-directed systems and the critical need to shift focus from mere availability to ensuring correctness and integrity.
    11 mins
  • Phantom in the Page Cache: Unpacking the 10-Line "Copy Fail" Exploit
    May 1 2026
    This episode discusses a 9-year-old, 10-line "Copy Fail" exploit found in the Linux kernel's page cache, highlighting the paradox of such a critical yet subtle vulnerability evading detection for so long. It explores the nature of this "phantom" bug, explaining how its "surgical precision" and exploitation of concurrency in the page cache make it incredibly difficult to detect, even in highly scrutinized software. Listeners will learn about the profound implications of small flaws in critical system components and the challenges of securing complex, concurrent operating systems.
    13 mins
  • Automating the Autopsy: The Promise and Peril of AI-Generated Postmortems
    May 1 2026
    This episode explores the intriguing concept of using AI to write incident postmortems, highlighting its potential for speed, consistency, and automating data synthesis from vast sources. However, it also delves into the significant perils, such as the impact of poor data quality, the risk of AI hallucinations, and AI's inability to grasp the nuanced human "why" behind incidents. Listeners will learn about the dichotomy between AI's data processing power and the essential human element in understanding complex system failures.
    13 mins
  • The Harness and the Lobotomy: Unpacking Anthropic’s 47-Day Degradation
    Apr 25 2026
    This episode explores a 47-day incident where Anthropic's Claude Code appeared to degrade, revealing that the core AI model was intact but its 'harness'—the surrounding infrastructure and system prompts—failed. Listeners will learn how critical this 'harness' is for an AI product's effective performance, and how seemingly minor changes, like lowering default reasoning effort, can lead to significant user frustration and a breakdown of trust between a company and its users.
    18 mins
  • Scaling for Ghosts: 7 Microservices, 47 Users, and the Trap of Resume-Driven Development
    Apr 25 2026
    This episode explores "Resume-Driven Development" through the case of an engineer at a pre-seed startup who built an enterprise-grade distributed system designed for 100,000 users when the product had only 47. It highlights how engineers might prioritize resume-boosting complex infrastructure over a startup's actual needs, leading to significant financial and human capital costs. Listeners will learn about the dangers of over-engineering and the critical misalignment of incentives in early-stage tech development.
    15 mins