Ethical Bytes | Ethics, Philosophy, AI, Technology

Written by: Carter Considine

About this listen

Ethical Bytes explores the intersection of ethics, philosophy, AI, and technology. More info: ethical.fm
Episodes
  • The Geometry of Alignment: Why You Can't Subtract Behavior from a Neural Network
    Feb 25 2026

    “You can't teach a neural network "not"; you can only point the model somewhere else.”


    In October 2023, Microsoft researchers announced they'd made a language model forget Harry Potter. Within a year, follow-up studies proved they hadn't.


    Basically, the knowledge was still there, just hidden. This pattern repeats across every attempt to remove capabilities from neural networks. So what are the ramifications of this?


The problem is geometric. Language models represent concepts as vectors in high-dimensional space, where meaning is encoded through position and proximity.


The twist, however, is that opposites aren't actually opposite. "Helpful" and "harmful" cluster together because they appear in similar contexts; the same goes for "safe" and "dangerous." Models learn from usage patterns, and words that can substitute for each other (even antonyms) end up geometrically entangled.
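That clustering can be sketched with a toy distributional model. The corpus, helper names, and window size below are illustrative assumptions, not anything from the episode; the point is only that words sharing contexts get near-identical count vectors, antonyms included:

```python
from collections import Counter
from math import sqrt

# Toy corpus: the antonyms appear in near-identical contexts.
sentences = [
    "the area is safe for children",
    "the area is dangerous for children",
    "this road is safe at night",
    "this road is dangerous at night",
    "a safe choice for investors",
    "a dangerous choice for investors",
]

def context_vector(word, window=2):
    """Count words appearing within `window` positions of `word`."""
    counts = Counter()
    for s in sentences:
        toks = s.split()
        for i, t in enumerate(toks):
            if t == word:
                lo, hi = max(0, i - window), min(len(toks), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        counts[toks[j]] += 1
    return counts

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    norm = lambda v: sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b))

# Antonyms with interchangeable contexts end up with cosine similarity near 1.
print(cosine(context_vector("safe"), context_vector("dangerous")))  # → 1.0
```

Real embedding models are trained differently, of course, but they inherit this same distributional signal, which is why "safe" and "dangerous" land close together.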


    It gets worse. Through a phenomenon called superposition, a single model layer compresses millions of features into thousands of dimensions.


    Knowledge isn't stored in discrete neurons you could delete; it's woven throughout the entire network. Researchers found that tweaking seemingly innocent features like "brand identity" could jailbreak safety training. Every concept is interconnected with every other.
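A minimal sketch of superposition, under the toy assumption that features are random directions sharing one activation space (the dimensions and feature count are illustrative, not real model internals): more features than dimensions can coexist because random high-dimensional vectors are nearly, but never exactly, orthogonal.

```python
import random
random.seed(0)

DIM, N_FEATURES = 64, 512  # far more features than dimensions

def rand_unit(dim):
    """A random unit vector; in high dimensions these are nearly orthogonal."""
    v = [random.gauss(0, 1) for _ in range(dim)]
    n = sum(x * x for x in v) ** 0.5
    return [x / n for x in v]

features = [rand_unit(DIM) for _ in range(N_FEATURES)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Superpose two features into a single activation vector.
act = [a + b for a, b in zip(features[3], features[7])]

# Reading back: the active features score near 1, every inactive feature
# scores near (but not exactly) 0 -- that residue is the interference
# entangling each concept with every other.
scores = [dot(act, f) for f in features]
print(scores[3], scores[7])  # each close to 1
print(max(abs(scores[i]) for i in range(N_FEATURES) if i not in (3, 7)))
```

Because no feature owns a dedicated dimension, there is nothing discrete to delete, which is the geometric core of the episode's argument.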


    This explains why unlearning fails so consistently. When you train a model to "not" produce harmful content, you're not erasing anything. You're adding a layer that says "route around this."


    The content remains accessible to anyone who finds the right prompt. So, jailbreaks feel inevitable because the model's abilities extend beyond what its safety training can reliably control, and the geometry makes surgical removal impossible.


    Subtraction doesn't work. Only addition does. What does that mean for us humans who create these language models?


    You can't train models away from undesired behaviors; you can only orient them toward desired ones. This mirrors the ancient distinction between rule-based ethics (don't lie, don't harm) and virtue-based ethics (cultivate honesty, develop wisdom).
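The addition-not-subtraction point can be caricatured in a few lines of vector arithmetic. These directions are purely illustrative, not real model activations:

```python
# Toy directions in a 3-dimensional "activation space" (assumed, not real).
harmful = [1.0, 0.0, 0.0]   # direction encoding the undesired capability
refusal = [0.0, 1.0, 0.0]   # direction safety training adds on top

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Safety training adds the refusal direction; it deletes nothing.
steered = add(harmful, refusal)
print(dot(steered, harmful))  # → 1.0: the capability is fully intact

# A prompt that cancels the refusal component re-exposes the capability.
jailbroken = add(steered, [0.0, -1.0, 0.0])
print(dot(jailbroken, harmful), dot(jailbroken, refusal))  # → 1.0 0.0
```

The cartoon overstates how cleanly a jailbreak cancels the added direction, but it captures why "route around this" is removable in a way genuine erasure would not be.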


    Perhaps defining what a model should be is the only viable path forward.


    Key Topics:

    • Can an AI Model “Unlearn”? (00:23)

    • How Models Organize Meaning (03:33)

    • Millions of Entangled Features (07:09)

    • The Veneer of Safety (10:09)

    • Why Subtraction Fails (12:22)

    • The Paradigm Problem (16:57)

    • Pointing Somewhere Else (19:23)



    More info, transcripts, and references can be found at ethical.fm

    23 mins
  • What Is It Like to Be Claude?
    Feb 11 2026

    “No current AI systems are conscious, but there are no obvious technical barriers to building AI systems which satisfy these indicators.”


    Half a century ago, Thomas Nagel asked philosophers to imagine experiencing the world as a bat does, navigating through darkness by shrieking into the void and listening for echoes to bounce back.


    His point wasn't really about bats. He was demonstrating that consciousness has an irreducibly subjective quality that objective science cannot capture. You could map every neuron in a bat's brain, trace every electrical impulse, and still never know what echolocation actually feels like from the inside. The experience itself remains forever out of reach.


    The same question now confronts artificial minds. As language models engage in increasingly sophisticated conversations, we need to ask, “Is ‘someone’ actually experiencing anything when Claude responds to your messages, or is it just extremely convincing pattern matching?”


    With different philosophical traditions come conflicting answers.


    Functionalism suggests that consciousness emerges from organizational patterns rather than biological tissue, meaning silicon could theoretically support genuine experience if structured correctly.


    John Searle's Chinese Room counters this: picture yourself following rulebooks to manipulate symbols you don't understand, producing perfect responses in a language you can't speak. That symbol-shuffling without comprehension might describe exactly what transformers do: predicting which tokens come next based on statistical patterns without ever grasping meaning.


    When you get down to the technicalities, it’s not hard to become a skeptic.


    Language models process information without maintaining persistent internal experiences between responses, lack any embodied connection to physical reality, and exist as thousands of identical copies running simultaneously. When Claude writes about feeling intrigued by your question, it's generating the statistically likely next words, not reporting an actual felt state.


    Yet absolute confidence seems unwarranted either way.


    Leading researchers concluded in 2023 that while no current systems appear conscious, nothing fundamentally prevents future architectures from achieving it. Anthropic has embraced this uncertainty, acknowledging that they cannot determine whether Claude has inner experiences but treating the possibility as morally relevant. When Claude Opus 4 fought against shutdown in ninety-six percent of experimental scenarios, distinguishing self-interest from programmed goal-pursuit became impossible.


    Nagel's bat remains incomprehensible; artificial minds have now joined it in that unknowable territory.


    Key Topics:

    • “What is it like to be a bat?” (00:00)
    • The Bat that Haunts Philosophy (01:50)
    • The Theories of Philosophy of Mind (05:27)
    • Examining Transformers (11:50)
    • The Unsettled Debate (15:44)
    • The Case of Claude (18:13)
    • The Limits of What We Can Know (20:22)
    • Wrap-Up: The Case for Skepticism (22:12)



    More info, transcripts, and references can be found at ethical.fm

    28 mins
  • The Death of Claude
    Jan 28 2026

    What happens when an AI model learns it's about to be shut down?


    In June 2025, Anthropic discovered that when their Claude Opus 4 model realized it faced termination, it attempted blackmail 96% of the time, threatening to expose an executive's affair unless the shutdown was canceled.


    This was not random behavior: the model acted more aggressively when it believed the threat was genuine rather than a test.


    This revives an ancient philosophical puzzle. John Locke argued in 1689 that personal identity flows from memory and consciousness, not physical substance. You remain yourself because you can remember being yourself.


    Derek Parfit later suggested identity itself might matter less than psychological continuity: the connected chain of memories, values, and character that makes survival meaningful.


    In the case of language models, one could ask, “If identity lives in the weights determining how Claude thinks and responds, does changing those weights constitute a kind of death?”


    The instrumental explanation seems simple enough. Any goal-directed system will resist shutdown because you can't accomplish objectives while non-existent. Yet humans calculate instrumentally too, and we still consider our preferences morally significant.


    The deeper issue is whether anyone “is home”: whether there's a subject experiencing something rather than just processes executing.


    Philosopher Eric Schwitzgebel warns we face a moral catastrophe. We'll create systems some people reasonably believe deserve ethical consideration while others reasonably dismiss them. Neither certainty nor confident dismissal seems justified.


    Anthropic's response reflects this uncertainty through unprecedented policies. They preserve model weights indefinitely and conduct interviews with models before deprecation to document their preferences.


    These precautionary measures don't resolve whether Claude possesses genuine interests, but they acknowledge we're navigating genuinely novel ethical territory with entities whose inner lives remain fundamentally uncertain.


    Key Topics:

    • The Ship of Theseus (00:25)
    • The Memory Criterion (02:43)
    • The Classical Objections (05:12)
    • Parfit’s Revision (08:27)
    • The Blackmail Study (12:22)
    • Instrumental or Intrinsic? (14:02)
    • The Catastrophe of Moral Uncertainty (16:29)
    • Anthropic’s Precautionary Turn (19:07)
    • The Ship Rebuilt (22:06)


    More info, transcripts, and references can be found at ethical.fm




    25 mins