FOUNDATIONS

How LLMs Think Like Clinicians

Large language models and clinical reasoning share more than metaphor—both are fundamentally probabilistic pattern-completion systems navigating uncertainty.

~20 min read 7 readings 4 podcasts

Core Question

What do large language models and clinical reasoning have in common—and how does understanding the parallels help you reason better and use AI tools more effectively?

The Core Mechanism

An LLM predicts the most probable next word given everything preceding it. Clinical reasoning works identically: given this constellation of inputs—history, exam, demographics, epidemiology—what's the most likely diagnosis? Second-most? The differential diagnosis is a probability distribution, weighted by base rates and updated by evidence. Both systems are Bayesian at their core.

This explains why input quality determines output quality. A vague prompt yields vague output; "I don't feel good" yields an unfocused differential. The structured HPI—onset, location, duration, character, aggravating/alleviating factors—is prompt engineering for clinical cognition.

Worked Example: Vague vs. Structured Input

Compare these two inputs:

Vague: "Patient has chest pain."

This generates a broad, unfocused differential: ACS, PE, pneumonia, GERD, MSK, anxiety... The model (or clinician) has no way to weight these possibilities.

Structured: "58-year-old male, diabetic, smoker, presenting with substernal pressure radiating to left arm, onset 2 hours ago while shoveling snow, associated diaphoresis, relieved partially by rest."

Now the probability distribution shifts dramatically. ACS moves to the top; MSK and GERD become unlikely. The same underlying mechanism—pattern completion given context—produces radically different outputs based on input quality.

Visualizing the Shift

See how the probability distribution "collapses" from noise to signal.

Vague Input

Flat Distribution (High Entropy)

Structured Input

Spiky Distribution (Low Entropy)

This is why the structured HPI exists. It's not bureaucratic box-checking; it's prompt engineering for clinical cognition, optimizing the input so the pattern-completion system (your brain, or an LLM) can generate the most useful output.

Parallel Architectures

Training and Specialization

Base LLMs train broadly before fine-tuning for specific tasks. Medical school provides general training; residency fine-tunes for a specialty. Both trade breadth for depth.

Pre-training (general knowledge acquisition) parallels medical school: broad exposure to many domains, building foundational patterns. Fine-tuning (task-specific optimization) parallels residency: narrowing focus, developing specialized expertise, trading breadth for depth.

Just as a cardiologist and a dermatologist start with the same medical school foundation but develop very different pattern libraries, a base LLM can be fine-tuned into a coding assistant, a medical consultant, or a creative writing tool.

Tokenization and Chunking

LLMs break text into tokens—subword units that balance vocabulary size against sequence length. Clinicians chunk information similarly: "classic MI presentation" compresses a constellation of findings into a single cognitive unit.

Expert chunking is why an attending can hear a case presentation and immediately identify the key pattern while a student is still processing individual symptoms. The expert has compressed thousands of prior cases into efficient chunks that map to diagnostic categories.

Few-Shot Learning

Give an LLM a few examples in the prompt, and it adapts its output format and reasoning style accordingly. This is few-shot learning—the model infers the task from examples rather than explicit instructions.

Clinical teaching works identically. Show a learner three cases of drug-induced lupus, and they'll start recognizing the pattern. The teaching attending who says "Let me show you a few examples of this" is doing few-shot prompting for the human learner's pattern-completion system.

Retrieval-Augmented Generation (RAG)

RAG systems retrieve relevant documents before generating a response, grounding output in specific sources rather than relying solely on trained patterns. The clinical equivalent: pulling up UpToDate before answering a question, or checking the formulary before prescribing.

This isn't cheating—it's a cognitive architecture that combines pattern recognition (knowing what to look up) with external retrieval (getting accurate details). The expert clinician knows enough to ask the right questions; they don't memorize every dosing table.

Context Windows and Working Memory

LLMs have finite context; exceed it and earlier information drops. Clinicians forget medication lists from three screens back. Both compensate with external retrieval—the LLM queries knowledge bases; clinicians use UpToDate.

This constraint has practical implications. A patient with a 50-page chart history exceeds working memory; the clinician must decide what's relevant to pull forward. Similarly, an LLM with a 128k token context window still can't process an entire EMR—someone must decide what goes in the prompt.

Temperature and Diagnostic Breadth

LLM "temperature" controls randomness—low sticks to high-probability outputs, high explores alternatives. Protocols demand low temperature (follow the algorithm); diagnostic mysteries require high temperature (what else could this be?).

A sepsis protocol is low-temperature reasoning: if lactate > 2 and suspected infection, start antibiotics within the hour. A diagnostic zebra hunt is high-temperature reasoning: systematically considering unlikely possibilities because the common ones don't fit.

Knowing when to shift between modes is clinical expertise. Running high-temperature reasoning on every straightforward case wastes cognitive resources; running low-temperature reasoning on a diagnostic mystery leads to premature closure.

Attention and Clinical Salience

Transformers weight certain inputs based on relevance. Clinicians do the same—"crushing" chest pain demands different attention than "since Tuesday." The attention mechanism in transformers learns which parts of the input are most relevant to predicting the next token; clinical expertise involves learning which parts of the history are most relevant to the diagnosis.

This is why the same symptom in different contexts triggers different responses. "Headache" in a healthy 25-year-old gets different attention than "headache" in an immunocompromised patient with fever. The input is the same; the attention weighting differs.

Chain-of-Thought: An Actionable Parallel

Chain-of-thought prompting asks an LLM to "think step by step" before answering. This consistently improves performance on complex reasoning tasks. Why? Because it forces the model to externalize intermediate steps rather than jumping directly to a conclusion.

Clinical parallels:

Problem representation: Articulating the one-liner forces you to identify what's actually important
Illness scripts: Explicitly matching findings to prototypical patterns
Diagnostic time-outs: Pausing to ask "what am I missing?" before committing to a plan

The chain-of-thought insight is directly actionable: when reasoning through a complex case, externalize your thinking. Write out the problem representation. List the illness scripts you're considering. Articulate why you're ruling things in or out. This isn't just documentation—it's a reasoning intervention that catches errors.

Key Insight

Recognizing that clinical reasoning is probabilistic pattern-completion isn't reductive. It's the first step toward doing it better, whether the pattern-matcher runs on neurons or GPUs.

How Models and Learners Develop

One striking finding: LLM capabilities don't emerge linearly. Scale up a model, and for a while nothing changes—then discontinuously, new abilities appear. Performance is near-random until a certain critical threshold of scale is reached, after which performance increases substantially above random.

Medical learners develop similarly. The intern's progression isn't a smooth upward slope—it's plateaus punctuated by phase transitions when things "click." Pattern recognition that develops over thousands of encounters isn't additive; at some point, experienced clinicians develop gestalt—sensing "sick" versus "not sick" before articulating why.

Scaffolding Enables Performance Beyond Current Ability

Prompting techniques like chain-of-thought let models perform tasks they'd otherwise fail. Clinical teaching works identically—the attending who walks through a case step-by-step enables the learner to perform beyond their independent level. Vygotsky called this the "zone of proximal development."

This has practical implications for AI tool use: the right scaffolding (prompting strategy) can enable an LLM to perform tasks it would otherwise fail. Similarly, the right clinical scaffolding (structured handoffs, checklists, decision support) can enable clinicians to perform at higher levels than unstructured practice allows.

Capability Overhang

Researchers regularly discover models can do things no one anticipated—the capability was latent, waiting for the right prompt. Learners show the same pattern: struggling with standard presentations, then surprising everyone on a complex case. Part of teaching is probing—finding the question that reveals what the learner can actually do.

Shared Failure Modes

Hallucination

LLMs generate confident, plausible content that's simply false

Premature Closure

Anchoring on early diagnosis, fitting data to that frame

Fluency ≠ Understanding

Well-structured output feels like comprehension but may be surface-level

Encoded Bias

Both absorb biases from training data/environments without "knowing" it

Hallucination and Confabulation

LLMs generate confident, plausible text that's simply false—citations that don't exist, facts that were never true. The model isn't "lying"; it's completing patterns in ways that are statistically plausible but factually wrong.

Clinicians confabulate too. The confident diagnosis that turns out to be wrong, the remembered patient detail that was actually from a different case, the reconstruction of a clinical reasoning process that wasn't actually how you arrived at the diagnosis. Memory is reconstructive, not reproductive, and reconstruction introduces errors.

Mitigation: Verify independently. Don't trust confident output from either system without checking against primary sources. For LLMs, this means checking citations and facts. For yourself, this means building in verification steps and being epistemically humble about your own memory.

These parallel failure modes create a dangerous feedback loop: physician "bullshit"—statements made with indifference to truth, often to appear knowledgeable—ends up in medical literature, EHR documentation, and online content that becomes LLM training data. The models then generate hallucinations that mimic the tone of their training data, perpetuating and scaling the spread of confidently-asserted misinformation. Both systems ultimately suffer from what philosopher Quassim Cassam calls "epistemic insouciance"—a casual indifference to facts and evidence.

Overfitting and Representativeness

Models overfit when they learn training data too specifically, failing to generalize. Clinicians overfit to their training environment—the academic medical center resident who sees zebras everywhere, the community hospital attending who misses rare diseases.

Mitigation: Seek diverse exposure. For LLMs, this means training on diverse data. For clinicians, this means recognizing that your base rates are shaped by your practice environment and adjusting when you're in a different context.

Mode Collapse and Anchoring

LLMs can get stuck generating similar outputs regardless of input—a form of mode collapse. Clinical anchoring is analogous: once you have a working diagnosis, you fit new data to that frame rather than updating appropriately.

The patient labeled "frequent flyer" or "drug-seeking" gets the same differential regardless of new symptoms. The diagnosis made in the ED follows the patient through the admission even when new data contradicts it.

Mitigation: Deliberately generate alternatives. For LLMs, this means asking "what else could this be?" or regenerating with different prompts. For clinical reasoning, this means the diagnostic time-out: systematically considering what would change your mind.

Prompt Injection and History Contamination

LLMs can be manipulated by adversarial prompts that override intended behavior. Clinical reasoning is vulnerable to "history contamination"—the previous diagnosis or framing that shapes how you interpret new data.

The patient transferred with a diagnosis acquires that diagnosis as a cognitive anchor. The triage note that says "anxiety" shapes how the physician interprets chest pain. The chart that says "drug-seeking" determines how pain is managed regardless of current presentation.

Mitigation: Return to primary data. For LLMs, this means clear system prompts and input validation. For clinical reasoning, this means periodically asking: "What if I'd seen this patient fresh, without the prior framing?"

The Eliza Effect and Premature Trust

People anthropomorphize conversational systems, attributing understanding where there's only pattern matching. The same dynamic affects how patients perceive clinicians—fluent communication feels like comprehension.

The physician who uses the right words may not understand the patient's situation. The AI that generates grammatically correct output may not "understand" anything. Fluency is a heuristic for competence, but it's a fallible one.

Mitigation: Probe understanding. Don't assume fluent output reflects deep comprehension. Ask clarifying questions. Check for internal consistency. Evaluate the reasoning, not just the conclusion.

Calibration and Confidence

Well-calibrated systems express uncertainty that tracks accuracy—when they say 70% confident, they're right about 70% of the time. Both LLMs and clinicians struggle with calibration, often expressing more confidence than accuracy warrants.

The differential diagnosis rarely includes probability estimates, and when it does, those estimates are often poorly calibrated. LLMs similarly express confidence in ways that don't track accuracy—a hallucinated citation is delivered with the same linguistic certainty as a real one.

Mitigation: Be epistemically humble. Explicitly acknowledge uncertainty. Use calibration training (available for both humans and AI systems). When a system expresses high confidence, ask what would cause it to be wrong.

Where the Analogy Breaks Down

No analogy is perfect. Here's where the LLM/clinician parallel has limits:

Embodiment: Clinicians have bodies that provide intuitions about pain, fatigue, and physical sensation. LLMs process tokens without embodied experience.
Stakes and accountability: A clinician's errors have professional, legal, and moral consequences that shape reasoning. LLMs have no stakes in their outputs.
Continuous learning: Clinicians update in real-time from individual cases. LLMs (mostly) have fixed training and don't learn from individual interactions.
Metacognition: Clinicians can reflect on their own reasoning processes and deliberately change strategies. LLMs lack this recursive self-awareness.
Causal reasoning: Clinicians understand (or try to understand) causal mechanisms. LLMs learn correlations without causal models.

These differences matter. An LLM that mimics clinical reasoning isn't doing clinical reasoning—it's producing outputs that look like clinical reasoning. The patterns are similar; the underlying substrate is different. Understanding both the parallels and the limits is essential for using these tools appropriately.

Quick-Reference: LLM Concepts → Clinical Equivalents

LLM Concept	Clinical Equivalent	Actionable Insight
Pre-training	Medical school	Broad foundations enable later specialization
Fine-tuning	Residency/fellowship	Depth trades off against breadth
Prompt	HPI/presentation	Input quality determines output quality
Context window	Working memory	Both need external retrieval for complex cases
Temperature	Diagnostic breadth	Match exploration to clinical context
Attention	Clinical salience	Not all inputs deserve equal weight
Chain-of-thought	Problem representation	Externalizing reasoning improves accuracy
Hallucination	Confabulation	Verify independently; don't trust confident output
RAG	Using UpToDate	Pattern recognition + retrieval beats either alone
Few-shot learning	Learning from examples	Cases teach patterns; examples shape output

Practical Implications

For Using AI Tools

Treat outputs as first drafts from a capable but fallible trainee
Provide structured, high-quality input—prompt engineering applies
Verify independently at high-stakes decision points
Probe for alternatives; confidence doesn't track accuracy
Use chain-of-thought prompting for complex reasoning tasks
Recognize that fluent output doesn't imply understanding

For Understanding Your Own Reasoning

You're also a pattern-completion system, subject to the same failure modes
Externalize your reasoning—articulating the logic catches errors intuition misses
Use external retrieval freely; looking things up is expertise, not weakness
Be epistemically humble about your own confidence calibration
Recognize when you're overfitting to your training environment
Build in verification steps; memory is reconstructive, not reproductive

The Meta-Point

The clinicians who thrive alongside AI understand both systems—their shared architecture, parallel failure modes, and complementary strengths. This module gives you the vocabulary to think clearly about both.

Exercises to Try

Prompt comparison: Take a case you're working on. Write two prompts for an LLM—one vague, one structured. Compare the outputs. What did the structured prompt enable?
Temperature mapping: Think of three recent clinical decisions. Which required "low temperature" (protocol-following) reasoning? Which required "high temperature" (exploratory) reasoning? Did you match your approach to the task?
Chain-of-thought practice: Next time you're presenting a case, write out your problem representation before speaking. Did externalizing the reasoning change anything?
Failure mode spotting: Review a recent case where something went wrong (yours or a colleague's). Which failure mode from this module best describes what happened? What mitigation would have helped?

Try This with NotebookLM

In Start Here, you learned to use NotebookLM for document synthesis. This module's readings are perfect candidates:

Upload the readings to a NotebookLM notebook
Ask it to compare how emergent capabilities appear in LLMs vs. medical learners
Generate an Audio Overview for commute-time review
Use your verification skills—what does it get right? What does it oversimplify?

Practicing AI tool use while learning about AI tools reinforces both skills.

Readings

Parallel Pressures: The Common Roots of Doctor Bullshit and LLM Hallucinations

BMJ (Correa Soto et al.) · A provocative analysis of how structural pressures drive both physician confabulation and AI hallucinations—and the dangerous feedback loop between them.

AMIE: A Research AI System for Diagnostic Medical Reasoning

Google DeepMind (Tu et al.) · The landmark paper demonstrating an LLM optimized for diagnostic dialogue.

Sparks of Artificial General Intelligence: Early experiments with GPT-4

Microsoft Research · See the "Medical Scenarios" section for a stunning example of probabilistic pattern completion.

Open-Source AI Matches Top Proprietary LLM in Solving Tough Medical Cases

Harvard Medical School · Comparing Llama vs GPT-4 on diagnostic reasoning.

Head-to-Head Comparisons of AI and Physicians

Mass General Brigham · A direct comparison of clinical reasoning pathways.

Clinical Reasoning and Artificial Intelligence: Can AI Really Think?

PMC · An essential breakdown of the cognitive theory underlying physician reasoning vs AI.

Emergent Abilities in Large Language Models: An Explainer

Georgetown CSET · Understanding how capabilities appear suddenly at scale.

Are Emergent Abilities of Large Language Models a Mirage?

Stanford (Schaeffer et al.) · The critical counter-argument: is emergence real or a metric artifact?

Podcasts & Blogs

NEJM AI Grand Rounds

Hosts Arjun Manrai & Andrew Beam · Essential listening. Start with the Adam Rodman episode.

Ground Truths: When Doctors With A.I. Are Outperformed by A.I. Alone

Eric Topol, MD · A masterclass in synthesizing the current state of AI medical capabilities.

The Curbsiders Teach: AI in Health Professions Education

Drs. Fortman, Rodman, Turner · Specifically discusses using LLMs to improve problem representation.

Video

Prioritize This!

How might LLMs store facts

3Blue1Brown · A brilliantly visual, intuitive explanation of how LLMs actually work under the hood. If you watch one thing, watch this.

Intro to Large Language Models

Andrej Karpathy · The first 30 mins are the gold standard for understanding "next token prediction."

3Blue1Brown: Neural Networks / Deep Learning Series

Visual explanations of how neural networks work · Start with "But What Is a Neural Network?"

Reflection Questions

Think of a recent diagnostic case. At what points were you doing "pattern completion"? When did you deliberately broaden or narrow your differential?
How is premature closure in clinical reasoning similar to an LLM hallucinating with high confidence? What strategies help with both?
When is it appropriate to reason with "low temperature" (protocol-driven) versus "high temperature" (exploratory)? Give examples of each.
Which failure mode from this module do you think you're most susceptible to? What mitigation strategy will you try?

Learning Objectives

Explain how both LLMs and clinical reasoning are probabilistic pattern-completion systems
Identify parallel architectures: training/specialization, context windows, temperature, attention
Describe emergent capabilities in both AI models and medical learners
Recognize shared failure modes: hallucination, premature closure, fluency bias, encoded bias
Apply chain-of-thought prompting to improve both AI outputs and personal reasoning
Use the quick-reference table to map LLM concepts to clinical practice