How LLMs Think Like Clinicians
Large language models and clinical reasoning share more than metaphor—both are fundamentally probabilistic pattern-completion systems navigating uncertainty.
What do large language models and clinical reasoning have in common—and how does understanding the parallels help you reason better and use AI tools more effectively?
The Core Mechanism
An LLM predicts the most probable next word given everything preceding it. Clinical reasoning works identically: given this constellation of inputs—history, exam, demographics, epidemiology—what's the most likely diagnosis? Second-most? The differential diagnosis is a probability distribution, weighted by base rates and updated by evidence. Both systems are Bayesian at their core.
This explains why input quality determines output quality. A vague prompt yields vague output; "I don't feel good" yields an unfocused differential. The structured HPI—onset, location, duration, character, aggravating/alleviating factors—is prompt engineering for clinical cognition.
Worked Example: Vague vs. Structured Input
Compare two inputs. The first is vague: a patient reporting chest pain and "not feeling good," with no further characterization. This generates a broad, unfocused differential: ACS, PE, pneumonia, GERD, MSK, anxiety... The model (or clinician) has no way to weight these possibilities.
The second is structured: crushing substernal pressure, onset with exertion, radiation to the left arm, diaphoresis, and cardiac risk factors. Now the probability distribution shifts dramatically. ACS moves to the top; MSK and GERD become unlikely. The same underlying mechanism—pattern completion given context—produces radically different outputs based on input quality.
The probability distribution collapses from noise to signal.
This is why the structured HPI exists. It's not bureaucratic box-checking; it's prompt engineering for clinical cognition, optimizing the input so the pattern-completion system (your brain, or an LLM) can generate the most useful output.
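To make that collapse concrete, here is a toy sketch in Python. The diagnoses, findings, and weights are invented for illustration; they stand in for whatever statistics a trained model (or an experienced clinician) has internalized.

```python
import math

# Toy illustration: a differential diagnosis as a probability distribution.
# The feature weights below are invented for illustration, not real likelihoods.
WEIGHTS = {
    "ACS":       {"chest pain": 1.0, "crushing": 2.0, "radiates to arm": 2.0, "diaphoresis": 1.5, "risk factors": 1.5},
    "PE":        {"chest pain": 1.0, "pleuritic": 2.0, "dyspnea": 1.5},
    "GERD":      {"chest pain": 1.0, "burning": 2.0, "postprandial": 1.5},
    "MSK":       {"chest pain": 1.0, "reproducible on palpation": 2.5},
    "Pneumonia": {"chest pain": 1.0, "fever": 2.0, "cough": 1.5},
    "Anxiety":   {"chest pain": 1.0, "situational trigger": 2.0},
}

def differential(findings):
    """Score each diagnosis by its matching findings, then normalize into a distribution."""
    scores = {dx: sum(w for f, w in feats.items() if f in findings) for dx, feats in WEIGHTS.items()}
    z = sum(math.exp(s) for s in scores.values())
    return {dx: round(math.exp(s) / z, 3) for dx, s in sorted(scores.items(), key=lambda kv: -kv[1])}

vague = {"chest pain"}
structured = {"chest pain", "crushing", "radiates to arm", "diaphoresis", "risk factors"}

print(differential(vague))       # near-uniform: the distribution is "noise"
print(differential(structured))  # ACS dominates: the distribution collapses toward one answer
```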
Parallel Architectures
Training and Specialization
Base LLMs train broadly before being fine-tuned for specific tasks, and medical training follows the same arc. Pre-training (general knowledge acquisition) parallels medical school: broad exposure to many domains, building foundational patterns. Fine-tuning (task-specific optimization) parallels residency: narrowing focus and developing specialized expertise. Both trade breadth for depth.
Just as a cardiologist and a dermatologist start with the same medical school foundation but develop very different pattern libraries, a base LLM can be fine-tuned into a coding assistant, a medical consultant, or a creative writing tool.
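A minimal sketch of the pre-train-then-fine-tune arc, assuming the Hugging Face transformers library, PyTorch, and a small GPT-2 checkpoint; the two-sentence "specialty corpus" and hyperparameters are invented for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")  # "medical school": broad pre-trained weights

# "Residency": a tiny domain-specific corpus (hypothetical examples).
corpus = [
    "Chest pain with diaphoresis and radiation to the left arm suggests ACS.",
    "Pleuritic chest pain with tachycardia and hypoxia raises concern for PE.",
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(3):
    for text in corpus:
        inputs = tokenizer(text, return_tensors="pt")
        outputs = model(**inputs, labels=inputs["input_ids"])  # standard causal-LM loss
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
# The weights now lean toward the specialty corpus: depth at the cost of breadth.
```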
Tokenization and Chunking
LLMs break text into tokens—subword units that balance vocabulary size against sequence length. Clinicians chunk information similarly: "classic MI presentation" compresses a constellation of findings into a single cognitive unit.
Expert chunking is why an attending can hear a case presentation and immediately identify the key pattern while a student is still processing individual symptoms. The expert has compressed thousands of prior cases into efficient chunks that map to diagnostic categories.
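You can see the chunking directly by running a tokenizer. This sketch assumes the Hugging Face transformers library and GPT-2's vocabulary; the exact splits depend on the tokenizer.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

for phrase in ["classic MI presentation", "paroxysmal nocturnal dyspnea"]:
    print(phrase, "->", tokenizer.tokenize(phrase))
# Common words tend to survive as single tokens; rarer clinical terms split into
# several subword pieces, much as a novice processes individual findings while
# an expert handles whole "chunks" at once.
```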
Few-Shot Learning
Give an LLM a few examples in the prompt, and it adapts its output format and reasoning style accordingly. This is few-shot learning—the model infers the task from examples rather than explicit instructions.
Clinical teaching works identically. Show a learner three cases of drug-induced lupus, and they'll start recognizing the pattern. The teaching attending who says "Let me show you a few examples of this" is doing few-shot prompting for the human learner's pattern-completion system.
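A sketch of what few-shot prompting looks like in practice; the cases and formatting below are illustrative placeholders, not a validated teaching set.

```python
# Few-shot prompting: the examples, not explicit instructions, define the task.
examples = [
    ("Hydralazine, arthralgias, positive anti-histone antibodies", "drug-induced lupus"),
    ("Procainamide, pleuritic chest pain, positive anti-histone antibodies", "drug-induced lupus"),
    ("Isoniazid, fever, serositis, positive anti-histone antibodies", "drug-induced lupus"),
]

prompt = "Identify the most likely diagnosis.\n\n"
for findings, dx in examples:
    prompt += f"Findings: {findings}\nDiagnosis: {dx}\n\n"
prompt += "Findings: Minocycline, symmetric polyarthritis, positive anti-histone antibodies\nDiagnosis:"

print(prompt)  # send to any LLM; the three worked cases establish the pattern to complete
```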
Retrieval-Augmented Generation (RAG)
RAG systems retrieve relevant documents before generating a response, grounding output in specific sources rather than relying solely on trained patterns. The clinical equivalent: pulling up UpToDate before answering a question, or checking the formulary before prescribing.
This isn't cheating—it's a cognitive architecture that combines pattern recognition (knowing what to look up) with external retrieval (getting accurate details). The expert clinician knows enough to ask the right questions; they don't memorize every dosing table.
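A minimal RAG sketch, with toy bag-of-words retrieval standing in for a real embedding model, and the knowledge_base list standing in for UpToDate or a formulary.

```python
from collections import Counter
import math

knowledge_base = [
    "Vancomycin dosing should be adjusted for renal function and monitored with levels.",
    "First-line therapy for community-acquired pneumonia in healthy outpatients is amoxicillin or doxycycline.",
    "Sepsis bundles recommend antibiotics within one hour of recognition.",
]

def cosine(a, b):
    """Crude bag-of-words cosine similarity between two strings."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    return dot / (math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values())))

def retrieve(query, k=1):
    """Return the k most similar knowledge-base snippets."""
    return sorted(knowledge_base, key=lambda doc: cosine(query, doc), reverse=True)[:k]

query = "How should I dose vancomycin in renal impairment?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
print(prompt)  # generation is then grounded in retrieved sources, not memory alone
```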
Context Windows and Working Memory
LLMs have finite context; exceed it and earlier information drops. Clinicians forget medication lists from three screens back. Both compensate with external retrieval—the LLM queries knowledge bases; clinicians use UpToDate.
This constraint has practical implications. A patient with a 50-page chart history exceeds working memory; the clinician must decide what's relevant to pull forward. Similarly, an LLM with a 128k token context window still can't process an entire EMR—someone must decide what goes in the prompt.
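A rough sketch of that triage step. It uses an approximate characters-per-token heuristic rather than a real tokenizer, and the budget and chart sections are illustrative.

```python
def fit_to_budget(sections, budget_tokens=128_000):
    """Keep chart sections, in order of clinical relevance, until the token budget is spent."""
    kept, used = [], 0
    for text in sections:  # assumes sections are already ordered by relevance
        est_tokens = len(text) // 4  # crude proxy: ~4 characters per token
        if used + est_tokens > budget_tokens:
            break
        kept.append(text)
        used += est_tokens
    return kept

chart_sections = [
    "Active problem list ...",
    "Most recent discharge summary ...",
    "Every prior progress note ...",
]
print(fit_to_budget(chart_sections, budget_tokens=500))
```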
Temperature and Diagnostic Breadth
LLM "temperature" controls randomness—low sticks to high-probability outputs, high explores alternatives. Protocols demand low temperature (follow the algorithm); diagnostic mysteries require high temperature (what else could this be?).
A sepsis protocol is low-temperature reasoning: if lactate > 2 and suspected infection, start antibiotics within the hour. A diagnostic zebra hunt is high-temperature reasoning: systematically considering unlikely possibilities because the common ones don't fit.
Knowing when to shift between modes is clinical expertise. Running high-temperature reasoning on every straightforward case wastes cognitive resources; running low-temperature reasoning on a diagnostic mystery leads to premature closure.
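Temperature is just a divisor applied to the model's scores before they are turned into probabilities. The logits below are invented, but the reshaping is the real mechanism.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

differential_logits = {"ACS": 3.0, "PE": 1.5, "GERD": 1.0, "Rare vasculitis": 0.2}
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(list(differential_logits.values()), t)
    print(t, dict(zip(differential_logits, (round(p, 2) for p in probs))))
# Low temperature concentrates mass on the top option (protocol mode);
# high temperature spreads it out, keeping unlikely options alive (zebra-hunt mode).
```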
Attention and Clinical Salience
Transformers weight certain inputs based on relevance. Clinicians do the same—"crushing" chest pain demands different attention than "since Tuesday." The attention mechanism in transformers learns which parts of the input are most relevant to predicting the next token; clinical expertise involves learning which parts of the history are most relevant to the diagnosis.
This is why the same symptom in different contexts triggers different responses. "Headache" in a healthy 25-year-old gets different attention than "headache" in an immunocompromised patient with fever. The input is the same; the attention weighting differs.
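Under the hood this is scaled dot-product attention. The sketch below uses random vectors in place of learned embeddings, so the specific weights are meaningless, but the mechanism is the real one: every position scores its relevance to every other position, and those scores become weights.

```python
import numpy as np

tokens = ["crushing", "chest", "pain", "since", "Tuesday"]
d = 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(len(tokens), d))  # queries (random stand-ins for learned embeddings)
K = rng.normal(size=(len(tokens), d))  # keys
V = rng.normal(size=(len(tokens), d))  # values

scores = Q @ K.T / np.sqrt(d)                                            # relevance of each token to every other
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)    # row-wise softmax
output = weights @ V                                                     # each position is a weighted mix of values

print(np.round(weights[0], 2))  # how much the first token "attends" to each of the others
```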
Chain-of-Thought: An Actionable Parallel
Chain-of-thought prompting asks an LLM to "think step by step" before answering. This consistently improves performance on complex reasoning tasks. Why? Because it forces the model to externalize intermediate steps rather than jumping directly to a conclusion.
Clinical parallels:
- Problem representation: Articulating the one-liner forces you to identify what's actually important
- Illness scripts: Explicitly matching findings to prototypical patterns
- Diagnostic time-outs: Pausing to ask "what am I missing?" before committing to a plan
The chain-of-thought insight is directly actionable: when reasoning through a complex case, externalize your thinking. Write out the problem representation. List the illness scripts you're considering. Articulate why you're ruling things in or out. This isn't just documentation—it's a reasoning intervention that catches errors.
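A sketch of the difference in prompt form; the case and the exact wording are illustrative, and any instruction that elicits intermediate steps behaves similarly.

```python
# Two ways to ask the same question; the second is a chain-of-thought prompt.
case = "67-year-old with dyspnea, orthopnea, bilateral leg edema, and a BNP of 1,200."

direct_prompt = f"{case}\nWhat is the most likely diagnosis?"

cot_prompt = (
    f"{case}\n"
    "Think step by step before answering:\n"
    "1. State the problem representation in one line.\n"
    "2. List the illness scripts that fit and those that don't.\n"
    "3. Explain what you are ruling in or out and why.\n"
    "4. Only then give the most likely diagnosis."
)
print(cot_prompt)
# The second prompt forces intermediate reasoning into the output, where errors can be caught.
```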
Recognizing that clinical reasoning is probabilistic pattern-completion isn't reductive. It's the first step toward doing it better, whether the pattern-matcher runs on neurons or GPUs.
How Models and Learners Develop
One striking finding: LLM capabilities don't emerge linearly. Scale up a model and, for a while, nothing changes; then, discontinuously, new abilities appear. Performance hovers near random until the model crosses a critical threshold of scale, after which it climbs well above random.
Medical learners develop similarly. The intern's progression isn't a smooth upward slope—it's plateaus punctuated by phase transitions when things "click." Pattern recognition that develops over thousands of encounters isn't additive; at some point, experienced clinicians develop gestalt—sensing "sick" versus "not sick" before articulating why.
Scaffolding Enables Performance Beyond Current Ability
Prompting techniques like chain-of-thought let models perform tasks they'd otherwise fail. Clinical teaching works the same way: the attending who walks through a case step by step enables the learner to perform beyond their independent level. Vygotsky called this the "zone of proximal development."
This has practical implications for AI tool use: the right scaffolding (prompting strategy) can enable an LLM to perform tasks it would otherwise fail. Similarly, the right clinical scaffolding (structured handoffs, checklists, decision support) can enable clinicians to perform at higher levels than unstructured practice allows.
Capability Overhang
Researchers regularly discover models can do things no one anticipated—the capability was latent, waiting for the right prompt. Learners show the same pattern: struggling with standard presentations, then surprising everyone on a complex case. Part of teaching is probing—finding the question that reveals what the learner can actually do.
Shared Failure Modes
Hallucination and Confabulation
LLMs generate confident, plausible text that's simply false—citations that don't exist, facts that were never true. The model isn't "lying"; it's completing patterns in ways that are statistically plausible but factually wrong.
Clinicians confabulate too. The confident diagnosis that turns out to be wrong, the remembered patient detail that was actually from a different case, the reconstruction of a clinical reasoning process that wasn't actually how you arrived at the diagnosis. Memory is reconstructive, not reproductive, and reconstruction introduces errors.
Mitigation: Verify independently. Don't trust confident output from either system without checking against primary sources. For LLMs, this means checking citations and facts. For yourself, this means building in verification steps and being epistemically humble about your own memory.
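For citations specifically, verification can be partly automated. This sketch queries Crossref's public REST API for a DOI; the DOI shown is a placeholder, and a missing or mismatched record means the citation needs human follow-up.

```python
import requests

def verify_doi(doi):
    """Look up a DOI on Crossref; return its recorded title/journal, or None if not found."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    if resp.status_code != 200:
        return None  # no such record: treat the citation as unverified
    item = resp.json()["message"]
    return {
        "title": (item.get("title") or [""])[0],
        "journal": (item.get("container-title") or [""])[0],
    }

print(verify_doi("10.1000/example-doi"))  # compare the returned title/journal to what was claimed
```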
These parallel failure modes create a dangerous feedback loop: physician "bullshit"—statements made with indifference to truth, often to appear knowledgeable—ends up in medical literature, EHR documentation, and online content that becomes LLM training data. The models then generate hallucinations that mimic the tone of their training data, perpetuating and scaling the spread of confidently asserted misinformation. Both systems ultimately suffer from what philosopher Quassim Cassam calls "epistemic insouciance"—a casual indifference to facts and evidence.
Overfitting and Representativeness
Models overfit when they learn training data too specifically, failing to generalize. Clinicians overfit to their training environment—the academic medical center resident who sees zebras everywhere, the community hospital attending who misses rare diseases.
Mitigation: Seek diverse exposure. For LLMs, this means training on diverse data. For clinicians, this means recognizing that your base rates are shaped by your practice environment and adjusting when you're in a different context.
Mode Collapse and Anchoring
LLMs can get stuck generating similar outputs regardless of input—a form of mode collapse. Clinical anchoring is analogous: once you have a working diagnosis, you fit new data to that frame rather than updating appropriately.
The patient labeled "frequent flyer" or "drug-seeking" gets the same differential regardless of new symptoms. The diagnosis made in the ED follows the patient through the admission even when new data contradicts it.
Mitigation: Deliberately generate alternatives. For LLMs, this means asking "what else could this be?" or regenerating with different prompts. For clinical reasoning, this means the diagnostic time-out: systematically considering what would change your mind.
Prompt Injection and History Contamination
LLMs can be manipulated by adversarial prompts that override intended behavior. Clinical reasoning is vulnerable to "history contamination"—the previous diagnosis or framing that shapes how you interpret new data.
The patient transferred with a diagnosis acquires that diagnosis as a cognitive anchor. The triage note that says "anxiety" shapes how the physician interprets chest pain. The chart that says "drug-seeking" determines how pain is managed regardless of current presentation.
Mitigation: Return to primary data. For LLMs, this means clear system prompts and input validation. For clinical reasoning, this means periodically asking: "What if I'd seen this patient fresh, without the prior framing?"
The Eliza Effect and Premature Trust
People anthropomorphize conversational systems, attributing understanding where there's only pattern matching. The same dynamic affects how patients perceive clinicians—fluent communication feels like comprehension.
The physician who uses the right words may not understand the patient's situation. The AI that generates grammatically correct output may not "understand" anything. Fluency is a heuristic for competence, but it's a fallible one.
Mitigation: Probe understanding. Don't assume fluent output reflects deep comprehension. Ask clarifying questions. Check for internal consistency. Evaluate the reasoning, not just the conclusion.
Calibration and Confidence
Well-calibrated systems express uncertainty that tracks accuracy—when they say 70% confident, they're right about 70% of the time. Both LLMs and clinicians struggle with calibration, often expressing more confidence than accuracy warrants.
The differential diagnosis rarely includes probability estimates, and when it does, those estimates are often poorly calibrated. LLMs similarly express confidence in ways that don't track accuracy—a hallucinated citation is delivered with the same linguistic certainty as a real one.
Mitigation: Be epistemically humble. Explicitly acknowledge uncertainty. Use calibration training (available for both humans and AI systems). When a system expresses high confidence, ask what would cause it to be wrong.
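A toy calibration check: bucket stated confidence, then compare it to how often the answer was actually right. The prediction/outcome pairs are fabricated for illustration.

```python
from collections import defaultdict

# (stated probability of being correct, whether the answer was actually correct)
predictions = [(0.9, 1), (0.8, 1), (0.8, 0), (0.7, 1), (0.6, 0),
               (0.6, 1), (0.3, 0), (0.3, 0), (0.2, 1), (0.1, 0)]

buckets = defaultdict(list)
for p, outcome in predictions:
    buckets[round(p, 1)].append(outcome)

for p in sorted(buckets):
    observed = sum(buckets[p]) / len(buckets[p])
    print(f"stated confidence {p:.0%}: correct {observed:.0%} of the time (n={len(buckets[p])})")
# Well-calibrated means the two columns track each other; a gap means over- or under-confidence.
```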
Where the Analogy Breaks Down
No analogy is perfect. Here's where the LLM/clinician parallel has limits:
- Embodiment: Clinicians have bodies that provide intuitions about pain, fatigue, and physical sensation. LLMs process tokens without embodied experience.
- Stakes and accountability: A clinician's errors have professional, legal, and moral consequences that shape reasoning. LLMs have no stakes in their outputs.
- Continuous learning: Clinicians update in real-time from individual cases. LLMs (mostly) have fixed training and don't learn from individual interactions.
- Metacognition: Clinicians can reflect on their own reasoning processes and deliberately change strategies. LLMs lack this recursive self-awareness.
- Causal reasoning: Clinicians understand (or try to understand) causal mechanisms. LLMs learn correlations without causal models.
These differences matter. An LLM that mimics clinical reasoning isn't doing clinical reasoning—it's producing outputs that look like clinical reasoning. The patterns are similar; the underlying substrate is different. Understanding both the parallels and the limits is essential for using these tools appropriately.
Quick-Reference: LLM Concepts → Clinical Equivalents
| LLM Concept | Clinical Equivalent | Actionable Insight |
|---|---|---|
| Pre-training | Medical school | Broad foundations enable later specialization |
| Fine-tuning | Residency/fellowship | Depth trades off against breadth |
| Prompt | HPI/presentation | Input quality determines output quality |
| Context window | Working memory | Both need external retrieval for complex cases |
| Temperature | Diagnostic breadth | Match exploration to clinical context |
| Attention | Clinical salience | Not all inputs deserve equal weight |
| Chain-of-thought | Problem representation | Externalizing reasoning improves accuracy |
| Hallucination | Confabulation | Verify independently; don't trust confident output |
| RAG | Using UpToDate | Pattern recognition + retrieval beats either alone |
| Few-shot learning | Learning from examples | Cases teach patterns; examples shape output |
Practical Implications
For Using AI Tools
- Treat outputs as first drafts from a capable but fallible trainee
- Provide structured, high-quality input—prompt engineering applies
- Verify independently at high-stakes decision points
- Probe for alternatives; confidence doesn't track accuracy
- Use chain-of-thought prompting for complex reasoning tasks
- Recognize that fluent output doesn't imply understanding
For Understanding Your Own Reasoning
- You're also a pattern-completion system, subject to the same failure modes
- Externalize your reasoning—articulating the logic catches errors intuition misses
- Use external retrieval freely; looking things up is expertise, not weakness
- Be epistemically humble about your own confidence calibration
- Recognize when you're overfitting to your training environment
- Build in verification steps; memory is reconstructive, not reproductive
The clinicians who thrive alongside AI understand both systems—their shared architecture, parallel failure modes, and complementary strengths. This module gives you the vocabulary to think clearly about both.
Exercises to Try
- Prompt comparison: Take a case you're working on. Write two prompts for an LLM—one vague, one structured. Compare the outputs. What did the structured prompt enable?
- Temperature mapping: Think of three recent clinical decisions. Which required "low temperature" (protocol-following) reasoning? Which required "high temperature" (exploratory) reasoning? Did you match your approach to the task?
- Chain-of-thought practice: Next time you're presenting a case, write out your problem representation before speaking. Did externalizing the reasoning change anything?
- Failure mode spotting: Review a recent case where something went wrong (yours or a colleague's). Which failure mode from this module best describes what happened? What mitigation would have helped?
In Start Here, you learned to use NotebookLM for document synthesis. This module's readings are perfect candidates:
- Upload the readings to a NotebookLM notebook
- Ask it to compare how emergent capabilities appear in LLMs vs. medical learners
- Generate an Audio Overview for commute-time review
- Use your verification skills—what does it get right? What does it oversimplify?
Practicing AI tool use while learning about AI tools reinforces both skills.
Readings
Podcasts & Blogs
Video
Reflection Questions
- Think of a recent diagnostic case. At what points were you doing "pattern completion"? When did you deliberately broaden or narrow your differential?
- How is premature closure in clinical reasoning similar to an LLM hallucinating with high confidence? What strategies help with both?
- When is it appropriate to reason with "low temperature" (protocol-driven) versus "high temperature" (exploratory)? Give examples of each.
- Which failure mode from this module do you think you're most susceptible to? What mitigation strategy will you try?
Learning Objectives
- Explain how both LLMs and clinical reasoning are probabilistic pattern-completion systems
- Identify parallel architectures: training/specialization, context windows, temperature, attention
- Describe emergent capabilities in both AI models and medical learners
- Recognize shared failure modes: hallucination, premature closure, fluency bias, and miscalibration
- Apply chain-of-thought prompting to improve both AI outputs and personal reasoning
- Use the quick-reference table to map LLM concepts to clinical practice