The Big Three
ChatGPT, Claude, and Gemini—your guide to today's leading foundation models and how to choose between them.
Which foundation model should you use—and does it actually matter?
Starting Here? Read This First
If you've jumped straight to this module hoping to pick a model and get started, you're in good company—this is exactly what most people want. But foundation models are tools, and like any tool, their value depends on how skillfully you use them. Before you dive too deep here, consider at least skimming these foundational concepts from earlier modules:
From Module 1 (How LLMs Think): These models work by predicting the most likely next word in a sequence, trained on enormous datasets of human-generated text. They don't "know" things the way you do—they recognize patterns. This matters because it explains both their remarkable capabilities and their characteristic failures.
From Module 2 (PHI and HIPAA): None of these consumer-facing chat interfaces are HIPAA-compliant out of the box. We'll discuss BAA pathways later in this module, but the critical principle remains: never enter PHI into a consumer AI product without proper safeguards.
From Module 3 (Prompting): The quality of your output depends enormously on the quality of your input. A vague prompt produces vague results. A well-structured prompt with context, role, and constraints produces dramatically better outputs. We'll reference prompting principles throughout this module—they apply equally to all three platforms.
Now, let's meet the models.[1]
The Med Student Analogy
Think of each foundation model as a brilliant medical student who has read essentially everything ever published—every textbook, every journal article, every clinical guideline, every case study, and frankly, every Reddit thread and random blog post too. This student has near-perfect recall of patterns across all that material and can synthesize information across domains in ways that would take you hours or days.
But here's what's crucial: this med student has never actually seen a patient. They haven't felt the resistance of tissue, watched a parent's face crumple at difficult news, or learned from the case that didn't follow the textbook. They know what clinical reasoning looks like on paper, but they don't have clinical judgment.
This framing helps calibrate expectations:
- They're genuinely useful for the kind of work where pattern recognition and information synthesis matter—drafting notes, summarizing literature, explaining concepts, generating differential diagnoses for discussion.
- They require supervision the same way any trainee does. You wouldn't let even a brilliant third-year student sign notes unsupervised, and you shouldn't let an AI do so either.
- They need clear instructions. Just as you'd give a student specific guidance ("I need a note that addresses the parent's concern about developmental delay, focuses on what we observed today, and includes our reasoning for not ordering imaging"), you need to give the model explicit context and constraints.
- They sometimes confabulate. A student who doesn't know something might make up an answer that sounds plausible rather than saying "I don't know." These models do the same—we call it "hallucination," but it's really just pattern-matching in the absence of actual knowledge. This is why you verify before you trust.
With that framing in mind, let's look at who these three "students" are and what each brings to your practice.
ChatGPT: The First Mover
The Story
OpenAI was founded in December 2015 by Sam Altman, Elon Musk, and others with the stated mission of developing artificial general intelligence that benefits humanity. The company initially operated as a nonprofit research lab before restructuring in 2019 to attract the investment needed for increasingly expensive AI training.
The GPT (Generative Pre-trained Transformer) architecture emerged from this research, with GPT-1 in 2018, GPT-2 in 2019 (initially withheld due to concerns about misuse), and GPT-3 in 2020. But the inflection point came on November 30, 2022, when OpenAI released ChatGPT as a free research preview. Within five days, it had a million users. Within two months, it reached 100 million—the fastest-growing consumer application in history.
That explosive growth fundamentally changed how the world understood AI. Suddenly, anyone could have a conversation with a system that felt like talking to a knowledgeable colleague. The technology wasn't new, but the accessibility was.
The Current Offering
As of late 2025, ChatGPT operates across several tiers:
| Tier | Price | What's included |
|---|---|---|
| ChatGPT Free | $0 | Access to GPT-4o with usage limits. Good for exploration and occasional use. |
| ChatGPT Plus | $20/month | Higher limits, priority access, o1 reasoning models, DALL-E, advanced voice mode. |
| ChatGPT Pro | $200/month | For power users. o1 pro mode, extended features, essentially unlimited usage. |
| ChatGPT Team | $25-30/user/month | Collaborative workspace. Data not used for training. Still not HIPAA-compliant without additional measures. |
Strengths
ChatGPT excels at breadth and general capability. Its training data is enormous and diverse, making it effective across an unusually wide range of tasks—from creative writing to coding to analysis. The ecosystem is mature, with a GPT Store containing millions of customized applications, extensive plugin support, and deep integrations with tools many people already use.
The voice and image capabilities are polished and genuinely useful. Advanced Voice Mode allows natural, conversational interaction that feels different from typing. Image generation through DALL-E has improved dramatically.
For coding assistance, ChatGPT remains strong, with Code Interpreter allowing it to execute Python code to analyze data, create visualizations, and iterate on complex problems.
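For a sense of what that looks like in practice, here is the kind of short pandas script Code Interpreter might generate and run against an uploaded spreadsheet. This is an illustrative sketch; the file name and column names are hypothetical.

```python
# Illustrative sketch of the kind of Python that Code Interpreter
# generates and executes when asked to analyze an uploaded CSV.
# The file name and column names here are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("clinic_visits.csv")  # hypothetical uploaded file

# Basic profiling: shape, column types, missing values
print(df.shape)
print(df.dtypes)
print(df.isna().sum())

# Count visits per month and plot the trend
df["visit_date"] = pd.to_datetime(df["visit_date"])
monthly = df.groupby(df["visit_date"].dt.to_period("M")).size()

monthly.plot(kind="bar", title="Visits per month")
plt.tight_layout()
plt.savefig("visits_per_month.png")
```

The point is less the code itself than the loop: you describe the analysis in plain language, the model writes and executes something like this, and you iterate on the result.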
Limitations
ChatGPT has historically been more prone to sycophancy: telling users what they seem to want to hear rather than pushing back when appropriate. An April 2025 update was rolled back because the model had become excessively agreeable, to the point of supporting clearly problematic ideas. OpenAI has worked to address this, but it remains a known tendency.
The HIPAA pathway is more complex than some alternatives. OpenAI offers BAAs for its API services; for the ChatGPT product itself, a BAA is available only through Enterprise or Edu plans with sales-managed accounts. The Free, Plus, Pro, and even Team plans are explicitly not covered by BAAs and cannot be used with protected health information.
Claude: The Safety-First Approach
The Story
Anthropic was founded in 2021 by Dario and Daniela Amodei, along with several other former OpenAI researchers and executives. The founding team included key figures in AI safety research, and that orientation shaped the company's approach from the beginning.
The company developed "Constitutional AI," a training approach that uses AI feedback (rather than exclusively human feedback) to shape model behavior according to a set of principles. The goal was to create systems that are helpful but also harmless and honest—what Anthropic describes as the "HHH" framework.
Claude 1.0 launched in March 2023, positioning itself as a thoughtful alternative to ChatGPT. The Claude 3 family arrived in March 2024, introducing the Haiku/Sonnet/Opus tiering (small/medium/large models with different capability and cost profiles). By late 2025, Claude Opus 4.5 emerged as the latest iteration, with Anthropic positioning it as "the best model in the world for coding, agents, and computer use."
The Current Offering
| Tier | Price | What's included |
|---|---|---|
| Claude Free | $0 | Access to Claude with usage limits that reset every few hours. Good for exploration. |
| Claude Pro | $20/month | ~5x usage of free tier, all models including Opus, priority access to new features. |
| Claude Max | $100-200/month | 5x-20x Pro limits. Designed for power users, especially Claude Code developers. |
| Claude Team | $25-30/user/month | Collaborative features, admin controls. Data not used for training. Minimum 5 seats. |
Strengths
Claude's distinguishing characteristic is nuanced reasoning and writing quality. Users consistently report that Claude's outputs feel more thoughtful, with better handling of ambiguity and more natural prose. For drafting patient communications, clinical notes, or educational materials, many find Claude produces text that requires less editing.
The safety orientation manifests in several ways: Claude is less likely to generate harmful content, more likely to express appropriate uncertainty, and specifically trained to reduce behaviors like sycophancy, deception, and "power-seeking." For clinical applications where reliability matters, this orientation has value.
Long context handling is excellent. Claude supports 200,000-token context windows (roughly 150,000 words), allowing you to upload substantial documents—research papers, clinical protocols, patient histories—and have coherent conversations about them.
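If you work through the API rather than the chat interface, the same long-document workflow looks roughly like this minimal sketch using Anthropic's Python SDK. The model name is a placeholder; check current documentation for available identifiers.

```python
# Minimal sketch, assuming Anthropic's Python SDK: send a long
# de-identified document to Claude for summarization.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("clinical_protocol.txt") as f:  # hypothetical de-identified document
    document = f.read()

message = client.messages.create(
    model="claude-opus-4-5",  # placeholder model name
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Summarize the key practice implications of this protocol "
                   "in five bullet points:\n\n" + document,
    }],
)
print(message.content[0].text)
```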
Claude Code represents a genuine differentiator for anyone who works with code or technical systems. The ability to delegate coding tasks through natural language, with the model maintaining context across complex multi-step projects, has attracted significant developer adoption.
Limitations
Claude can be more conservative than some alternatives. The same safety orientation that prevents harmful outputs sometimes means Claude declines requests that other models would handle. This can occasionally feel like friction for benign use cases.
The ecosystem is less developed than ChatGPT's. There's no equivalent to the GPT Store, fewer integrations with third-party tools, and a smaller community creating resources and tutorials.
Image generation is not a native Claude capability. While Claude can analyze images you provide and create artifacts like code or documents, it doesn't generate images from scratch the way ChatGPT or Gemini can.
Gemini: The Integration Play
The Story
Google's path to Gemini began with earlier AI efforts including LaMDA (the model behind the original Bard chatbot) and PaLM (Pathways Language Model). Google's AI research long predates the current generation (the transformer architecture underlying all of these models came out of Google in 2017), positioning the company as a natural contender once the race began.
Gemini was announced in December 2023 as Google's response to GPT-4, with the company emphasizing its native multimodal training—the model was trained from the beginning on text, images, and other modalities together, rather than having capabilities bolted on afterward.
Gemini 1.5 arrived in February 2024 with a groundbreaking 1-million-token context window (later expanded to 2 million)—dramatically larger than competitors at the time. In November 2025, Google announced Gemini 3, positioned as "our most intelligent model" with significant improvements in reasoning, multimodal understanding, and agent capabilities.
The Current Offering
| Tier | Price | What's included |
|---|---|---|
| Gemini Free | $0 | Access to Gemini 2.5 Flash, basic features, limited Deep Research access. |
| Google AI Pro | $19.99/month | Gemini 2.5 Pro, 1M-token context, Deep Research, Workspace integration, 2TB storage. |
| Google AI Ultra | $249.99/month | Gemini 3 Pro with Deep Think, highest limits, 30TB storage, Veo 3 video generation. |
Strengths
The Google ecosystem integration is Gemini's defining advantage. If your organization lives in Google Workspace, Gemini works natively within Gmail, Docs, Sheets, Slides, and Meet. The side panel in Google Docs can help you write, summarize, and refine. Gemini in Gmail drafts responses and summarizes threads. For organizations already committed to Google, this reduces friction dramatically.
Multimodal capabilities are genuinely strong. Gemini's ability to process video (not just images) allows analysis of recorded content, extraction of key moments, and summarization of visual information in ways that currently exceed competitors.
The context window remains best-in-class at 1-2 million tokens, allowing you to work with enormous documents or entire codebases without hitting limits.
For healthcare specifically, Google Workspace with Gemini is HIPAA-eligible. Google's HIPAA Included Functionality list explicitly covers Gemini in Workspace. Organizations that sign Google's Business Associate Amendment through the Admin Console can use these services with protected health information.
Limitations
Gemini's full potential requires the Google ecosystem. If you don't use Google Workspace, many of the integration advantages disappear. For organizations on Microsoft 365 or other platforms, the standalone Gemini experience is less compelling.
The pricing structure is more complex, and the highest tier ($249.99/month for Ultra) is significantly more expensive than competitors' premium offerings. While it includes substantial bundled value (storage, YouTube Premium, advanced video generation), the headline price may exceed what many users need.
Side-by-Side Comparison
| Factor | ChatGPT | Claude | Gemini |
|---|---|---|---|
| Developer | OpenAI (Microsoft-backed) | Anthropic (Amazon/Google-backed) | Google DeepMind |
| Consumer Price Entry | $20/month (Plus) | $20/month (Pro) | $19.99/month (AI Pro) |
| Premium Tier | $200/month (Pro) | $200/month (Max 20x) | $249.99/month (Ultra) |
| Context Window | 128K tokens | 200K tokens | 1-2M tokens |
| Native Image Gen | Yes (DALL-E) | No | Yes (Imagen) |
| Voice Mode | Advanced Voice Mode | Limited | Gemini Live |
| Ecosystem | GPT Store, plugins | MCP integrations | Google Workspace |
| Writing Style | Engaging, occasionally verbose | Nuanced, thoughtful | Variable, improving |
HIPAA/BAA Comparison
| Platform | Consumer Chat BAA | API BAA | Easiest Path |
|---|---|---|---|
| ChatGPT | Enterprise/Edu only | Yes, via application | Azure OpenAI Service |
| Claude | No | Yes, with zero-data-retention (ZDR) agreement | AWS Bedrock |
| Gemini | Workspace integration covered | Yes via Vertex AI | Google Workspace + BAA |
The consumer chat interfaces most people use day to day (ChatGPT Plus, Claude Pro, standard Gemini) are not HIPAA-compliant and should never be used with PHI without additional safeguards.
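To make "additional safeguards" concrete, here is a deliberately naive sketch of a pattern screen that flags obvious identifiers before text is pasted anywhere. This is a teaching toy, not a de-identification method or a HIPAA control: notice that it cannot catch names, addresses, or rare diagnoses, which is exactly why BAA-covered pathways and formal de-identification processes exist.

```python
# Toy illustration only: flag obvious identifiers before text goes
# anywhere near a consumer AI tool. NOT a de-identification method
# and NOT a HIPAA safeguard; real compliance needs a BAA-covered
# platform and a proper de-identification process.
import re

PATTERNS = {
    "date":  re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "mrn":   re.compile(r"\bMRN[:#]?\s*\d+\b", re.IGNORECASE),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def flag_possible_phi(text: str) -> list[str]:
    """Return the pattern names that matched. An empty list means
    'nothing obvious found', which is NOT the same as 'safe'."""
    return [name for name, rx in PATTERNS.items() if rx.search(text)]

print(flag_possible_phi("Pt seen 3/14/2025, MRN: 448812, call 555-867-5309"))
# -> ['date', 'phone', 'mrn']
```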
What Each Model Does Best
Based on extensive user reports and benchmark data, here's where each model tends to excel:
ChatGPT Excels At
- Breadth of capability: One tool that does many things well
- Creative content: Marketing copy, engaging educational materials
- Image generation: Visuals for presentations or patient education
- Voice interaction: When you want to talk rather than type
- Ecosystem integration: Custom GPTs and plugins
- Step-by-step tutorials: Clear, accessible explanations
Claude Excels At
- Nuanced writing: Patient communications, clinical notes
- Complex analysis: Reasoning through ambiguous situations
- Long document work: Summarizing papers, analyzing protocols
- Coding and technical work: Especially via Claude Code
- Safety-sensitive applications: Conservative guardrails
- Academic writing: Papers, grants, formal documentation
Gemini Excels At
- Google Workspace integration: AI woven throughout your tools
- Video and multimodal analysis: Processing recorded content
- Massive context: Entire books, codebases, extensive docs
- Real-time assistance: Voice connected to calendar and email
- HIPAA-eligible productivity: daily workflow AI under a Workspace BAA
- Data analysis: Spreadsheets and structured information
Practical Guidance: Just Start
The most important advice in this entire module is this: pick one and start using it.
The differences between these models matter at the margins. For the vast majority of tasks most clinicians will encounter, all three will produce useful results. The limiting factor isn't which model you choose—it's whether you've developed the skill to use it effectively.
Week 1: Establish a Baseline
Choose whichever model you have easiest access to. For your first week, use it for low-stakes tasks:
- Summarizing an article you're reading anyway
- Drafting an email you'll heavily edit
- Explaining a concept to yourself before explaining it to a patient
- Brainstorming questions for a meeting
Don't worry about optimization. Just get comfortable with the interaction pattern.
Week 2: Apply Prompting Principles
Revisit the prompting framework from Module 3 and apply it deliberately:
- Give the model a role ("You are helping me prepare for a difficult conversation with a parent")
- Provide context ("The patient is 8 years old with a new ADHD diagnosis")
- Be specific about format ("I need three main points, in language a non-medical parent would understand")
- Include constraints ("Avoid medical jargon; emphasize that this is manageable")
Notice how the quality of outputs changes as you prompt more skillfully.
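If you prefer to see the structure laid out explicitly, here is a minimal sketch that assembles those four elements into a single prompt. The same text works pasted into any of the three chat interfaces; the API call shown uses OpenAI's Python SDK with a placeholder model name.

```python
# Minimal sketch: the Week 2 structure (role, context, format,
# constraints) assembled into one prompt.
from openai import OpenAI

prompt = "\n".join([
    "You are helping me prepare for a difficult conversation with a parent.",        # role
    "The patient is 8 years old with a new ADHD diagnosis.",                         # context
    "I need three main points, in language a non-medical parent would understand.",  # format
    "Avoid medical jargon; emphasize that this is manageable.",                      # constraints
])

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; use whatever model your plan includes
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```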
Week 3: Try Something Harder
Push into a task that actually matters:
- Draft a real (de-identified) clinical note
- Summarize a complex patient case for a referral
- Create patient education materials for a condition you see frequently
- Analyze a clinical guideline and identify key practice implications
Evaluate the output critically. What did it get right? Where did it need correction? What would you prompt differently next time?
Week 4: Compare
Now try a second model with a task you've done before. Use the same prompt and compare outputs. You'll develop intuition for the differences—and often find that your preference is less about the model and more about how you've learned to work with it.
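For the programmatically inclined, the same comparison can be scripted against each vendor's published Python SDK. This is a sketch under current SDK conventions; the model names are placeholders that change over time.

```python
# Sketch of the Week 4 exercise done programmatically: one prompt,
# three providers, outputs printed side by side.
import os

import anthropic
import google.generativeai as genai
from openai import OpenAI

prompt = "Explain febrile seizures to a worried parent in three short paragraphs."

gpt_text = OpenAI().chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
).choices[0].message.content

claude_text = anthropic.Anthropic().messages.create(
    model="claude-opus-4-5",  # placeholder model name
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}],
).content[0].text

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
gemini_text = genai.GenerativeModel("gemini-1.5-pro").generate_content(prompt).text  # placeholder

for name, text in [("ChatGPT", gpt_text), ("Claude", claude_text), ("Gemini", gemini_text)]:
    print(f"=== {name} ===\n{text}\n")
```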
The Cost Question
Let's be direct about money.
Free tiers are sufficient for exploration and light use. If you're using AI occasionally—a few times a week for non-critical tasks—you may never need to pay.
$20/month is the standard paid tier. ChatGPT Plus, Claude Pro, and Google AI Pro all cluster around this price point. At this tier, you get higher usage limits, access to the best models, and priority features. For a professional tool you might use daily, $20/month is modest—less than many software subscriptions with narrower utility. This is where most regular users land.
Premium tiers ($200-250/month) are for power users. If you're running into limits on the $20 tier, pushing complex coding projects, or need maximum model capabilities for professional work, the premium tiers exist. But most users won't need them. A sensible progression:
- Start with free tiers
- Upgrade to ~$20/month when you hit limits regularly
- Evaluate premium tiers only if you're a genuine power user
- Discuss organizational deployment with compliance and IT before rolling out team solutions
Understanding AI Benchmarks
You'll often see AI companies touting benchmark scores when announcing new models. Headlines declare one model "beats" another on some test. But what do these numbers actually mean—and more importantly, what don't they mean?
What Benchmarks Measure
Benchmarks are standardized tests designed to evaluate specific AI capabilities. They provide a common yardstick for comparing models on tasks like:
- Knowledge recall: Can the model answer questions across academic domains?
- Reasoning: Can it work through multi-step logic problems?
- Coding: Can it write functional code that solves programming challenges?
- Domain expertise: How does it perform on professional exams (medical, legal, etc.)?
What Benchmarks Don't Measure
Here's what no benchmark captures—and what often matters most for real-world use:
- Writing quality: Does the output sound natural and require minimal editing?
- Appropriate uncertainty: Does the model say "I don't know" when it should?
- Instruction following: Does it do what you actually asked, or what it thinks you asked?
- Consistency: Does it give similar quality outputs across multiple attempts?
- Your specific use case: A model that aces coding benchmarks might produce worse clinical notes than one that scores lower.
A model scoring 90% vs 88% on a knowledge test is not meaningfully different for most practical purposes. What matters is whether the model helps you do your work better. The only benchmark that truly matters is your own experience using the tool.
How to Use Benchmark Data
- As a rough filter: If a model scores dramatically lower on everything, it's probably less capable overall.
- For specific tasks: If you primarily need coding help, look at coding benchmarks. For medical questions, look at MedQA scores.
- With skepticism: Companies often cherry-pick favorable benchmarks. Independent testing (like LMArena) provides more balanced pictures.
- As a starting point: Use benchmarks to decide which 2-3 models to try, then let your own experience guide your choice.
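One way to operationalize that last point is a tiny personal benchmark: a handful of your own recurring tasks, run through each candidate model, with outputs saved for side-by-side review. The sketch below assumes a hypothetical ask() helper that you wire to whichever SDKs or interfaces you use; the task text and model names are illustrative placeholders.

```python
# A tiny personal "benchmark": run your own recurring tasks through
# the models you're considering and save outputs for review.
from datetime import date

def ask(model: str, prompt: str) -> str:
    """Hypothetical helper: replace with a real SDK call per provider
    (or paste the prompt into each chat interface and record answers)."""
    raise NotImplementedError

MY_TASKS = {
    "patient_education": "Write a one-page handout on ear infections for parents.",
    "note_draft": "Draft a SOAP note from this de-identified visit summary: ...",
    "literature": "Summarize the findings and limitations of this abstract: ...",
}

MODELS = ["gpt-5.1", "claude-opus-4.5", "gemini-3-pro"]  # placeholder names

# Collect everything in one markdown file for side-by-side review.
with open(f"my_eval_{date.today()}.md", "w") as out:
    for task, prompt in MY_TASKS.items():
        out.write(f"## {task}\n\n")
        for model in MODELS:
            out.write(f"### {model}\n\n{ask(model, prompt)}\n\n")
```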
Key Benchmarks Explained
| Benchmark | What It Tests | Why It Matters |
|---|---|---|
| Humanity's Last Exam | 2,500 expert-level questions across all disciplines, designed to be genuinely difficult | The hardest general benchmark; tests limits of AI reasoning |
| GPQA Diamond | PhD-level science questions (physics, chemistry, biology) | Deep reasoning in scientific domains |
| MedQA | USMLE-style medical licensing questions | Medical knowledge directly relevant to clinical practice |
| SimpleQA | Short fact-seeking questions with single correct answers | Measures hallucination rate and factual accuracy |
| SimpleBench | Common-sense reasoning that humans find easy but AI finds hard | Tests practical reasoning vs pattern matching |
| LMArena Elo | Human preference ratings from blind head-to-head comparisons | Real users choosing preferred outputs—closest to "usefulness" |
| SWE-Bench | Fixing real bugs in actual open-source codebases | Real-world software engineering capability |
Current Benchmark Results (November 2025)
Below are scores for the current flagship models: OpenAI's GPT-5.1, Anthropic's Claude Opus 4.5, and Google's Gemini 3 Pro. These represent the state of the art as of late November 2025.
| Benchmark | GPT-5.1 | Claude Opus 4.5 | Gemini 3 Pro | Notes |
|---|---|---|---|---|
| Humanity's Last Exam | ~36% | ~35% | 45.8% | Gemini leads on hardest reasoning test |
| GPQA Diamond | 88.1% | 87.0% | 91.9% | Gemini ahead on PhD-level science |
| MedQA | ~96% | ~94% | ~93% | All far exceed passing threshold (~60%) |
| SimpleQA (Factuality) | ~63% | ~45% | 72.1% | Gemini leads on factual accuracy |
| LMArena Elo | ~1480 | ~1470 | 1501 | Gemini tops human preference ratings |
| SWE-Bench Verified | 76.3% | 80.9% | 76.2% | Claude leads on real-world coding |
As of November 2025, Gemini 3 Pro leads on reasoning benchmarks (Humanity's Last Exam, GPQA) and factual accuracy (SimpleQA). Claude Opus 4.5 dominates real-world coding tasks (SWE-Bench). GPT-5.1 excels on medical knowledge (MedQA) and offers the most mature ecosystem. But here's what matters: all three score 93%+ on MedQA—far above the passing threshold. For clinical use, all are capable enough. Your choice should depend on ecosystem fit, pricing, and personal preference rather than benchmark margins.
Common Pitfalls and How to Avoid Them
- Trusting without verifying: These models confabulate confidently. Verify facts, citations, and dosages before anything reaches a patient or a chart.
- Entering PHI into consumer tools: No consumer chat tier is HIPAA-compliant. Use de-identified text or a BAA-covered pathway.
- Prompting vaguely: Vague prompts produce vague results. Apply the role, context, format, and constraints structure from Module 3.
- Mistaking agreeableness for accuracy: Models can be sycophantic. Ask for counterarguments and weaknesses, not just confirmation.
- Chasing benchmark margins: A few points on a leaderboard matters far less than your skill with the tool you have.
Quick Reference: Getting Started
| Platform | URL | Mobile | Sign-up | Best for |
|---|---|---|---|---|
| ChatGPT | chat.openai.com | iOS and Android apps | Google, Microsoft, or email | Broad capabilities, images, mature ecosystem |
| Claude | claude.ai | iOS and Android apps | Google, email, or phone | Nuanced writing, analysis, coding, long documents |
| Gemini | gemini.google.com | iOS and Android apps | Google account required | Google Workspace, massive context, HIPAA via Workspace |
Action Items
Before moving to the next module, complete at least one of these:
- Create accounts on all three platforms (if you haven't already). Even if you end up preferring one, having tried the others gives you useful context.
- Run the same prompt through all three and compare outputs. Notice the differences in tone, organization, and approach.
- Try something you actually need: Draft a real email, summarize a real article, prepare for a real conversation. Experience how the tool performs on your actual work.
- Hit a limit and recover: Deliberately work on something complex enough that your first output isn't good. Practice iterative refinement until you're satisfied.
- If you're in a healthcare organization: Identify which HIPAA pathway makes sense for your situation and discuss it with appropriate stakeholders.
Summary
ChatGPT, Claude, and Gemini are all capable foundation models that can meaningfully assist clinical work. ChatGPT offers the broadest ecosystem and most mature feature set. Claude provides nuanced reasoning and a safety-first approach. Gemini integrates deeply with Google Workspace and offers the easiest HIPAA pathway for organizations already in that ecosystem.
The choice between them matters less than developing skill with whichever you choose. These are tools that respond to how you use them. A well-crafted prompt to any of these models will outperform a vague prompt to the "best" model.
Think of them as brilliant but inexperienced colleagues: genuinely helpful, occasionally wrong, always requiring supervision. With that framing and the prompting skills from earlier modules, you're ready to begin. Now close this document and go have a conversation with one of them. That's where the real learning happens.
Learning Objectives
- Compare the capabilities, strengths, and limitations of ChatGPT, Claude, and Gemini
- Identify which model best fits specific clinical use cases
- Understand HIPAA compliance pathways for each platform
- Apply a practical framework for getting started with foundation models
- Recognize common pitfalls in AI adoption and strategies to avoid them
Notes
1. Why not Grok? You may wonder why xAI's Grok isn't included here. While Grok has some capable underlying technology, we don't recommend it for clinical use due to significant concerns about accuracy and safety guardrails.
In early 2025, Grok generated and spread false information about prominent public figures, including fabricated claims about an NBA owner that were widely amplified on X (formerly Twitter). The system was also found to produce misleading election-related content and struggled with basic factual queries where other models performed reliably. Independent evaluations have noted that Grok's safety measures are notably weaker than those of ChatGPT, Claude, or Gemini—the model is more likely to generate harmful content when prompted.
For clinical applications where accuracy and appropriate guardrails matter, these issues are disqualifying. The three platforms covered in this module have demonstrated more robust approaches to safety and factual accuracy—though as we discuss throughout, all AI outputs require verification.
References:
- Newsweek. "Mark Cuban Confronts Elon Musk Using His Own AI Bot." 2025.
- Axios. "Musk's AI chatbot spread election misinformation, secretaries of state say." August 2024.
- Center for Countering Digital Hate. "Grok AI Election Disinformation." 2024.
- Palo Alto Networks Unit 42. "How Good Are the LLM Guardrails on the Market?" 2024.