The Big Three
ChatGPT, Claude, and Gemini—your guide to today's leading foundation models and how to choose between them.
Which foundation model should you use—and does it actually matter?
Starting Here? Read This First
If you've jumped straight to this module hoping to pick a model and get started, you're in good company—this is exactly what most people want. But foundation models are tools, and like any tool, their value depends on how skillfully you use them. Before you dive too deep here, consider at least skimming these foundational concepts from earlier modules:
From Module 1 (How LLMs Think): These models work by predicting the most likely next word in a sequence, trained on enormous datasets of human-generated text. They don't "know" things the way you do—they recognize patterns. This matters because it explains both their remarkable capabilities and their characteristic failures.
From Module 2 (PHI and HIPAA): None of these consumer-facing chat interfaces are HIPAA-compliant out of the box. We'll discuss BAA pathways later in this module, but the critical principle remains: never enter PHI into a consumer AI product without proper safeguards.
From Module 3 (Prompting): The quality of your output depends enormously on the quality of your input. A vague prompt produces vague results. A well-structured prompt with context, role, and constraints produces dramatically better outputs. We'll reference prompting principles throughout this module—they apply equally to all three platforms.
Now, let's meet the models.[1]
The Med Student Analogy
Think of each foundation model as a brilliant medical student who has read essentially everything ever published—every textbook, every journal article, every clinical guideline, every case study, and frankly, every Reddit thread and random blog post too. This student has near-perfect recall of patterns across all that material and can synthesize information across domains in ways that would take you hours or days.
But here's what's crucial: this med student has never actually seen a patient. They haven't felt the resistance of tissue, watched a parent's face crumple at difficult news, or learned from the case that didn't follow the textbook. They know what clinical reasoning looks like on paper, but they don't have clinical judgment.
This framing helps calibrate expectations:
- They're genuinely useful for the kind of work where pattern recognition and information synthesis matter—drafting notes, summarizing literature, explaining concepts, generating differential diagnoses for discussion.
- They require supervision the same way any trainee does. You wouldn't let even a brilliant third-year student sign notes unsupervised, and you shouldn't let an AI do so either.
- They need clear instructions. Just as you'd give a student specific guidance ("I need a note that addresses the parent's concern about developmental delay, focuses on what we observed today, and includes our reasoning for not ordering imaging"), you need to give the model explicit context and constraints.
- They sometimes confabulate. A student who doesn't know something might make up an answer that sounds plausible rather than saying "I don't know." These models do the same—we call it "hallucination," but it's really just pattern-matching in the absence of actual knowledge. This is why you verify before you trust.
With that framing in mind, let's look at who these three "students" are and what each brings to your practice.
Knowing Your Options
Before diving into each model, it's worth understanding the landscape:
- ChatGPT is the household name. For most people, "ChatGPT" is synonymous with AI chat. It has the largest consumer user base and the most cultural awareness. When your patients or colleagues mention "AI," they're usually thinking of ChatGPT.
- Many people don't realize they already have Gemini. If you have a Google account, you have access to Gemini. It's built into Google Search, available at gemini.google.com, and integrated throughout Google Workspace. Yet many users have never tried it.
- Claude is often discovered second. People typically find Claude when looking for alternatives, exploring options for specific use cases, or hearing recommendations from colleagues. It has a smaller but dedicated user base.
None of this tells you which is "best"—that depends entirely on your needs, your ecosystem, and your preferences. The point is simply: you have options, and it's worth knowing what they are.
ChatGPT: The First Mover
The Story
OpenAI was founded in December 2015 by Sam Altman, Elon Musk, and others with the stated mission of developing artificial general intelligence that benefits humanity. The company initially operated as a nonprofit research lab before restructuring in 2019 to attract the investment needed for increasingly expensive AI training.
The GPT (Generative Pre-trained Transformer) architecture emerged from this research, with GPT-1 in 2018, GPT-2 in 2019 (initially withheld due to concerns about misuse), and GPT-3 in 2020. But the inflection point came on November 30, 2022, when OpenAI released ChatGPT as a free research preview. Within five days, it had a million users. Within two months, it reached 100 million—at the time, the fastest-growing consumer application in history.
That explosive growth fundamentally changed how the world understood AI. Suddenly, anyone could have a conversation with a system that felt like talking to a knowledgeable colleague. The technology wasn't new, but the accessibility was.
The Current Offering
As of late 2025, ChatGPT operates across several tiers:
| Tier | Price | Includes |
|---|---|---|
| ChatGPT Free | $0 | Access to GPT-4o with usage limits. Good for exploration and occasional use. |
| ChatGPT Plus | $20/month | Higher limits, priority access, o1 reasoning models, DALL-E, advanced voice mode. |
| ChatGPT Pro | $200/month | For power users. o1 pro mode, extended features, essentially unlimited usage. |
| ChatGPT Team | $25-30/user/month | Collaborative workspace. Data not used for training. Still not HIPAA-compliant without additional measures. |
What You Get
Ecosystem: The most mature AI ecosystem. The GPT Store contains customized applications for specific use cases, extensive plugin support, and integrations with tools many people already use. If you want to find a pre-built solution for a specific task, ChatGPT's ecosystem is the most likely place to find it.
Features: Advanced Voice Mode for natural conversation, image generation through DALL-E, Code Interpreter for data analysis and visualization, and web browsing for current information.
HIPAA pathway: BAAs available for API services and Enterprise/Edu plans with sales-managed accounts. ChatGPT Free, Plus, Pro, and Team plans are explicitly not covered by BAAs and cannot be used with protected health information.
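For organizations that go the API route, the sketch below shows roughly what a call through Azure OpenAI Service looks like in Python. It is a minimal sketch: the endpoint, deployment name, and API version are placeholders, your organization's values will differ, and a signed BAA plus compliance sign-off are still prerequisites before any PHI is involved.

```python
# Minimal sketch of an Azure OpenAI chat call. Endpoint, deployment name, and
# API version below are placeholders; requires the `openai` package and an
# Azure OpenAI resource covered by a signed BAA.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://your-resource-name.openai.azure.com",  # placeholder
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",  # check your resource for the current version
)

response = client.chat.completions.create(
    model="your-gpt-deployment-name",  # the deployment you created, not a public model name
    messages=[
        {"role": "system", "content": "You are a clinical documentation assistant."},
        {"role": "user", "content": "Draft a brief patient-education paragraph about seasonal allergies."},
    ],
)
print(response.choices[0].message.content)
```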
Claude: The Safety-First Approach
The Story
Anthropic was founded in 2021 by Dario and Daniela Amodei, along with several other former OpenAI researchers and executives. The founding team included key figures in AI safety research, and that orientation shaped the company's approach from the beginning.
The company developed "Constitutional AI," a training approach that uses AI feedback (rather than exclusively human feedback) to shape model behavior according to a set of principles. The goal was to create systems that are helpful but also harmless and honest—what Anthropic describes as the "HHH" framework.
Claude 1.0 launched in March 2023, positioning itself as a thoughtful alternative to ChatGPT. The Claude 3 family arrived in March 2024, introducing the Haiku/Sonnet/Opus tiering (small/medium/large models with different capability and cost profiles). By late 2025, Claude Opus 4.5 emerged as the latest iteration, with Anthropic positioning it as "the best model in the world for coding, agents, and computer use."
The Current Offering
| Tier | Price | Includes |
|---|---|---|
| Claude Free | $0 | Access to Claude with usage limits that reset every few hours. Good for exploration. |
| Claude Pro | $20/month | ~5x free-tier usage, all models including Opus, priority access to new features. |
| Claude Max | $100-200/month | 5x-20x Pro limits. Designed for power users, especially Claude Code developers. |
| Claude Team | $25-30/user/month | Collaborative features, admin controls. Data not used for training. Minimum 5 seats. |
What You Get
Ecosystem: Smaller than ChatGPT's but growing. The MCP (Model Context Protocol) allows integrations with external tools and data sources. Claude Code provides command-line AI assistance for developers. The ecosystem emphasizes depth over breadth.
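To make "MCP integration" concrete, here is a minimal sketch of a custom tool exposed to Claude through the official MCP Python SDK (the `mcp` package). The tool name and formulary entries are invented for illustration; a real integration would wrap your own systems and go through your organization's security review.

```python
# Minimal MCP server sketch using the official Python SDK (`pip install mcp`).
# The tool below is a made-up example; real servers would wrap your own data sources.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("clinic-formulary")  # server name shown to the client

@mcp.tool()
def formulary_lookup(drug_name: str) -> str:
    """Return a (hypothetical) formulary note for a drug."""
    formulary = {
        "amoxicillin": "Preferred first-line agent; no prior authorization required.",
        "montelukast": "Covered; counsel on neuropsychiatric warning.",
    }
    return formulary.get(drug_name.lower(), "Not found in local formulary.")

if __name__ == "__main__":
    mcp.run()  # communicates with the client over stdio by default
```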
Features: 200,000-token context window (roughly 150,000 words) for working with substantial documents. Projects feature for organizing related conversations. Artifacts for generating code, documents, and other outputs. No native image generation, but can analyze images you provide.
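That token-to-word conversion is approximate. If you want a rough sense of whether a document will fit in a context window, a back-of-envelope estimate like the sketch below is usually close enough; the 1.3 tokens-per-word ratio is a rule of thumb for English prose and varies by model and tokenizer.

```python
# Rough context-window check; the ratio is a heuristic, not an exact tokenizer count.
def estimate_tokens(text: str, tokens_per_word: float = 1.3) -> int:
    # ~1.3 tokens per English word is a common rule of thumb.
    return int(len(text.split()) * tokens_per_word)

def fits_in_context(text: str, context_window: int = 200_000) -> bool:
    # Leave ~20% headroom for your prompt and the model's response.
    return estimate_tokens(text) < context_window * 0.8

guideline_text = open("guideline.txt").read()  # hypothetical file: any long document
print(estimate_tokens(guideline_text), "estimated tokens;",
      "should fit" if fits_in_context(guideline_text) else "too large for a 200K window")
```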
HIPAA pathway: BAAs available through the API with Zero Data Retention (ZDR) agreement. Consumer chat interface (Claude.ai) is not covered. AWS Bedrock provides the most straightforward enterprise pathway.
Gemini: The Integration Play
The Story
Google's path to Gemini began with earlier AI efforts including LaMDA (the model behind the original Bard chatbot) and PaLM (Pathways Language Model). Google's deeper AI research record, including the 2017 transformer architecture that underlies all of these models, long predates the current generation and positioned the company as a natural major player once the race began.
Gemini was announced in December 2023 as Google's response to GPT-4, with the company emphasizing its native multimodal training—the model was trained from the beginning on text, images, and other modalities together, rather than having capabilities bolted on afterward.
Gemini 1.5 arrived in February 2024 with a groundbreaking 1-million-token context window (later expanded to 2 million)—dramatically larger than competitors at the time. In November 2025, Google announced Gemini 3, positioned as "our most intelligent model" with significant improvements in reasoning, multimodal understanding, and agent capabilities.
The Current Offering
| Tier | Price | Includes |
|---|---|---|
| Gemini Free | $0 | Access to Gemini 2.5 Flash, basic features, limited Deep Research access. |
| Google AI Pro | $19.99/month | Gemini 2.5 Pro, 1M-token context, Deep Research, Workspace integration, 2TB storage. |
| Google AI Ultra | $249.99/month | Gemini 3 Pro with Deep Think, highest limits, 30TB storage, Veo 3 video generation. |
What You Get
Ecosystem: Deep Google Workspace integration. If your organization lives in Gmail, Docs, Sheets, and Meet, Gemini works natively within those tools—the side panel in Google Docs helps you write, Gemini in Gmail drafts responses and summarizes threads. For organizations already committed to Google, this reduces friction dramatically. If you don't use Google Workspace, much of this advantage disappears.
Features: 1-2 million token context window—the largest available—for working with enormous documents or entire codebases. Strong multimodal capabilities including video analysis. Image generation through Imagen. Gemini Live for voice interaction.
HIPAA pathway: Google Workspace with Gemini is explicitly HIPAA-eligible. Google's HIPAA Included Functionality list covers Gemini in Workspace. Organizations that sign Google's Business Associate Amendment through the Admin Console can use these services with protected health information. This is currently the most straightforward consumer-tier HIPAA pathway.
Side-by-Side Comparison
| Factor | ChatGPT | Claude | Gemini |
|---|---|---|---|
| Developer | OpenAI (Microsoft-backed) | Anthropic (Amazon/Google-backed) | Google DeepMind |
| Consumer Price Entry | $20/month (Plus) | $20/month (Pro) | $19.99/month (AI Pro) |
| Premium Tier | $200/month (Pro) | $200/month (Max 20x) | $249.99/month (Ultra) |
| Context Window | 128K tokens | 200K tokens | 1-2M tokens |
| Native Image Gen | Yes (DALL-E) | No | Yes (Imagen) |
| Voice Mode | Advanced Voice Mode | Limited | Gemini Live |
| Ecosystem | GPT Store, plugins | MCP integrations | Google Workspace |
HIPAA/BAA Comparison
| Platform | Consumer Chat BAA | API BAA | Easiest Path |
|---|---|---|---|
| ChatGPT | Enterprise/Edu only | Yes, via application | Azure OpenAI Service |
| Claude | No | Yes, with ZDR agreement | AWS Bedrock |
| Gemini | Workspace integration covered | Yes via Vertex AI | Google Workspace + BAA |
The consumer chat interfaces most people use day-to-day (ChatGPT Plus, Claude Pro, standard Gemini) are not HIPAA-compliant and should never be used with PHI without additional safeguards.
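To make "additional safeguards" concrete: de-identification is much harder than it looks. The sketch below is the kind of naive scrubber people are tempted to write, and it catches only the most obvious identifiers (names, spelled-out dates, addresses, and indirect identifiers all slip through). It is illustrative only and nowhere near sufficient for HIPAA Safe Harbor de-identification; it is shown to explain why the BAA pathways above exist.

```python
# Illustrative only: a naive scrubber like this misses names, dates written as
# words, addresses, and countless indirect identifiers. It is NOT a substitute
# for proper de-identification or a BAA-covered pathway.
import re

PATTERNS = {
    "MRN": re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def naive_scrub(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = "Seen 3/14/2025, MRN: 0042317. Mom reachable at 555-867-5309."
print(naive_scrub(note))  # identifiers in other formats would pass straight through
```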
Just Pick One and Start
Here's the honest truth: for most clinical use cases, all three models are capable enough. The differences between them matter at the margins—and those margins shift with every model update anyway.
Don't overthink the choice. Pick based on:
- What you already have access to. Already in Google Workspace? Try Gemini. Have a ChatGPT account from when everyone was talking about it? Use that.
- What your colleagues use. There's value in being able to share prompts and tips with people doing similar work.
- Whichever interface you prefer. Seriously—if one feels better to use, that matters.
You can always switch later. You can use multiple models for different tasks. The skill you develop—learning to prompt effectively, knowing when to trust outputs, building useful workflows—transfers across all of them.
The next section gives you a practical plan for getting started. After that, we'll cover how to evaluate models with your own use cases over time.
Practical Guidance: Your First Month
Week 1: Establish a Baseline
Choose whichever model you have easiest access to. For your first week, use it for low-stakes tasks:
- Summarizing an article you're reading anyway
- Drafting an email you'll heavily edit
- Explaining a concept to yourself before explaining it to a patient
- Brainstorming questions for a meeting
Don't worry about optimization. Just get comfortable with the interaction pattern.
Week 2: Apply Prompting Principles
Revisit the prompting framework from Module 3 and apply it deliberately:
- Give the model a role ("You are helping me prepare for a difficult conversation with a parent")
- Provide context ("The patient is 8 years old with a new ADHD diagnosis")
- Be specific about format ("I need three main points, in language a non-medical parent would understand")
- Include constraints ("Avoid medical jargon; emphasize that this is manageable")
Notice how the quality of outputs changes as you prompt more skillfully.
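Here is what those four elements look like assembled into one prompt. This is a sketch: the clinical details are invented, and you would paste the final text into whichever chat interface you use.

```python
# Assembling a structured prompt from the four elements above. The clinical
# details are invented for illustration; paste the result into any chat interface.
role = "You are helping a pediatrician prepare for a difficult conversation with a parent."
context = "The patient is 8 years old with a new ADHD diagnosis; the parent is anxious about medication."
task = "Give me three main points to cover, in language a non-medical parent would understand."
constraints = "Avoid medical jargon. Emphasize that ADHD is manageable and that medication is one option among several."

prompt = "\n\n".join([role, context, task, constraints])
print(prompt)
```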
Week 3: Try Something Harder
Push into a task that actually matters:
- Draft a real (de-identified) clinical note
- Summarize a complex patient case for a referral
- Create patient education materials for a condition you see frequently
- Analyze a clinical guideline and identify key practice implications
Evaluate the output critically. What did it get right? Where did it need correction? What would you prompt differently next time?
Week 4: Compare
Now try a second model with a task you've done before. Use the same prompt and compare outputs. You'll develop intuition for the differences—and often find that your preference is less about the model and more about how you've learned to work with it.
Evaluate With Your Own Use Cases
Here's a truth that benchmark tables and feature comparisons can't capture: the only evaluation that matters is how a model performs on your actual work.
Build Your Personal Test Set
Create a small set of 3-5 tasks that represent your real work:
- A type of document you frequently draft (note, letter, summary)
- A question you commonly need to research or explain
- A complex case or scenario you've worked through before
- Something where you know what "good" looks like
Run these same prompts through different models. Compare the outputs. Which required less editing? Which understood your intent better? Which produced something you'd actually use?
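If you like structure, a personal test set can be as simple as a small file of prompts plus a checklist you fill in by hand. The sketch below is one way to organize it; the tasks and scoring criteria are examples to adapt, not a prescribed rubric, and it works whether you paste prompts into a chat interface or call an API.

```python
# A personal test set as plain data: run each prompt through the models you're
# comparing (chat interface or API), then record your own 1-5 ratings.
# The tasks and criteria below are examples; swap in your real work.
import json

TEST_SET = [
    {"id": "referral-summary", "prompt": "Summarize this de-identified case for a cardiology referral: ...",
     "criteria": ["accuracy", "brevity", "needed little editing"]},
    {"id": "parent-handout", "prompt": "Write a 5th-grade-level handout on febrile seizures: ...",
     "criteria": ["reading level", "tone", "no errors"]},
    {"id": "guideline-digest", "prompt": "List the practice-changing points in this guideline excerpt: ...",
     "criteria": ["completeness", "flagged uncertainty"]},
]

results = []
for task in TEST_SET:
    print(f"\n=== {task['id']} ===\n{task['prompt']}\n")
    for model in ["ChatGPT", "Claude", "Gemini"]:
        score = input(f"Your 1-5 rating for {model}: ")  # fill in after reviewing the output
        results.append({"task": task["id"], "model": model, "score": score})

with open("model_comparison.json", "w") as f:
    json.dump(results, f, indent=2)
```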
Re-Evaluate After Updates
This is crucial and often overlooked: a model that didn't work for you six months ago might be excellent now. And vice versa—a model you loved might change in ways that don't suit your workflow.
Each major model update (GPT-4 to GPT-4o, Claude 3 to Claude 4, etc.) can significantly change how the model handles specific tasks. Some examples:
- A clinical note format that one model version struggled with might work perfectly in the next
- A type of analysis that frustrated you might become seamless after an upgrade
- Conversely, a workflow you'd perfected might break when the model changes
When you see announcements about major model updates, revisit your test set. Don't assume your current choice is still the best choice—or that a model you dismissed is still inadequate.
Set a calendar reminder every 3-6 months to run your personal test set across the current versions of each model. This takes 30 minutes and ensures you're always using the best tool for your needs—not just the one you happened to start with.
The Cost Question
Let's be direct about money.
Free tiers are sufficient for exploration and light use. If you're using AI occasionally—a few times a week for non-critical tasks—you may never need to pay.
$20/month is the standard paid tier. ChatGPT Plus, Claude Pro, and Google AI Pro all cluster around this price point. At this tier, you get higher usage limits, access to the best models, and priority features. For a professional tool you might use daily, $20/month is modest—less than many software subscriptions with narrower utility. This is where most regular users land.
Premium tiers ($200-250/month) are for power users. If you're regularly hitting limits on the $20 tier, pushing complex coding projects, or need maximum model capability for professional work, the premium tiers exist, but most users won't need them. A sensible progression:
- Start with free tiers
- Upgrade to ~$20/month when you hit limits regularly
- Evaluate premium tiers only if you're a genuine power user
- Discuss organizational deployment with compliance and IT before rolling out team solutions
Understanding AI Benchmarks
This section explains benchmarks because you'll encounter them in AI discussions. But here's the key point upfront: benchmark scores tell you very little about how useful a model will be for your specific work. A 2% difference on a test doesn't translate to a meaningfully better tool for writing clinical notes or explaining diagnoses. Read this section for context, then focus on your own evaluation.
You'll often see AI companies touting benchmark scores when announcing new models. Headlines declare one model "beats" another on some test. But what do these numbers actually mean—and more importantly, what don't they mean?
What Benchmarks Measure
Benchmarks are standardized tests designed to evaluate specific AI capabilities. They provide a common yardstick for comparing models on tasks like:
- Knowledge recall: Can the model answer questions across academic domains?
- Reasoning: Can it work through multi-step logic problems?
- Coding: Can it write functional code that solves programming challenges?
- Domain expertise: How does it perform on professional exams (medical, legal, etc.)?
What Benchmarks Don't Measure
Here's what no benchmark captures—and what often matters most for real-world use:
- Writing quality: Does the output sound natural and require minimal editing?
- Appropriate uncertainty: Does the model say "I don't know" when it should?
- Instruction following: Does it do what you actually asked, or what it thinks you asked?
- Consistency: Does it give similar quality outputs across multiple attempts?
- Your specific use case: A model that aces coding benchmarks might produce worse clinical notes than one that scores lower.
A model scoring 90% vs 88% on a knowledge test is not meaningfully different for most practical purposes. What matters is whether the model helps you do your work better. The only benchmark that truly matters is your own experience using the tool.
How to Use Benchmark Data
- As a rough filter: If a model scores dramatically lower on everything, it's probably less capable overall.
- For specific tasks: If you primarily need coding help, look at coding benchmarks. For medical questions, look at MedQA scores.
- With skepticism: Companies often cherry-pick favorable benchmarks. Independent testing (like LMArena) provides a more balanced picture.
- As a starting point: Use benchmarks to decide which 2-3 models to try, then let your own experience guide your choice.
Key Benchmarks Explained
| Benchmark | What It Tests | Why It Matters |
|---|---|---|
| Humanity's Last Exam | 2,500 expert-level questions across all disciplines, designed to be genuinely difficult | The hardest general benchmark; tests limits of AI reasoning |
| GPQA Diamond | PhD-level science questions (physics, chemistry, biology) | Deep reasoning in scientific domains |
| MedQA | USMLE-style medical licensing questions | Medical knowledge directly relevant to clinical practice |
| SimpleQA | Short fact-seeking questions with single correct answers | Measures hallucination rate and factual accuracy |
| SimpleBench | Common-sense reasoning that humans find easy but AI finds hard | Tests practical reasoning vs pattern matching |
| LMArena Elo | Human preference ratings from blind head-to-head comparisons | Real users choosing preferred outputs—closest to "usefulness" |
| SWE-Bench | Fixing real bugs in actual open-source codebases | Real-world software engineering capability |
Benchmark Results (A Snapshot in Time)
Below are scores for flagship models as of late 2025. These numbers will be outdated quickly—new model versions release every few months, and the leaderboard constantly shifts. We include them to illustrate the general landscape, not to make a definitive ranking.
| Benchmark | GPT-5.1 | Claude Opus 4.5 | Gemini 3 Pro | Notes |
|---|---|---|---|---|
| Humanity's Last Exam | ~36% | ~35% | 45.8% | Gemini leads on hardest reasoning test |
| GPQA Diamond | 88.1% | 87.0% | 91.9% | Gemini ahead on PhD-level science |
| MedQA | ~96% | ~94% | ~93% | All far exceed passing threshold (~60%) |
| SimpleQA (Factuality) | ~63% | ~45% | 72.1% | Gemini leads on factual accuracy |
| LMArena Elo | ~1480 | ~1470 | 1501 | Gemini tops human preference ratings |
| SWE-Bench Verified | 76.3% | 80.9% | 76.2% | Claude leads on real-world coding |
The numbers in this table will shift with every model release. What matters is this: all three models score 93%+ on MedQA—far above the passing threshold for medical licensing exams. For the vast majority of clinical use cases, all three are capable enough. Don't choose based on benchmark margins. Choose based on ecosystem fit, pricing, and—most importantly—how well the model actually performs on your specific work.
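To put the LMArena numbers in perspective: Elo is a relative rating in which the gap between two scores maps to an expected head-to-head win rate. The quick calculation below, using the standard Elo formula and the approximate figures from the table, shows that a 20-30 point gap corresponds to roughly a 53-54% preference rate. Users picked the "winner" barely more than half the time.

```python
# Expected head-to-head win rate implied by an Elo gap (standard Elo formula).
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

# Using the approximate late-2025 LMArena figures from the table above:
print(round(expected_win_rate(1501, 1480), 3))  # ~0.53, a very slim edge
print(round(expected_win_rate(1501, 1470), 3))  # ~0.54
```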
Quick Reference: Getting Started
| | ChatGPT | Claude | Gemini |
|---|---|---|---|
| URL | chat.openai.com | claude.ai | gemini.google.com |
| Mobile | iOS and Android apps | iOS and Android apps | iOS and Android apps |
| Sign up | Google, Microsoft, or email | Google, email, or phone | Google account required |
| Best for | Broad capabilities, images, mature ecosystem | Nuanced writing, analysis, coding, long documents | Google Workspace, massive context, HIPAA via Workspace |
Action Items
Before moving to the next module, complete at least one of these:
- Create accounts on all three platforms (if you haven't already). Even if you end up preferring one, having tried the others gives you useful context.
- Run the same prompt through all three and compare outputs. Notice the differences in tone, organization, and approach.
- Try something you actually need: Draft a real email, summarize a real article, prepare for a real conversation. Experience how the tool performs on your actual work.
- Hit a limit and recover: Deliberately work on something complex enough that your first output isn't good. Practice iterative refinement until you're satisfied.
- If you're in a healthcare organization: Identify which HIPAA pathway makes sense for your situation and discuss it with appropriate stakeholders.
Summary
ChatGPT, Claude, and Gemini are all capable foundation models that can meaningfully assist clinical work. Each has different ecosystem integrations, pricing structures, and HIPAA pathways. ChatGPT has the largest user base and most extensive ecosystem. Gemini integrates deeply with Google Workspace and offers the most straightforward consumer HIPAA pathway. Claude offers the next-largest context window after Gemini, along with strong developer tools.
But here's what matters most: the choice between them matters far less than developing skill with whichever you choose. These are tools that respond to how you use them. A well-crafted prompt to any of these models will outperform a vague prompt to the "best" model.
And remember: your evaluation shouldn't be a one-time event. Models improve, workflows change, and what didn't work last year might work beautifully now. Build the habit of periodic re-evaluation with your own real-world test cases.
Think of them as brilliant but inexperienced colleagues: genuinely helpful, occasionally wrong, always requiring supervision. Pick one based on your ecosystem and access. Use it enough to develop skill. Periodically test alternatives with your own use cases. With that approach and the prompting skills from earlier modules, you're ready to begin. Now close this document and go have a conversation with one of them. That's where the real learning happens.
Learning Objectives
- Compare the capabilities, strengths, and limitations of ChatGPT, Claude, and Gemini
- Identify which model best fits specific clinical use cases
- Understand HIPAA compliance pathways for each platform
- Apply a practical framework for getting started with foundation models
- Recognize common pitfalls in AI adoption and strategies to avoid them
Notes

[1] Why not Grok? You may wonder why xAI's Grok isn't included here. While Grok has some capable underlying technology, we don't recommend it for clinical use due to significant concerns about accuracy and safety guardrails.
In early 2025, Grok generated and spread false information about prominent public figures, including fabricated claims about an NBA owner that were widely amplified on X (formerly Twitter). The system was also found to produce misleading election-related content and struggled with basic factual queries where other models performed reliably. Independent evaluations have noted that Grok's safety measures are notably weaker than those of ChatGPT, Claude, or Gemini—the model is more likely to generate harmful content when prompted.
For clinical applications where accuracy and appropriate guardrails matter, these issues are disqualifying. The three platforms covered in this module have demonstrated more robust approaches to safety and factual accuracy—though as we discuss throughout, all AI outputs require verification.
References:
- Newsweek. "Mark Cuban Confronts Elon Musk Using His Own AI Bot." 2025.
- Axios. "Musk's AI chatbot spread election misinformation, secretaries of state say." August 2024.
- Center for Countering Digital Hate. "Grok AI Election Disinformation." 2024.
- Palo Alto Networks Unit 42. "How Good Are the LLM Guardrails on the Market?" 2024.