FOUNDATIONS

The Big Three

ChatGPT, Claude, and Gemini—your guide to today's leading foundation models and how to choose between them.

~30 min read · Practical guide
Core Question

Which foundation model should you use—and does it actually matter?

Starting Here? Read This First

If you've jumped straight to this module hoping to pick a model and get started, you're in good company—this is exactly what most people want. But foundation models are tools, and like any tool, their value depends on how skillfully you use them. Before you dive too deep here, consider at least skimming these foundational concepts from earlier modules:

Key Concepts from Earlier Modules

From Module 1 (How LLMs Think): These models work by predicting the most likely next word in a sequence, trained on enormous datasets of human-generated text. They don't "know" things the way you do—they recognize patterns. This matters because it explains both their remarkable capabilities and their characteristic failures.

From Module 2 (PHI and HIPAA): None of these consumer-facing chat interfaces are HIPAA-compliant out of the box. We'll discuss BAA pathways later in this module, but the critical principle remains: never enter PHI into a consumer AI product without proper safeguards.

From Module 3 (Prompting): The quality of your output depends enormously on the quality of your input. A vague prompt produces vague results. A well-structured prompt with context, role, and constraints produces dramatically better outputs. We'll reference prompting principles throughout this module—they apply equally to all three platforms.
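
As a concrete illustration, here is a minimal sketch of that role/context/constraints structure using the OpenAI Python SDK. The model name, prompt wording, and reading-level target are placeholders chosen for this example, not recommendations, and the same pattern can be typed directly into any of the three chat interfaces without writing code.

  # A minimal sketch of the role / context / constraints pattern from Module 3,
  # written with the OpenAI Python SDK. Model name and wording are placeholders.
  from openai import OpenAI

  client = OpenAI()  # assumes an OPENAI_API_KEY environment variable is set

  response = client.chat.completions.create(
      model="gpt-4o",  # placeholder; use whatever model your plan provides
      messages=[
          # Role: who the model should act as
          {"role": "system",
           "content": "You are a physician writing patient education material "
                      "at roughly a 7th-grade reading level."},
          # Context and constraints: what you need, for whom, in what form
          {"role": "user",
           "content": "Explain what an HbA1c test measures for a patient newly "
                      "diagnosed with type 2 diabetes. Keep it under 150 words, "
                      "avoid jargon, and end with one question the patient could "
                      "ask at their next visit."},
      ],
  )
  print(response.choices[0].message.content)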

Now, let's meet the models.[1]


The Med Student Analogy

Think of each foundation model as a brilliant medical student who has read essentially everything ever published—every textbook, every journal article, every clinical guideline, every case study, and frankly, every Reddit thread and random blog post too. This student has near-perfect recall of patterns across all that material and can synthesize information across domains in ways that would take you hours or days.

But here's what's crucial: this med student has never actually seen a patient. They haven't felt the resistance of tissue, watched a parent's face crumple at difficult news, or learned from the case that didn't follow the textbook. They know what clinical reasoning looks like on paper, but they don't have clinical judgment.

This framing helps calibrate expectations: expect near-perfect recall and fast synthesis across an enormous body of material, but not clinical judgment.

With that framing in mind, let's look at who these three "students" are and what each brings to your practice.


Knowing Your Options

Before diving into each model, it's worth understanding the landscape: three major platforms (OpenAI's ChatGPT, Anthropic's Claude, and Google's Gemini), each with its own pricing tiers, ecosystem, and compliance pathways.

None of this tells you which is "best"—that depends entirely on your needs, your ecosystem, and your preferences. The point is simply: you have options, and it's worth knowing what they are.


ChatGPT: The First Mover

The Story

OpenAI was founded in December 2015 by Sam Altman, Elon Musk, and others with the stated mission of developing artificial general intelligence that benefits humanity. The company initially operated as a nonprofit research lab before restructuring in 2019 to attract the investment needed for increasingly expensive AI training.

The GPT (Generative Pre-trained Transformer) architecture emerged from this research, with GPT-1 in 2018, GPT-2 in 2019 (initially withheld due to concerns about misuse), and GPT-3 in 2020. But the inflection point came on November 30, 2022, when OpenAI released ChatGPT as a free research preview. Within five days, it had a million users. Within two months, it reached 100 million—the fastest-growing consumer application in history.

That explosive growth fundamentally changed how the world understood AI. Suddenly, anyone could have a conversation with a system that felt like talking to a knowledgeable colleague. The technology wasn't new, but the accessibility was.

The Current Offering

As of late 2025, ChatGPT operates across several tiers:

ChatGPT Free ($0): Access to GPT-4o with usage limits. Good for exploration and occasional use.

ChatGPT Plus ($20/month): Higher limits, priority access, o1 reasoning models, DALL-E, advanced voice mode.

ChatGPT Pro ($200/month): For power users. o1 pro mode, extended features, essentially unlimited usage.

ChatGPT Team ($25-30/user/month): Collaborative workspace. Data not used for training. Still not HIPAA-compliant without additional measures.

What You Get

Ecosystem: The most mature AI ecosystem. The GPT Store contains customized applications for specific use cases, extensive plugin support, and integrations with tools many people already use. If you want to find a pre-built solution for a specific task, ChatGPT's ecosystem is the most likely place to find it.

Features: Advanced Voice Mode for natural conversation, image generation through DALL-E, Code Interpreter for data analysis and visualization, and web browsing for current information.

HIPAA pathway: BAAs available for API services and Enterprise/Edu plans with sales-managed accounts. ChatGPT Free, Plus, Pro, and Team plans are explicitly not covered by BAAs and cannot be used with protected health information.
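
If your organization pursues the API route, the sketch below shows one way a BAA-covered pathway can look in practice, using Azure OpenAI Service (the path the comparison table later in this module lists as the most straightforward for ChatGPT). The endpoint, deployment name, and API version are placeholders, and code alone does nothing for compliance: a signed BAA, access controls, and organizational policy must be in place before any PHI is involved.

  import os
  from openai import AzureOpenAI  # the openai SDK's client for Azure-hosted deployments

  # Placeholders: your organization's Azure resource, deployment name, and API version.
  # A signed BAA and appropriate safeguards are prerequisites, not options.
  client = AzureOpenAI(
      azure_endpoint="https://<your-resource>.openai.azure.com",
      api_key=os.environ["AZURE_OPENAI_API_KEY"],
      api_version="2024-06-01",  # check Azure's documentation for the current version
  )

  response = client.chat.completions.create(
      model="your-gpt-4o-deployment",  # Azure routes by deployment name, not model name
      messages=[
          {"role": "user",
           "content": "Rewrite these discharge instructions at a 6th-grade "
                      "reading level: <draft text goes here>"},
      ],
  )
  print(response.choices[0].message.content)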


Claude: The Safety-First Approach

The Story

Anthropic was founded in 2021 by Dario and Daniela Amodei, along with several other former OpenAI researchers and executives. The founding team included key figures in AI safety research, and that orientation shaped the company's approach from the beginning.

The company developed "Constitutional AI," a training approach that uses AI feedback (rather than exclusively human feedback) to shape model behavior according to a set of principles. The goal was to create systems that are helpful but also harmless and honest—what Anthropic describes as the "HHH" framework.

Claude 1.0 launched in March 2023, positioning itself as a thoughtful alternative to ChatGPT. The Claude 3 family arrived in March 2024, introducing the Haiku/Sonnet/Opus tiering (small/medium/large models with different capability and cost profiles). By late 2025, Claude Opus 4.5 emerged as the latest iteration, with Anthropic positioning it as "the best model in the world for coding, agents, and computer use."

The Current Offering

Claude Free ($0): Access to Claude with usage limits that reset every few hours. Good for exploration.

Claude Pro ($20/month): ~5x usage of free tier, all models including Opus, priority access to new features.

Claude Max ($100-200/month): 5x-20x Pro limits. Designed for power users, especially Claude Code developers.

Claude Team ($25-30/user/month): Collaborative features, admin controls. Data not used for training. Min 5 seats.

What You Get

Ecosystem: Smaller than ChatGPT's but growing. The MCP (Model Context Protocol) allows integrations with external tools and data sources. Claude Code provides command-line AI assistance for developers. The ecosystem emphasizes depth over breadth.

Features: 200,000-token context window (roughly 150,000 words) for working with substantial documents. Projects feature for organizing related conversations. Artifacts for generating code, documents, and other outputs. No native image generation, but can analyze images you provide.

HIPAA pathway: BAAs available through the API with Zero Data Retention (ZDR) agreement. Consumer chat interface (Claude.ai) is not covered. AWS Bedrock provides the most straightforward enterprise pathway.
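
As a rough illustration of that Bedrock pathway, the sketch below calls a Claude model through AWS Bedrock with boto3. The region, model ID, and prompt are placeholders, and the same caveat applies: the code itself is not a compliance measure; the BAA, Zero Data Retention terms, and surrounding safeguards are.

  # A minimal sketch of calling Claude via AWS Bedrock. Region and model ID are placeholders.
  import json
  import boto3

  client = boto3.client("bedrock-runtime", region_name="us-east-1")  # placeholder region

  body = json.dumps({
      "anthropic_version": "bedrock-2023-05-31",
      "max_tokens": 512,
      "messages": [
          {"role": "user",
           "content": "Summarize the main points of this de-identified consult note: <text>"},
      ],
  })

  response = client.invoke_model(
      modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder model ID
      body=body,
      contentType="application/json",
      accept="application/json",
  )
  result = json.loads(response["body"].read())
  print(result["content"][0]["text"])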


Gemini: The Integration Play

The Story

Google's path to Gemini began with earlier AI efforts including LaMDA (the model behind the original Bard chatbot) and PaLM (Pathways Language Model). But Google's extensive AI research, which long predates the current generation of chatbots, positioned the company as a natural major player once the race began.

Gemini was announced in December 2023 as Google's response to GPT-4, with the company emphasizing its native multimodal training—the model was trained from the beginning on text, images, and other modalities together, rather than having capabilities bolted on afterward.

Gemini 1.5 arrived in February 2024 with a groundbreaking 1-million-token context window (later expanded to 2 million)—dramatically larger than competitors at the time. In November 2025, Google announced Gemini 3, positioned as "our most intelligent model" with significant improvements in reasoning, multimodal understanding, and agent capabilities.

The Current Offering

Gemini Free ($0): Access to Gemini 2.5 Flash, basic features, limited Deep Research access.

Google AI Pro ($19.99/month): Gemini 2.5 Pro, 1M-token context, Deep Research, Workspace integration, 2TB storage.

Google AI Ultra ($249.99/month): Gemini 3 Pro with Deep Think, highest limits, 30TB storage, Veo 3 video generation.

What You Get

Ecosystem: Deep Google Workspace integration. If your organization lives in Gmail, Docs, Sheets, and Meet, Gemini works natively within those tools—the side panel in Google Docs helps you write, Gemini in Gmail drafts responses and summarizes threads. For organizations already committed to Google, this reduces friction dramatically. If you don't use Google Workspace, much of this advantage disappears.

Features: 1-2 million token context window—the largest available—for working with enormous documents or entire codebases. Strong multimodal capabilities including video analysis. Image generation through Imagen. Gemini Live for voice interaction.

HIPAA pathway: Google Workspace with Gemini is explicitly HIPAA-eligible. Google's HIPAA Included Functionality list covers Gemini in Workspace. Organizations that sign Google's Business Associate Amendment through the Admin Console can use these services with protected health information. This is currently the most straightforward consumer-tier HIPAA pathway.


Side-by-Side Comparison

Factor | ChatGPT | Claude | Gemini
Developer | OpenAI (Microsoft-backed) | Anthropic (Amazon/Google-backed) | Google DeepMind
Consumer Price Entry | $20/month (Plus) | $20/month (Pro) | $19.99/month (AI Pro)
Premium Tier | $200/month (Pro) | $200/month (Max 20x) | $249.99/month (Ultra)
Context Window | 128K tokens | 200K tokens | 1-2M tokens
Native Image Gen | Yes (DALL-E) | No | Yes (Imagen)
Voice Mode | Advanced Voice Mode | Limited | Gemini Live
Ecosystem | GPT Store, plugins | MCP integrations | Google Workspace

HIPAA/BAA Comparison

Platform | Consumer Chat BAA | API BAA | Easiest Path
ChatGPT | Enterprise/Edu only | Yes, via application | Azure OpenAI Service
Claude | No | Yes, with ZDR agreement | AWS Bedrock
Gemini | Workspace integration covered | Yes, via Vertex AI | Google Workspace + BAA

Critical Reminder

For the consumer chat interfaces most people use day-to-day (ChatGPT Plus, Claude Pro, standard Gemini), none are HIPAA-compliant and should never be used with PHI without additional safeguards.


Just Pick One and Start

Here's the honest truth: for most clinical use cases, all three models are capable enough. The differences between them matter at the margins—and those margins shift with every model update anyway.

Don't overthink the choice. Pick based on which ecosystem you already work in, what you're willing to pay, and which platform you can access most easily.

You can always switch later. You can use multiple models for different tasks. The skill you develop—learning to prompt effectively, knowing when to trust outputs, building useful workflows—transfers across all of them.

The next section gives you a practical plan for getting started. After that, we'll cover how to evaluate models with your own use cases over time.


Practical Guidance: Your First Month

Week 1: Establish a Baseline

Choose whichever model you have easiest access to. For your first week, use it for low-stakes tasks: drafting routine emails, summarizing articles you're already reading, brainstorming ideas for a talk or meeting. Nothing involving PHI, and nothing where a mistake would matter.

Don't worry about optimization. Just get comfortable with the interaction pattern.

Week 2: Apply Prompting Principles

Revisit the prompting framework from Module 3 and apply it deliberately: give the model a role, supply relevant context, and state your constraints explicitly.

Notice how the quality of outputs changes as you prompt more skillfully.

Week 3: Try Something Harder

Push into a task that actually matters: drafting patient education material, summarizing a guideline you need to present, or preparing for a difficult conversation.

Evaluate the output critically. What did it get right? Where did it need correction? What would you prompt differently next time?

Week 4: Compare

Now try a second model with a task you've done before. Use the same prompt and compare outputs. You'll develop intuition for the differences—and often find that your preference is less about the model and more about how you've learned to work with it.


Evaluate With Your Own Use Cases

Here's a truth that benchmark tables and feature comparisons can't capture: the only evaluation that matters is how a model performs on your actual work.

Build Your Personal Test Set

Create a small set of 3-5 tasks that represent your real work: the kinds of documents you draft, the questions you field, the material you summarize. Save each as a prompt you can reuse verbatim.

Run these same prompts through different models. Compare the outputs. Which required less editing? Which understood your intent better? Which produced something you'd actually use?
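
If you or your IT team have API access, this comparison can also be scripted so that re-running your test set takes a few minutes rather than an afternoon. The sketch below assumes OpenAI and Anthropic API keys are set in the environment; the prompts and model names are placeholders, and test prompts should never contain PHI unless you are on a BAA-covered pathway.

  # A minimal sketch of running a personal test set against two providers.
  # Assumes OPENAI_API_KEY and ANTHROPIC_API_KEY are set; model names are placeholders.
  import anthropic
  from openai import OpenAI

  TEST_PROMPTS = [
      "Rewrite these discharge instructions at a 6th-grade reading level: <text>",
      "Summarize the key recommendations in this guideline excerpt: <text>",
      "Draft a referral letter for the following de-identified case: <text>",
  ]

  openai_client = OpenAI()
  claude_client = anthropic.Anthropic()

  def ask_openai(prompt: str) -> str:
      resp = openai_client.chat.completions.create(
          model="gpt-4o",  # placeholder
          messages=[{"role": "user", "content": prompt}],
      )
      return resp.choices[0].message.content

  def ask_claude(prompt: str) -> str:
      resp = claude_client.messages.create(
          model="claude-sonnet-4-5",  # placeholder
          max_tokens=1024,
          messages=[{"role": "user", "content": prompt}],
      )
      return resp.content[0].text

  for prompt in TEST_PROMPTS:
      print("PROMPT:", prompt)
      print("\n--- ChatGPT (API) ---\n", ask_openai(prompt))
      print("\n--- Claude (API) ---\n", ask_claude(prompt))
      print("=" * 70)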

Re-Evaluate After Updates

This is crucial and often overlooked: a model that didn't work for you six months ago might be excellent now. And vice versa—a model you loved might change in ways that don't suit your workflow.

Each major model update (GPT-4 to GPT-4o, Claude 3 to Claude 4, etc.) can significantly change how the model handles specific tasks.

When you see announcements about major model updates, revisit your test set. Don't assume your current choice is still the best choice—or that a model you dismissed is still inadequate.

The Ongoing Evaluation Habit

Set a calendar reminder every 3-6 months to run your personal test set across the current versions of each model. This takes 30 minutes and ensures you're always using the best tool for your needs—not just the one you happened to start with.


The Cost Question

Let's be direct about money.

Free tiers are sufficient for exploration and light use. If you're using AI occasionally—a few times a week for non-critical tasks—you may never need to pay.

$20/month is the standard paid tier. ChatGPT Plus, Claude Pro, and Google AI Pro all cluster around this price point. At this tier, you get higher usage limits, access to the best models, and priority features. For a professional tool you might use daily, $20/month is modest—less than many software subscriptions with narrower utility. This is where most regular users land.

Premium tiers ($200-250/month) are for power users. If you're running into limits on the $20 tier, pushing complex coding projects, or need maximum model capabilities for professional work, the premium tiers exist. But most users won't need them.

Recommendation for Most Clinicians
  1. Start with free tiers
  2. Upgrade to ~$20/month when you hit limits regularly
  3. Evaluate premium tiers only if you're a genuine power user
  4. Discuss organizational deployment with compliance and IT before rolling out team solutions

Understanding AI Benchmarks

Benchmarks Have Limited Value

This section explains benchmarks because you'll encounter them in AI discussions. But here's the key point upfront: benchmark scores tell you very little about how useful a model will be for your specific work. A 2% difference on a test doesn't translate to a meaningfully better tool for writing clinical notes or explaining diagnoses. Read this section for context, then focus on your own evaluation.

You'll often see AI companies touting benchmark scores when announcing new models. Headlines declare one model "beats" another on some test. But what do these numbers actually mean—and more importantly, what don't they mean?

What Benchmarks Measure

Benchmarks are standardized tests designed to evaluate specific AI capabilities. They provide a common yardstick for comparing models on tasks like expert-level exam questions, scientific and medical reasoning, factual recall, and real-world coding (the benchmark table below describes the major ones).

What Benchmarks Don't Measure

Here's what no benchmark captures, and what often matters most for real-world use: how well a model understands your intent, how much editing its outputs need, how it fits into your workflow, and whether it produces something you'd actually use.

The Gap Between Benchmarks and Reality

A model scoring 90% vs 88% on a knowledge test is not meaningfully different for most practical purposes. What matters is whether the model helps you do your work better. The only benchmark that truly matters is your own experience using the tool.

How to Use Benchmark Data

  1. As a rough filter: If a model scores dramatically lower on everything, it's probably less capable overall.
  2. For specific tasks: If you primarily need coding help, look at coding benchmarks. For medical questions, look at MedQA scores.
  3. With skepticism: Companies often cherry-pick favorable benchmarks. Independent testing (like LMArena) provides more balanced pictures.
  4. As a starting point: Use benchmarks to decide which 2-3 models to try, then let your own experience guide your choice.

Key Benchmarks Explained

Benchmark | What It Tests | Why It Matters
Humanity's Last Exam | 2,500 expert-level questions across all disciplines, designed to be genuinely difficult | The hardest general benchmark; tests limits of AI reasoning
GPQA Diamond | PhD-level science questions (physics, chemistry, biology) | Deep reasoning in scientific domains
MedQA | USMLE-style medical licensing questions | Medical knowledge directly relevant to clinical practice
SimpleQA | Short fact-seeking questions with single correct answers | Measures hallucination rate and factual accuracy
SimpleBench | Common-sense reasoning that humans find easy but AI finds hard | Tests practical reasoning vs pattern matching
LMArena Elo | Human preference ratings from blind head-to-head comparisons | Real users choosing preferred outputs—closest to "usefulness"
SWE-Bench | Fixing real bugs in actual open-source codebases | Real-world software engineering capability

Benchmark Results (A Snapshot in Time)

Below are scores for flagship models as of late 2025. These numbers will be outdated quickly—new model versions release every few months, and the leaderboard constantly shifts. We include them to illustrate the general landscape, not to make a definitive ranking.

Benchmark | GPT-5.1 | Claude Opus 4.5 | Gemini 3 Pro | Notes
Humanity's Last Exam | ~36% | ~35% | 45.8% | Gemini leads on hardest reasoning test
GPQA Diamond | 88.1% | 87.0% | 91.9% | Gemini ahead on PhD-level science
MedQA | ~96% | ~94% | ~93% | All far exceed passing threshold (~60%)
SimpleQA (Factuality) | ~63% | ~45% | 72.1% | Gemini leads on factual accuracy
LMArena Elo | ~1480 | ~1470 | 1501 | Gemini tops human preference ratings
SWE-Bench Verified | 76.3% | 80.9% | 76.2% | Claude leads on real-world coding

The Bottom Line on Benchmarks

The numbers in this table will shift with every model release. What matters is this: all three models score 93%+ on MedQA—far above the passing threshold for medical licensing exams. For the vast majority of clinical use cases, all three are capable enough. Don't choose based on benchmark margins. Choose based on ecosystem fit, pricing, and—most importantly—how well the model actually performs on your specific work.


Common Pitfalls and How to Avoid Them

Treating Output as Truth: The models generate plausible text, not verified facts. Treat all outputs as first drafts requiring verification.

Assuming Privacy: Consumer AI tools aren't HIPAA-compliant. Establish clear rules about what can be entered into which tools.

Underusing the Tools: One bad result doesn't mean AI isn't useful. Commit to sustained experimentation over weeks, not minutes.

Overusing the Tools: Don't outsource judgment you should retain. Use AI to augment, not replace, your critical thinking.

Quick Reference: Getting Started

ChatGPT

URL: chat.openai.com

Mobile: iOS and Android apps

Sign up: Google, Microsoft, or email

Best for: Broad capabilities, images, mature ecosystem

Claude

URL: claude.ai

Mobile: iOS and Android apps

Sign up: Google, email, or phone

Best for: Nuanced writing, analysis, coding, long documents

Gemini

URL: gemini.google.com

Mobile: iOS and Android apps

Sign up: Google account required

Best for: Google Workspace, massive context, HIPAA via Workspace


Action Items

Before moving to the next module, complete at least one of these:

  1. Create accounts on all three platforms (if you haven't already). Even if you end up preferring one, having tried the others gives you useful context.
  2. Run the same prompt through all three and compare outputs. Notice the differences in tone, organization, and approach.
  3. Try something you actually need: Draft a real email, summarize a real article, prepare for a real conversation. Experience how the tool performs on your actual work.
  4. Hit a limit and recover: Deliberately work on something complex enough that your first output isn't good. Practice iterative refinement until you're satisfied.
  5. If you're in a healthcare organization: Identify which HIPAA pathway makes sense for your situation and discuss it with appropriate stakeholders.

Summary

ChatGPT, Claude, and Gemini are all capable foundation models that can meaningfully assist clinical work. Each has different ecosystem integrations, pricing structures, and HIPAA pathways. ChatGPT has the largest user base and most extensive ecosystem. Gemini integrates deeply with Google Workspace and offers the most straightforward consumer HIPAA pathway. Claude offers the largest context window after Gemini's, along with strong developer tools.

But here's what matters most: the choice between them matters far less than developing skill with whichever you choose. These are tools that respond to how you use them. A well-crafted prompt to any of these models will outperform a vague prompt to the "best" model.

And remember: your evaluation shouldn't be a one-time event. Models improve, workflows change, and what didn't work last year might work beautifully now. Build the habit of periodic re-evaluation with your own real-world test cases.

The Bottom Line

Think of them as brilliant but inexperienced colleagues: genuinely helpful, occasionally wrong, always requiring supervision. Pick one based on your ecosystem and access. Use it enough to develop skill. Periodically test alternatives with your own use cases. With that approach and the prompting skills from earlier modules, you're ready to begin. Now close this document and go have a conversation with one of them. That's where the real learning happens.

Learning Objectives

  • Compare the capabilities, strengths, and limitations of ChatGPT, Claude, and Gemini
  • Identify which model best fits specific clinical use cases
  • Understand HIPAA compliance pathways for each platform
  • Apply a practical framework for getting started with foundation models
  • Recognize common pitfalls in AI adoption and strategies to avoid them

Notes

  1. Why not Grok? You may wonder why xAI's Grok isn't included here. While Grok has some capable underlying technology, we don't recommend it for clinical use due to significant concerns about accuracy and safety guardrails.

    In early 2025, Grok generated and spread false information about prominent public figures, including fabricated claims about an NBA owner that were widely amplified on X (formerly Twitter). The system was also found to produce misleading election-related content and struggled with basic factual queries where other models performed reliably. Independent evaluations have noted that Grok's safety measures are notably weaker than those of ChatGPT, Claude, or Gemini—the model is more likely to generate harmful content when prompted.

    For clinical applications where accuracy and appropriate guardrails matter, these issues are disqualifying. The three platforms covered in this module have demonstrated more robust approaches to safety and factual accuracy—though as we discuss throughout, all AI outputs require verification.

    References:
    Newsweek. "Mark Cuban Confronts Elon Musk Using His Own AI Bot." 2025.
    Axios. "Musk's AI chatbot spread election misinformation, secretaries of state say." August 2024.
    Center for Countering Digital Hate. "Grok AI Election Disinformation." 2024.
    Palo Alto Networks Unit 42. "How Good Are the LLM Guardrails on the Market?" 2024.