FOUNDATIONS

The Big Three

ChatGPT, Claude, and Gemini—your guide to today's leading foundation models and how to choose between them.

~30 min read · Practical guide
Core Question

Which foundation model should you use—and does it actually matter?

Starting Here? Read This First

If you've jumped straight to this module hoping to pick a model and get started, you're in good company—this is exactly what most people want. But foundation models are tools, and like any tool, their value depends on how skillfully you use them. Before you dive too deep here, consider at least skimming these foundational concepts from earlier modules:

Key Concepts from Earlier Modules

From Module 1 (How LLMs Think): These models work by predicting the most likely next word in a sequence, trained on enormous datasets of human-generated text. They don't "know" things the way you do—they recognize patterns. This matters because it explains both their remarkable capabilities and their characteristic failures.
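
To make "predicting the next word" concrete, here is a toy sketch. The tokens and probabilities below are invented for illustration; a real model computes a distribution over its entire vocabulary using a neural network:

    import random

    # Toy next-token distribution. In a real LLM these probabilities come
    # from a network scoring every token in its vocabulary given the context;
    # the words and numbers here are made up.
    next_token_probs = {
        "cough": 0.40,
        "fever": 0.35,
        "fatigue": 0.24,
        "banana": 0.01,  # low-probability continuations are unlikely, not impossible
    }

    def sample_next_token(probs):
        """Pick one token in proportion to its probability."""
        tokens, weights = zip(*probs.items())
        return random.choices(tokens, weights=weights, k=1)[0]

    context = "The patient presents with a persistent"
    print(context, sample_next_token(next_token_probs))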

From Module 2 (PHI and HIPAA): None of these consumer-facing chat interfaces are HIPAA-compliant out of the box. We'll discuss BAA pathways later in this module, but the critical principle remains: never enter PHI into a consumer AI product without proper safeguards.

From Module 3 (Prompting): The quality of your output depends enormously on the quality of your input. A vague prompt produces vague results. A well-structured prompt with context, role, and constraints produces dramatically better outputs. We'll reference prompting principles throughout this module—they apply equally to all three platforms.
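
To make that structure concrete, here is a minimal sketch in Python using a hypothetical clinic scenario (all details are invented); the assembled text can be pasted into any of the three chat interfaces:

    # A minimal sketch of the Module 3 structure: role, context, constraints.
    # Every clinic detail below is hypothetical; adapt them to your own task.
    role = "You are an experienced patient educator at a primary care clinic."
    context = ("We are introducing a new asthma action plan. "
               "Most parents read at roughly a 6th-grade level.")
    task = "Draft a one-page handout explaining how to use the plan at home."
    constraints = ("Plain language, under 250 words, friendly tone. "
                   "Do not include medication dosing.")

    prompt = f"{role}\n\nContext: {context}\n\nTask: {task}\n\nConstraints: {constraints}"
    print(prompt)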

Now, let's meet the models.[1]


The Med Student Analogy

Think of each foundation model as a brilliant medical student who has read essentially everything ever published—every textbook, every journal article, every clinical guideline, every case study, and frankly, every Reddit thread and random blog post too. This student has near-perfect recall of patterns across all that material and can synthesize information across domains in ways that would take you hours or days.

But here's what's crucial: this med student has never actually seen a patient. They haven't felt the resistance of tissue, watched a parent's face crumple at difficult news, or learned from the case that didn't follow the textbook. They know what clinical reasoning looks like on paper, but they don't have clinical judgment.

This framing helps calibrate expectations: expect encyclopedic recall and fast synthesis, but never clinical judgment. With that in mind, let's look at who these three "students" are and what each brings to your practice.


ChatGPT: The First Mover

The Story

OpenAI was founded in December 2015 by Sam Altman, Elon Musk, and others with the stated mission of developing artificial general intelligence that benefits humanity. The company initially operated as a nonprofit research lab before restructuring in 2019 to attract the investment needed for increasingly expensive AI training.

The GPT (Generative Pre-trained Transformer) architecture emerged from this research, with GPT-1 in 2018, GPT-2 in 2019 (initially withheld due to concerns about misuse), and GPT-3 in 2020. But the inflection point came on November 30, 2022, when OpenAI released ChatGPT as a free research preview. Within five days, it had a million users. Within two months, it reached 100 million—the fastest-growing consumer application in history.

That explosive growth fundamentally changed how the world understood AI. Suddenly, anyone could have a conversation with a system that felt like talking to a knowledgeable colleague. The technology wasn't new, but the accessibility was.

The Current Offering

As of late 2025, ChatGPT operates across several tiers:

ChatGPT Free ($0): Access to the current default model (a GPT-5-series model as of late 2025) with usage limits. Good for exploration and occasional use.

ChatGPT Plus ($20/month): Higher limits, priority access, reasoning models, DALL-E image generation, advanced voice mode.

ChatGPT Pro ($200/month): For power users. The most capable reasoning modes, extended features, essentially unlimited usage.

ChatGPT Team ($25-30/user/month): Collaborative workspace. Data not used for training. Still not HIPAA-compliant without additional measures.

Strengths

ChatGPT excels at breadth and general capability. Its training data is enormous and diverse, making it effective across an unusually wide range of tasks—from creative writing to coding to analysis. The ecosystem is mature, with a GPT Store containing millions of customized applications, extensive plugin support, and deep integrations with tools many people already use.

The voice and image capabilities are polished and genuinely useful. Advanced Voice Mode allows natural, conversational interaction that feels different from typing. Image generation through DALL-E has improved dramatically.

For coding assistance, ChatGPT remains strong, with Code Interpreter allowing it to execute Python code to analyze data, create visualizations, and iterate on complex problems.
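
To give a feel for what that means in practice, here is the kind of short script Code Interpreter might write and run when you upload a spreadsheet and ask for a summary chart. The file and column names are hypothetical:

    # Illustrative only: assumes an uploaded CSV named "clinic_visits.csv"
    # with "month" and "visits" columns.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("clinic_visits.csv")
    monthly = df.groupby("month")["visits"].sum()
    print(monthly.describe())                    # quick numeric summary

    monthly.plot(kind="bar", title="Visits per month")
    plt.tight_layout()
    plt.savefig("visits_per_month.png")          # chart returned to the user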

Limitations

ChatGPT has historically been more prone to sycophancy—telling users what they seem to want to hear rather than pushing back when appropriate. An April 2025 update was actually rolled back because the model had become excessively agreeable to the point of supporting clearly problematic ideas. OpenAI has worked to address this, but it remains a known tendency.

The HIPAA pathway is more complex than some alternatives. OpenAI offers BAAs for its API services; for ChatGPT itself, a BAA is available only through Enterprise or Edu plans with sales-managed accounts. ChatGPT Free, Plus, Pro, and even Team plans are explicitly not covered by BAAs and cannot be used with protected health information.


Claude: The Safety-First Approach

The Story

Anthropic was founded in 2021 by Dario and Daniela Amodei, along with several other former OpenAI researchers and executives. The founding team included key figures in AI safety research, and that orientation shaped the company's approach from the beginning.

The company developed "Constitutional AI," a training approach that uses AI feedback (rather than exclusively human feedback) to shape model behavior according to a set of principles. The goal was to create systems that are helpful but also harmless and honest—what Anthropic describes as the "HHH" framework.

Claude 1.0 launched in March 2023, positioning itself as a thoughtful alternative to ChatGPT. The Claude 3 family arrived in March 2024, introducing the Haiku/Sonnet/Opus tiering (small/medium/large models with different capability and cost profiles). By late 2025, Claude Opus 4.5 emerged as the latest iteration, with Anthropic positioning it as "the best model in the world for coding, agents, and computer use."

The Current Offering

Claude Free ($0): Access to Claude with usage limits that reset every few hours. Good for exploration.

Claude Pro ($20/month): ~5x free-tier usage, all models including Opus, priority access to new features.

Claude Max ($100-200/month): 5x-20x Pro limits. Designed for power users, especially Claude Code developers.

Claude Team ($25-30/user/month): Collaborative features, admin controls. Data not used for training. Minimum 5 seats.

Strengths

Claude's distinguishing characteristic is nuanced reasoning and writing quality. Users consistently report that Claude's outputs feel more thoughtful, with better handling of ambiguity and more natural prose. For drafting patient communications, clinical notes, or educational materials, many find Claude produces text that requires less editing.

The safety orientation manifests in several ways: Claude is less likely to generate harmful content, more likely to express appropriate uncertainty, and specifically trained to reduce behaviors like sycophancy, deception, and "power-seeking." For clinical applications where reliability matters, this orientation has value.

Long context handling is excellent. Claude supports 200,000-token context windows (roughly 150,000 words), allowing you to upload substantial documents—research papers, clinical protocols, patient histories—and have coherent conversations about them.
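
If you want to sanity-check whether a document will fit, a common rule of thumb (an assumption; exact counts vary by tokenizer) is roughly 1.33 tokens per English word, which is where the "200,000 tokens ≈ 150,000 words" estimate comes from. A minimal sketch:

    # Back-of-envelope check: will a document fit in a 200K-token window?
    # Assumes ~1.33 tokens per English word; real tokenizers vary.
    TOKENS_PER_WORD = 1.33

    def estimated_tokens(word_count):
        return word_count * TOKENS_PER_WORD

    for doc, words in [("journal article", 8_000),
                       ("clinical protocol", 40_000),
                       ("full textbook", 300_000)]:
        tokens = estimated_tokens(words)
        verdict = "fits" if tokens <= 200_000 else "too large"
        print(f"{doc}: ~{tokens:,.0f} tokens ({verdict})")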

Claude Code represents a genuine differentiator for anyone who works with code or technical systems. The ability to delegate coding tasks through natural language, with the model maintaining context across complex multi-step projects, has attracted significant developer adoption.

Limitations

Claude can be more conservative than some alternatives. The same safety orientation that prevents harmful outputs sometimes means Claude declines requests that other models would handle. This can occasionally feel like friction for benign use cases.

The ecosystem is less developed than ChatGPT's. There's no equivalent to the GPT Store, fewer integrations with third-party tools, and a smaller community creating resources and tutorials.

Image generation is not a native Claude capability. While Claude can analyze images you provide and create artifacts like code or documents, it doesn't generate images from scratch the way ChatGPT or Gemini can.


Gemini: The Integration Play

The Story

Google's path to Gemini began with earlier AI efforts including LaMDA (the model behind the original Bard chatbot) and PaLM (Pathways Language Model). But Google's extensive AI research, which predates the current generation by decades, positioned the company as a natural major player once the race began.

Gemini was announced in December 2023 as Google's response to GPT-4, with the company emphasizing its native multimodal training—the model was trained from the beginning on text, images, and other modalities together, rather than having capabilities bolted on afterward.

Gemini 1.5 arrived in February 2024 with a groundbreaking 1-million-token context window (later expanded to 2 million)—dramatically larger than competitors at the time. In November 2025, Google announced Gemini 3, positioned as "our most intelligent model" with significant improvements in reasoning, multimodal understanding, and agent capabilities.

The Current Offering

Gemini Free ($0): Access to Gemini 2.5 Flash, basic features, limited Deep Research access.

Google AI Pro ($19.99/month): Gemini 2.5 Pro, 1M-token context, Deep Research, Workspace integration, 2TB storage.

Google AI Ultra ($249.99/month): Gemini 3 Pro with Deep Think, highest limits, 30TB storage, Veo 3 video generation.

Strengths

The Google ecosystem integration is Gemini's defining advantage. If your organization lives in Google Workspace, Gemini works natively within Gmail, Docs, Sheets, Slides, and Meet. The side panel in Google Docs can help you write, summarize, and refine. Gemini in Gmail drafts responses and summarizes threads. For organizations already committed to Google, this reduces friction dramatically.

Multimodal capabilities are genuinely strong. Gemini's ability to process video (not just images) allows analysis of recorded content, extraction of key moments, and summarization of visual information in ways that currently exceed competitors.

The context window remains best-in-class at 1-2 million tokens, allowing you to work with enormous documents or entire codebases without hitting limits.

For healthcare specifically, Google Workspace with Gemini is HIPAA-eligible. Google's HIPAA Included Functionality list explicitly covers Gemini in Workspace. Organizations that sign Google's Business Associate Amendment through the Admin Console can use these services with protected health information.

Limitations

Gemini's full potential requires the Google ecosystem. If you don't use Google Workspace, many of the integration advantages disappear. For organizations on Microsoft 365 or other platforms, the standalone Gemini experience is less compelling.

The pricing structure is more complex and the highest tier ($249.99/month for Ultra) is significantly more expensive than competitors' premium offerings. While it includes substantial bundled value (storage, YouTube Premium, advanced video generation), the headline price may exceed what many users need.


Side-by-Side Comparison

Factor | ChatGPT | Claude | Gemini
Developer | OpenAI (Microsoft-backed) | Anthropic (Amazon/Google-backed) | Google DeepMind
Consumer Price Entry | $20/month (Plus) | $20/month (Pro) | $19.99/month (AI Pro)
Premium Tier | $200/month (Pro) | $200/month (Max 20x) | $249.99/month (Ultra)
Context Window | 128K tokens | 200K tokens | 1-2M tokens
Native Image Gen | Yes (DALL-E) | No | Yes (Imagen)
Voice Mode | Advanced Voice Mode | Limited | Gemini Live
Ecosystem | GPT Store, plugins | MCP integrations | Google Workspace
Writing Style | Engaging, occasionally verbose | Nuanced, thoughtful | Variable, improving

HIPAA/BAA Comparison

Platform | Consumer Chat BAA | API BAA | Easiest Path
ChatGPT | Enterprise/Edu only | Yes, via application | Azure OpenAI Service
Claude | No | Yes, with ZDR agreement | AWS Bedrock
Gemini | Workspace integration covered | Yes, via Vertex AI | Google Workspace + BAA

Critical Reminder

For the consumer chat interfaces most people use day-to-day (ChatGPT Plus, Claude Pro, standard Gemini), none are HIPAA-compliant and should never be used with PHI without additional safeguards.


What Each Model Does Best

Based on extensive user reports and benchmark data, here's where each model tends to excel:

ChatGPT Excels At

  • Breadth of capability: One tool that does many things well
  • Creative content: Marketing copy, engaging educational materials
  • Image generation: Visuals for presentations or patient education
  • Voice interaction: When you want to talk rather than type
  • Ecosystem integration: Custom GPTs and plugins
  • Step-by-step tutorials: Clear, accessible explanations

Claude Excels At

  • Nuanced writing: Patient communications, clinical notes
  • Complex analysis: Reasoning through ambiguous situations
  • Long document work: Summarizing papers, analyzing protocols
  • Coding and technical work: Especially via Claude Code
  • Safety-sensitive applications: Conservative guardrails
  • Academic writing: Papers, grants, formal documentation

Gemini Excels At

  • Google Workspace integration: AI woven throughout your tools
  • Video and multimodal analysis: Processing recorded content
  • Massive context: Entire books, codebases, extensive docs
  • Real-time assistance: Voice connected to calendar and email
  • HIPAA-eligible productivity: day-to-day workflow AI under a Workspace BAA
  • Data analysis: Spreadsheets and structured information

Practical Guidance: Just Start

The most important advice in this entire module is this: pick one and start using it.

The differences between these models matter at the margins. For the vast majority of tasks most clinicians will encounter, all three will produce useful results. The limiting factor isn't which model you choose—it's whether you've developed the skill to use it effectively.

Week 1: Establish a Baseline

Choose whichever model you have easiest access to. For your first week, use it for low-stakes tasks: summarizing an article, drafting a routine email, brainstorming ideas for a talk.

Don't worry about optimization. Just get comfortable with the interaction pattern.

Week 2: Apply Prompting Principles

Revisit the prompting framework from Module 3 and apply it deliberately: give the model context, assign it a role, and state your constraints and desired output format.

Notice how the quality of outputs changes as you prompt more skillfully.

Week 3: Try Something Harder

Push into a task that actually matters: a draft of a real patient communication, a summary of a paper you need to read, preparation for a difficult conversation.

Evaluate the output critically. What did it get right? Where did it need correction? What would you prompt differently next time?

Week 4: Compare

Now try a second model with a task you've done before. Use the same prompt and compare outputs. You'll develop intuition for the differences—and often find that your preference is less about the model and more about how you've learned to work with it.


The Cost Question

Let's be direct about money.

Free tiers are sufficient for exploration and light use. If you're using AI occasionally—a few times a week for non-critical tasks—you may never need to pay.

$20/month is the standard paid tier. ChatGPT Plus, Claude Pro, and Google AI Pro all cluster around this price point. At this tier, you get higher usage limits, access to the best models, and priority features. For a professional tool you might use daily, $20/month is modest—less than many software subscriptions with narrower utility. This is where most regular users land.

Premium tiers ($200-250/month) are for power users. If you're running into limits on the $20 tier, pushing complex coding projects, or need maximum model capabilities for professional work, the premium tiers exist. But most users won't need them.

Recommendation for Most Clinicians
  1. Start with free tiers
  2. Upgrade to ~$20/month when you hit limits regularly
  3. Evaluate premium tiers only if you're a genuine power user
  4. Discuss organizational deployment with compliance and IT before rolling out team solutions

Understanding AI Benchmarks

You'll often see AI companies touting benchmark scores when announcing new models. Headlines declare one model "beats" another on some test. But what do these numbers actually mean—and more importantly, what don't they mean?

What Benchmarks Measure

Benchmarks are standardized tests designed to evaluate specific AI capabilities. They provide a common yardstick for comparing models on tasks like answering expert-level science questions, passing medical licensing exams, recalling facts accurately, and fixing real bugs in software.

What Benchmarks Don't Measure

Here's what no benchmark captures, and what often matters most for real-world use: how a model handles your particular prompts, how well its tone fits your communications, how it behaves across a long working session, and whether it actually makes your day-to-day work better.

The Gap Between Benchmarks and Reality

A model scoring 90% vs 88% on a knowledge test is not meaningfully different for most practical purposes. What matters is whether the model helps you do your work better. The only benchmark that truly matters is your own experience using the tool.

How to Use Benchmark Data

  1. As a rough filter: If a model scores dramatically lower on everything, it's probably less capable overall.
  2. For specific tasks: If you primarily need coding help, look at coding benchmarks. For medical questions, look at MedQA scores.
  3. With skepticism: Companies often cherry-pick favorable benchmarks. Independent testing (like LMArena) provides more balanced pictures.
  4. As a starting point: Use benchmarks to decide which 2-3 models to try, then let your own experience guide your choice.

Key Benchmarks Explained

Benchmark | What It Tests | Why It Matters
Humanity's Last Exam | 2,500 expert-level questions across all disciplines, designed to be genuinely difficult | The hardest general benchmark; tests the limits of AI reasoning
GPQA Diamond | PhD-level science questions (physics, chemistry, biology) | Deep reasoning in scientific domains
MedQA | USMLE-style medical licensing questions | Medical knowledge directly relevant to clinical practice
SimpleQA | Short fact-seeking questions with single correct answers | Measures hallucination rate and factual accuracy
SimpleBench | Common-sense reasoning that humans find easy but AI finds hard | Tests practical reasoning vs. pattern matching
LMArena Elo | Human preference ratings from blind head-to-head comparisons | Real users choosing preferred outputs; closest to "usefulness"
SWE-Bench | Fixing real bugs in actual open-source codebases | Real-world software engineering capability

Current Benchmark Results (November 2025)

Below are scores for the current flagship models: OpenAI's GPT-5.1, Anthropic's Claude Opus 4.5, and Google's Gemini 3 Pro. These represent the state of the art as of late November 2025.

Benchmark | GPT-5.1 | Claude Opus 4.5 | Gemini 3 Pro | Notes
Humanity's Last Exam | ~36% | ~35% | 45.8% | Gemini leads on hardest reasoning test
GPQA Diamond | 88.1% | 87.0% | 91.9% | Gemini ahead on PhD-level science
MedQA | ~96% | ~94% | ~93% | All far exceed passing threshold (~60%)
SimpleQA (Factuality) | ~63% | ~45% | 72.1% | Gemini leads on factual accuracy
LMArena Elo | ~1480 | ~1470 | 1501 | Gemini tops human preference ratings
SWE-Bench Verified | 76.3% | 80.9% | 76.2% | Claude leads on real-world coding

The Bottom Line on Benchmarks

As of November 2025, Gemini 3 Pro leads on reasoning benchmarks (Humanity's Last Exam, GPQA) and factual accuracy (SimpleQA). Claude Opus 4.5 dominates real-world coding tasks (SWE-Bench). GPT-5.1 excels on medical knowledge (MedQA) and offers the most mature ecosystem. But here's what matters: all three score 93%+ on MedQA—far above the passing threshold. For clinical use, all are capable enough. Your choice should depend on ecosystem fit, pricing, and personal preference rather than benchmark margins.
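
Those Elo numbers translate directly into head-to-head win rates. A short sketch of the standard Elo expected-score formula (arena leaderboards are built on this family of rating models) shows how small the gap at the top really is:

    # Expected preference rate implied by an Elo rating gap.
    def expected_win_rate(r_a: float, r_b: float) -> float:
        return 1 / (1 + 10 ** ((r_b - r_a) / 400))

    # Illustrative ratings from the table above: Gemini 3 Pro ~1501, GPT-5.1 ~1480.
    print(f"{expected_win_rate(1501, 1480):.1%}")  # ~53.0%, barely better than a coin flip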


Common Pitfalls and How to Avoid Them

Treating Output as Truth
The models generate plausible text, not verified facts. Treat all outputs as first drafts requiring verification.
Assuming Privacy
Consumer AI tools aren't HIPAA-compliant. Establish clear rules about what can be entered into which tools.
Underusing the Tools
One bad result doesn't mean AI isn't useful. Commit to sustained experimentation over weeks, not minutes.
Overusing the Tools
Don't outsource judgment you should retain. Use AI to augment, not replace, your critical thinking.

Quick Reference: Getting Started

ChatGPT

URL: chat.openai.com

Mobile: iOS and Android apps

Sign up: Google, Microsoft, or email

Best for: Broad capabilities, images, mature ecosystem

Claude

URL: claude.ai

Mobile: iOS and Android apps

Sign up: Google, email, or phone

Best for: Nuanced writing, analysis, coding, long documents

Gemini

URL: gemini.google.com

Mobile: iOS and Android apps

Sign up: Google account required

Best for: Google Workspace, massive context, HIPAA via Workspace


Action Items

Before moving to the next module, complete at least one of these:

  1. Create accounts on all three platforms (if you haven't already). Even if you end up preferring one, having tried the others gives you useful context.
  2. Run the same prompt through all three and compare outputs. Notice the differences in tone, organization, and approach.
  3. Try something you actually need: Draft a real email, summarize a real article, prepare for a real conversation. Experience how the tool performs on your actual work.
  4. Hit a limit and recover: Deliberately work on something complex enough that your first output isn't good. Practice iterative refinement until you're satisfied.
  5. If you're in a healthcare organization: Identify which HIPAA pathway makes sense for your situation and discuss it with appropriate stakeholders.

Summary

ChatGPT, Claude, and Gemini are all capable foundation models that can meaningfully assist clinical work. ChatGPT offers the broadest ecosystem and most mature feature set. Claude provides nuanced reasoning and a safety-first approach. Gemini integrates deeply with Google Workspace and offers the easiest HIPAA pathway for organizations already in that ecosystem.

The choice between them matters less than developing skill with whichever you choose. These are tools that respond to how you use them. A well-crafted prompt to any of these models will outperform a vague prompt to the "best" model.

The Bottom Line

Think of them as brilliant but inexperienced colleagues: genuinely helpful, occasionally wrong, always requiring supervision. With that framing and the prompting skills from earlier modules, you're ready to begin. Now close this document and go have a conversation with one of them. That's where the real learning happens.

Learning Objectives

  • Compare the capabilities, strengths, and limitations of ChatGPT, Claude, and Gemini
  • Identify which model best fits specific clinical use cases
  • Understand HIPAA compliance pathways for each platform
  • Apply a practical framework for getting started with foundation models
  • Recognize common pitfalls in AI adoption and strategies to avoid them

Notes

  1. Why not Grok? You may wonder why xAI's Grok isn't included here. While Grok has some capable underlying technology, we don't recommend it for clinical use due to significant concerns about accuracy and safety guardrails.

    In early 2025, Grok generated and spread false information about prominent public figures, including fabricated claims about an NBA owner that were widely amplified on X (formerly Twitter). The system was also found to produce misleading election-related content and struggled with basic factual queries where other models performed reliably. Independent evaluations have noted that Grok's safety measures are notably weaker than those of ChatGPT, Claude, or Gemini—the model is more likely to generate harmful content when prompted.

    For clinical applications where accuracy and appropriate guardrails matter, these issues are disqualifying. The three platforms covered in this module have demonstrated more robust approaches to safety and factual accuracy—though as we discuss throughout, all AI outputs require verification.

    References:
    Newsweek. "Mark Cuban Confronts Elon Musk Using His Own AI Bot." 2025.
    Axios. "Musk's AI chatbot spread election misinformation, secretaries of state say." August 2024.
    Center for Countering Digital Hate. "Grok AI Election Disinformation." 2024.
    Palo Alto Networks Unit 42. "How Good Are the LLM Guardrails on the Market?" 2024.