FOUNDATIONS

The Data You Think Is Protected Isn't

Understanding PHI, HIPAA, and AI—what makes this different from every previous healthcare technology.

~25 min read · 7 readings
Core Question

Why does AI create genuinely new privacy challenges—not just faster versions of old problems—and how do you recognize the exposures that matter in clinical workflows?

The Ontario Hospital Breach

In September 2024, a physician at an Ontario hospital installed Otter.ai—an AI transcription tool—on his personal device. He'd left the hospital over a year earlier but was still on the invite list for weekly hepatology rounds. When the next meeting started, the Otter bot joined automatically, recorded physicians discussing seven patients by name, and emailed a transcript to 65 people—including 12 who no longer worked at the hospital.

No one in the meeting knew the bot was there. Patient names, diagnoses, and treatment details were now sitting in inboxes of people who had no business seeing them—and on Otter's servers, where the company's privacy policy allows use of recordings to train their AI models.

This incident captures why AI is genuinely different from previous healthcare technology. It's not just a faster fax machine or a better database. AI tools can act autonomously, join meetings without invitation, record without visible indication, and transmit data to external servers in ways that bypass every traditional safeguard. The regulatory framework—HIPAA, designed in 1996 for paper charts and fax machines—wasn't built for software that learns from every input and can take actions on its own.

This module will help you understand where the real risks are. We'll cover the basics of HIPAA and PHI, but the goal isn't comprehensive compliance training—it's developing intuition for the edge cases that matter when AI enters clinical workflows. You'll learn to recognize three categories of PHI exposure that escalate in subtlety: direct PHI (the obvious identifiers), indirect PHI (data that becomes identifying when combined), and shadow AI (the untracked tools that create compliance blind spots—like that Otter bot).

What Makes AI Different

Traditional healthcare IT systems—EHRs, billing systems, scheduling software—are essentially sophisticated databases. They store what you put in, retrieve what you ask for, and follow deterministic rules. If patient data leaks, it's usually because of a configuration error, a hack, or a human mistake. The system itself doesn't learn, doesn't act autonomously, and doesn't transmit data unless explicitly programmed to.

AI systems are fundamentally different in several ways that matter for privacy:

  • They learn from data: Your input may become training data—embedded in model weights, difficult to audit, impossible to fully "delete".
  • They act autonomously: AI tools can take actions based on triggers, without human initiation—like that Otter bot joining a meeting.
  • They transmit data externally: Most AI tools process data on external servers. The moment data leaves your network, you've lost direct control.
  • Their data flows are opaque: AI integrates with calendars, email, and other systems in ways that make data flows surprisingly complex.

The Velocity Problem

New AI tools appear weekly. Employees adopt them because they genuinely help—who wouldn't want to cut documentation time in half? But each tool potentially creates a new data flow, a new vendor relationship, and a new compliance gap. Traditional IT governance, with its months-long procurement cycles, can't keep pace. By the time a tool is formally evaluated, half the staff may already be using it.

HIPAA Fundamentals for AI

Who HIPAA Actually Covers (and Who It Doesn't)

HIPAA is an entity-based framework, not a data-based one. This distinction is critical. The law doesn't protect "health data"—it protects health data held by specific types of organizations.

  • Covered entities (health plans, clearinghouses, providers who transmit electronically): covered by HIPAA.
  • Business associates (AI vendors with a signed BAA): covered, via the BAA.
  • Consumer apps (health apps, fitness trackers, wearables): not covered.
  • AI tools without a BAA (consumer ChatGPT, most free AI tools): not covered.

The gap this creates: When a patient enters symptoms into a consumer health app, or a clinician pastes notes into ChatGPT without a BAA, that data isn't protected by HIPAA. The FTC has stepped into some of this gap with the Health Breach Notification Rule, but enforcement is inconsistent and the protections are narrower.

The Three HIPAA Rules That Matter for AI

Privacy Rule: Governs how PHI can be used and disclosed. PHI can generally only be used for treatment, payment, and healthcare operations without explicit patient authorization. Using PHI to train commercial AI models typically requires either authorization or de-identification.

Security Rule: Requires administrative, physical, and technical safeguards for electronic PHI. In January 2025, HHS proposed the first major update in 20 years, mandating encryption, multi-factor authentication, and 72-hour disaster recovery. AI systems processing ePHI will face these enhanced standards.

Breach Notification Rule: Requires notification within 60 days of discovering a breach of unsecured PHI. From 2018-2023, large breaches increased 102% and affected individuals increased 1,002%. An AI system that inadvertently exposes PHI triggers these requirements.

Direct PHI in AI Systems

Direct PHI includes the 18 categories of identifiers that HIPAA's Safe Harbor de-identification method requires you to remove. These are the obvious markers that link data to individuals.

The 18 Safe Harbor Identifiers

Identifier categories and where they surface in AI systems:

  • Names: may appear in transcription, NLP outputs, and training data.
  • Geographic data smaller than a state: ZIP codes in combined datasets; geolocation in app data.
  • Dates (except year) related to an individual: visit timestamps, DOB, admission/discharge dates.
  • Phone/fax numbers and email addresses: contact info in scheduling data and patient portals.
  • SSNs and medical record numbers: often embedded in EHR exports used for training.
  • Device identifiers and IP addresses: logged by AI systems, app analytics, and telehealth platforms.
  • URLs, biometric identifiers, and photos: CT/MRI reconstructions, voice prints, facial images.
  • Any unique identifying number or code: patient IDs, encounter numbers, prescription IDs.

Where Direct PHI Enters AI Systems

Training Data: When AI models are trained on clinical data, PHI may be embedded in the model weights themselves. This creates a form of data persistence that's difficult to audit and impossible to fully "delete." Model inversion attacks have demonstrated the ability to extract training data from some models.

Inference-Time Inputs: When clinicians paste patient notes into AI tools, that data is transmitted to external servers. Even if the vendor promises not to use it for training, it may be logged, cached, or retained for abuse monitoring. Most consumer AI tools retain data for 30+ days.

Generated Outputs: AI-generated clinical notes, summaries, and recommendations become PHI themselves. If an ambient scribe generates a SOAP note, that output requires the same protections as a manually-written note.
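
To make the inference-time risk above concrete, here is a minimal sketch, in Python, of a pre-submission check that flags obvious direct identifiers before any text is pasted into an external tool. The patterns and the flag_direct_phi helper are illustrative assumptions, not a compliance control: regular expressions catch formatting, not context, and will miss names, rare diagnoses, and the quasi-identifiers discussed later in this module.

```python
import re

# Illustrative patterns for a few of the 18 Safe Harbor identifier categories.
# Pattern matching is a guardrail, not de-identification: it misses names,
# free-text references, and every quasi-identifier discussed below.
DIRECT_PHI_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "mrn": re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "zip": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}

def flag_direct_phi(text: str) -> dict[str, list[str]]:
    """Return suspected direct identifiers found in text, grouped by category."""
    hits: dict[str, list[str]] = {}
    for category, pattern in DIRECT_PHI_PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[category] = found
    return hits

note = "67yo male, MRN: 00482913, seen 09/14/2024, call back at (416) 555-0199"
print(flag_direct_phi(note))  # flags the MRN, the visit date, and the phone number
```

A check like this belongs at the boundary, before text leaves the clinician's machine, because once the request is sent, the vendor's retention policy, not yours, governs what happens next.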

The BAA Requirement

Before using any AI tool with PHI, a Business Associate Agreement must be in place. The BAA establishes the vendor as a business associate, binding them to HIPAA requirements including safeguards, use restrictions, and breach notification.

Critical Point

A vendor claiming to be "HIPAA compliant" means nothing without a signed BAA. The FTC has taken enforcement actions against companies making false HIPAA compliance claims (GoodRx, BetterHelp). Always verify with documentation.


Indirect PHI and Re-identification Risk

This is where the regulatory framework shows its age. HIPAA's Safe Harbor method was designed before modern machine learning, before data brokers, before the explosion of auxiliary datasets that make re-identification increasingly feasible.

The Mosaic Effect

The mosaic effect describes how individually benign data points become identifying when combined. In 1997, researcher Latanya Sweeney demonstrated that 87% of Americans could be uniquely identified using just three data points: ZIP code, birth date, and gender. She famously re-identified Massachusetts Governor William Weld's medical records from "anonymized" hospital data by linking it to publicly available voter rolls.

For AI systems, this creates a fundamental tension. Machine learning thrives on rich, detailed data. De-identification that's sufficient to prevent re-identification often strips the clinical utility that makes the data valuable for training or analysis.

How Many Data Points Does It Take?

The research on re-identification is sobering. Multiple studies have consistently shown that 3-5 indirect identifiers are typically sufficient to re-identify individuals from medical records, especially when combined with publicly available datasets.

The Research Evidence
  • Sweeney (1997, 2000): Demonstrated that 87% of the U.S. population can be uniquely identified by just 3 variables—ZIP code, birth date, and gender. Using only publicly available voter registration data, she re-identified the Massachusetts Governor's medical records.
  • Golle (2006): Found that combining gender, ZIP code, and birth date uniquely identifies 63% of the U.S. population. Adding a fourth variable (like race or marital status) increases this substantially.
  • Narayanan & Shmatikov (2008): Re-identified Netflix users by combining just 2-8 movie ratings with timestamps against public IMDB reviews. The same technique applies to healthcare—sparse data points combined with auxiliary information.
  • El Emam et al. (2011): Systematic review of re-identification attacks found that records with as few as 3 quasi-identifiers were vulnerable, with success rates ranging from 10-35% depending on the external dataset used.
  • Rocher et al. (2019): Using machine learning on a dataset with just 15 demographic attributes, correctly re-identified 99.98% of Americans. Even incomplete datasets with fewer attributes achieved 83% accuracy with only 3-4 data points.
Visualizing the Mosaic Effect

Individual data points look safe; combined, they form a fingerprint. A ZIP code (02115), a birth date (Aug 4), and a gender (male) each seem harmless on their own, yet combined they can narrow the match to a single person.
The Bottom Line

The data consistently shows that 3-5 indirect identifiers are typically sufficient to re-identify individuals from medical records, especially when combined with publicly available datasets. This is why HIPAA's Safe Harbor method requires removing all 18 direct identifiers AND ensuring no actual knowledge exists that remaining information could identify individuals.

The implication for AI is clear: when you paste a clinical scenario into ChatGPT or any consumer AI tool, the combination of age, gender, diagnosis, medications, and timeline may be sufficient to identify your patient—even if you've removed their name.
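
One way to see the mosaic effect in your own data is to count how many records share each combination of quasi-identifiers (the "k" in k-anonymity). The sketch below is a minimal illustration in Python with pandas, using a tiny invented cohort and hypothetical column names; a group size of 1 means that combination is unique, and therefore re-identifiable by anyone holding a matching external dataset.

```python
import pandas as pd

# A tiny invented cohort; in practice this would be an extract with the
# 18 direct identifiers already removed.
records = pd.DataFrame([
    {"age": 67, "sex": "M", "zip3": "021", "diagnosis": "HCC"},
    {"age": 67, "sex": "M", "zip3": "021", "diagnosis": "HCC"},
    {"age": 67, "sex": "F", "zip3": "021", "diagnosis": "HCC"},
    {"age": 34, "sex": "F", "zip3": "021", "diagnosis": "NASH cirrhosis"},
    {"age": 34, "sex": "F", "zip3": "024", "diagnosis": "HCC"},
])

quasi_identifiers = ["age", "sex", "zip3", "diagnosis"]

# k = number of records sharing the same quasi-identifier combination.
k = records.groupby(quasi_identifiers).size().rename("k")
print(k.sort_values())

# Records with k == 1 are unique on these attributes alone; anyone holding an
# external dataset with the same attributes could single them out.
flagged = records.merge(k.reset_index(), on=quasi_identifiers)
print(flagged[flagged["k"] == 1])
```

Real assessments use far larger populations and formal methods such as Expert Determination, but the failure mode is the same: the more attributes an AI pipeline keeps, the smaller each group gets.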

Quasi-Identifiers in Healthcare Data

Even after removing the 18 Safe Harbor identifiers, healthcare data contains numerous quasi-identifiers that can enable re-identification when combined with external data sources: age, gender, race or ethnicity, approximate location, rare diagnoses, unusual medication combinations, procedure sequences, visit timing, and facility or provider details.

AI-Specific Re-identification Risks

Image Reconstruction: CT and MRI scans contain sufficient geometric information to reconstruct facial features. A "de-identified" head CT can potentially be matched to a photograph using 3D reconstruction techniques. AI dramatically improves the feasibility of such attacks.

Voice Prints: Voice recordings—including those captured by ambient scribes—are explicitly listed as biometric identifiers under HIPAA. Speaker identification models can match voices across recordings, potentially linking "anonymized" research data to identified individuals.

Training Data Extraction: Model inversion and membership inference attacks can potentially extract or verify the presence of specific records in training data. This transforms the model itself into a form of data storage that may be subject to HIPAA requirements.
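
For intuition about how a membership inference attack works, the toy sketch below (Python, scikit-learn, entirely synthetic data) exploits the fact that many models fit their training records more confidently than records they have never seen; an attacker who can query the model compares per-record confidence against a threshold. It is a simplified illustration of the idea, not a reproduction of any published attack, and the threshold value is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a clinical dataset: half used to train the model
# ("members"), half never seen by it ("non-members").
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_in, y_in, X_out, y_out = X[:1000], y[:1000], X[1000:], y[1000:]

model = RandomForestClassifier(random_state=0).fit(X_in, y_in)

def confidence_in_true_label(model, X, y):
    """Probability the model assigns to each record's true label."""
    proba = model.predict_proba(X)
    return proba[np.arange(len(y)), y]

conf_in = confidence_in_true_label(model, X_in, y_in)
conf_out = confidence_in_true_label(model, X_out, y_out)

# Naive attack: guess "this record was in the training set" when the model's
# confidence is suspiciously high. A gap between the two rates is the leakage.
threshold = 0.95
print(f"members flagged:     {(conf_in >= threshold).mean():.0%}")
print(f"non-members flagged: {(conf_out >= threshold).mean():.0%}")
```

Large foundation models are harder to probe this cleanly, but the concern is the one described above: the trained model itself can reveal whether a specific record contributed to it.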

The Safe Harbor List Is Outdated

The 18 Safe Harbor identifiers date from HIPAA's original Privacy Rule and haven't been updated since. They don't account for genomic data, free-text clinical narratives, data held by consumer apps and data brokers, or AI techniques, such as facial reconstruction from imaging and model inversion, that can re-identify individuals from data the list treats as safe.

Shadow AI and Stealth PHI Exposure

Shadow AI is the healthcare equivalent of shadow IT: employees using AI tools without organizational approval, oversight, or integration into compliance frameworks. It's the fastest-growing and least-controlled category of PHI exposure.

The Scale of the Problem

Research indicates that nearly 95% of healthcare organizations believe their staff are already using generative AI in email or content workflows, and 62% of leaders have directly observed employees using unsanctioned tools. Yet a quarter of organizations have not formally approved any AI use—meaning staff are acting without oversight, outside compliance frameworks, and without BAAs in place.

Shadow AI incidents account for 20% of AI-related security breaches—7 percentage points higher than incidents involving sanctioned AI.

The Traffic Pattern That Proves It

Want proof that AI is a work tool, not a toy? Similarweb data shows ChatGPT weekday usage is 50-60% higher than weekends. The pattern is so consistent it creates a "sawtooth" graph—traffic spikes Monday through Friday, then drops every Saturday and Sunday.

This mirrors what healthcare organizations are seeing: clinicians use AI during working hours, for work tasks, to solve work problems. They're not browsing ChatGPT for fun—they're using it to write notes, draft letters, and look up information. That means every weekday, AI tools are processing work content. The question is whether that content includes PHI, and whether the tools have BAAs.

Anatomy of the Ontario Hospital Breach

Let's return to the Otter.ai incident, because it illustrates almost every shadow AI failure mode:

  • Personal device, work data: The physician installed Otter on a personal device, using a personal email address that was still on the meeting invite list.
  • No offboarding process: The physician left in June 2023 but remained on meeting invites until the breach in September 2024, more than 15 months later.
  • Autonomous AI action: Otter's "notetaker bot" joined the meeting automatically based on the calendar invite; no one clicked anything.
  • No visibility: Participants didn't notice the bot until the emails went out, by which point PHI had already been transmitted to Otter's servers.
  • Incomplete remediation: Of 65 recipients, only 53 confirmed deletion, and the data remained on Otter's servers, available for potential model training.

Common Shadow AI Scenarios

Clinical Documentation: A physician pastes patient notes into ChatGPT to generate a summary or draft a letter. They've just transmitted PHI to a platform without a BAA, potentially violating HIPAA.

Administrative Tasks: A clinic manager uploads patient scheduling data to an AI tool for analysis. A billing specialist uses AI to help with denial appeals. Each creates an untracked data flow to unvetted platforms.

The Productivity Trap: A physician discovers ChatGPT can generate patient summaries in seconds. They start with de-identified summaries, then gradually include more context, then patient names "just this once" when running late. By the time anyone notices, months' worth of PHI has been transmitted.

Why Shadow AI Happens

  1. Productivity pressure: Clinicians are drowning in documentation. AI tools offer genuine time savings.
  2. Approval friction: Sanctioned AI tools require lengthy procurement. Free tools are available immediately.
  3. Awareness gaps: Many users don't understand that pasting text into an AI tool constitutes data transmission to an external server.
  4. Tool limitations: Approved tools may not meet user needs, pushing staff to seek alternatives.
  5. Embedded AI: AI features are increasingly embedded in approved tools (CRM systems, email clients) without separate vetting.

Governance Approaches

Industry experts advise against blanket bans—they don't work and push usage further underground. Instead:

  1. Provide alternatives: Deploy enterprise AI tools with BAAs, security controls, and logging. Make the approved path as convenient as the shadow path.
  2. Create fast-track approval: Establish a streamlined process for evaluating new AI tools. Reduce the friction that drives shadow usage.
  3. Educate continuously: Help staff understand why the restrictions exist and what's at stake. Focus on the "why" rather than just the "don't."
  4. Monitor adaptively: Use tools that can detect AI usage (a minimal log-scan sketch follows this list). Treat detections as opportunities for guidance, not punishment.
  5. Listen to shadow users: Shadow AI reveals unmet needs. Use it as free market research—what are people trying to accomplish?
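
As a sketch of what adaptive monitoring might look like, the Python snippet below scans an outbound proxy log for requests to a short watchlist of AI service domains and summarizes hits per user. The domain list, the CSV log format, and the summarize_ai_usage helper are all assumptions for illustration; real deployments would lean on the organization's secure web gateway or CASB rather than a script.

```python
import csv
from collections import Counter

# Hypothetical watchlist; in practice this comes from your gateway/CASB vendor
# and changes constantly as new tools appear.
AI_DOMAINS = {"chatgpt.com", "chat.openai.com", "otter.ai", "claude.ai", "gemini.google.com"}

def summarize_ai_usage(proxy_log_path: str) -> Counter:
    """Count requests per (user, AI domain) in a CSV proxy log that has
    'user' and 'host' columns. Assumed log format, for illustration only."""
    usage: Counter = Counter()
    with open(proxy_log_path, newline="") as f:
        for row in csv.DictReader(f):
            host = row["host"].strip().lower()
            if any(host == d or host.endswith("." + d) for d in AI_DOMAINS):
                usage[(row["user"], host)] += 1
    return usage

# Example (assuming a log file exists at this path):
# for (user, domain), hits in summarize_ai_usage("proxy_log.csv").most_common(10):
#     print(f"{user:<20} {domain:<25} {hits:>5} requests")
```

Treating the output as a conversation starter rather than an enforcement log is what keeps monitoring from driving usage further underground.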

What You Should Actually Do

Before Deploying Any AI System

  1. Map the data flows: Where does PHI enter the system? Where is it stored? Where does it go? Who has access?
  2. Verify BAA status: Is a Business Associate Agreement in place? What does it actually cover? Does it address AI-specific risks?
  3. Assess minimum necessary: Does the AI need all the data being provided? Can inputs be limited to what's actually required?
  4. Evaluate de-identification: If using de-identified data, which method was used? Has re-identification risk been assessed in light of AI capabilities?
  5. Review data retention: How long does the vendor retain inputs? Are they used for model training? What happens to audit logs?
  6. Plan for breach: If PHI is exposed through this system, what's the notification plan? Who's responsible for detection?
  7. Document everything: Risk assessments, vendor evaluations, configuration decisions, training records. The documentation is your defense (one way to capture it in a consistent, auditable form is sketched below).
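
To make that documentation habit concrete, here is a minimal sketch of capturing each pre-deployment review as a structured record that can be versioned and audited. The AIDeploymentAssessment fields mirror the checklist above; the names and example values are invented for illustration, and a real program would adapt them to its own risk-assessment templates.

```python
from dataclasses import dataclass, asdict
from datetime import date
import json

@dataclass
class AIDeploymentAssessment:
    """One record per AI tool, mirroring the pre-deployment checklist."""
    tool_name: str
    assessed_on: date
    data_flows: list[str]                # where PHI enters, is stored, and goes
    baa_signed: bool
    baa_covers_ai_specific_risks: bool
    minimum_necessary_reviewed: bool
    deidentification_method: str | None  # "safe_harbor", "expert_determination", or None
    vendor_retention_days: int | None
    inputs_used_for_training: bool
    breach_notification_plan: str
    notes: str = ""

assessment = AIDeploymentAssessment(
    tool_name="Ambient scribe (pilot)",
    assessed_on=date(2025, 3, 1),
    data_flows=["exam-room audio -> vendor cloud", "draft SOAP note -> EHR"],
    baa_signed=True,
    baa_covers_ai_specific_risks=False,
    minimum_necessary_reviewed=True,
    deidentification_method=None,
    vendor_retention_days=30,
    inputs_used_for_training=False,
    breach_notification_plan="Vendor notifies privacy officer within 72 hours",
    notes="Retention and training clauses still under legal review.",
)

# Serialize for the audit trail; the documentation is your defense.
print(json.dumps(asdict(assessment), default=str, indent=2))
```

Storing these records alongside vendor contracts and training logs gives you the documentation trail the checklist calls for.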

Vendor Evaluation Questions

When evaluating AI vendors for healthcare use, ask:

  • Will you sign a BAA, and does it cover the AI-specific ways the tool handles PHI?
  • Are customer inputs used to train or improve your models, and can that be disabled contractually?
  • How long are inputs, outputs, and logs retained, and where are they stored?
  • Is data encrypted in transit and at rest, and who on your side can access it?
  • How do you detect breaches, and how quickly will you notify us?
  • Which de-identification method, if any, do you rely on, and has re-identification risk been assessed?

The Bottom Line

We're operating in a gap between what AI can do and what regulators have addressed. HIPAA wasn't written for neural networks that learn from every input. The safest approach is conservative: treat AI systems that touch patient data as high-risk by default, require BAAs and security verification before deployment, and build governance structures that can adapt as both capabilities and regulations evolve.

Try This with NotebookLM

Upload this module's readings to a NotebookLM notebook and explore:

  • Ask it to summarize the key differences between Safe Harbor and Expert Determination de-identification
  • Generate a checklist for evaluating AI vendor compliance
  • Create an Audio Overview comparing the Ontario breach to other shadow AI scenarios

Remember: NotebookLM itself is grounded to your uploaded sources—a practical example of how constrained AI tools can be safer for sensitive content.

Readings

Prioritize This!
PubMed Central · A peer-reviewed look at AI specifically through the lens of learning. Warns of "automation bias"—where learners blindly trust AI over their own judgment. Argues that digital literacy should now be a core competency for medical students.
Medical Economics · The most practical guide available. Introduces the FAVES framework and the "AI Nutrition Label"—questions to ask vendors about training data and bias. Explains why a BAA is non-negotiable.
USC Sol Price School of Public Policy · Addresses the "shadow IT" problem. Explains why pasting patient data into a public model makes that data "training fodder," effectively publishing PHI to a third party.
Accountable HQ · Real-world scenarios of AI implementation. Demystifies "Encryption in Transit" vs. "Encryption at Rest" and shows why stripping names isn't enough for de-identification.
Mintz · Comprehensive overview of AI-HIPAA considerations including data privacy, FTC enforcement trends, and compliance strategies for healthcare organizations.
Foley & Lardner LLP · Tackles the "Black Box" problem from a compliance standpoint—how do you audit an algorithm you can't see inside? Clarifies the distinction between "treatment" and "research" when training AI models.
AMA Journal of Ethics · Argues that AI doesn't just introduce new risks but heightens existing ones. Discusses the "Consent to Data Repurposing" dilemma—where patient data collected for care is reused to train commercial algorithms.
Loyola Law Review · Rigorous legal analysis challenging whether HIPAA is sufficient for the AI age. Details limitations of de-identification when AI can re-identify individuals by cross-referencing non-health data.
PMC · Focuses on solutions: Federated Learning allows AI to train on patient data without that data ever leaving the hospital's servers. Widely considered the future of HIPAA-compliant AI development.
NCBI / NIH · Highlights "regulatory inconsistency" where AI development might fall outside HIPAA if performed by non-covered entities. Discusses how the "Public Health" exception is often stretched to justify AI surveillance.
Google · An interactive comic explaining how we can train models without moving patient data—key for healthcare privacy.
Latanya Sweeney (Carnegie Mellon) · The original paper proving 87% of Americans are identifiable via ZIP, DOB, and Gender.
Nature Communications · Why "de-identification" is mathematically difficult—99.98% of Americans can be re-identified with 15 attributes.
HHS.gov · The official Safe Harbor and Expert Determination standards.
IPC Ontario · Privacy-protective steps for use of AI and engagement of AI vendors in healthcare.
FTC · The rules that apply to non-HIPAA entities (like many AI health apps). Updated in 2024 to explicitly cover health apps.
NIST · The voluntary "gold standard" for governing AI risk.
PMC · A breakdown of why standard chatbots fail the BAA test.

Reflection Questions

  1. Think about AI tools you or your colleagues currently use. Do any of them involve patient data? Is there a BAA in place?
  2. If you discovered a colleague was using ChatGPT with patient notes to save time on documentation, how would you approach that conversation?
  3. Consider the mosaic effect: what combinations of data in your organization might allow re-identification even after Safe Harbor de-identification?
  4. How would you design an AI governance process that reduces shadow AI while supporting clinician productivity?

Learning Objectives

  • Explain why AI creates genuinely new privacy challenges compared to traditional healthcare IT
  • Identify who HIPAA covers and recognize the coverage gaps that affect AI tools
  • List the 18 Safe Harbor identifiers and explain their limitations in the AI era
  • Describe the mosaic effect and quasi-identifiers that enable re-identification
  • Recognize shadow AI patterns and explain why blanket bans are ineffective
  • Apply a practical checklist for evaluating AI vendors before PHI exposure