Bias, Ethics, and the Training Data Problem
AI doesn't introduce bias into healthcare—it inherits, encodes, and scales the biases already embedded in our data and our systems.
If AI learns from data that reflects decades of healthcare disparities, how do we prevent it from perpetuating—or amplifying—those inequities?
The Mirror Problem
Here's an uncomfortable truth: AI systems are mirrors. They reflect back what we show them. If the data they learn from contains bias, the AI will learn that bias. If historical medical decisions disadvantaged certain populations, AI trained on those decisions will learn to disadvantage them too.
This isn't a bug—it's how machine learning fundamentally works. The algorithm's job is to find patterns in data and replicate them. It has no concept of fairness, justice, or ethics. It simply learns: given inputs like this, outputs like that have occurred before. If "inputs like this" correlate with race, gender, or socioeconomic status in the training data, the model learns those correlations too.
The old computing adage applies with new force: garbage in, garbage out. But in healthcare AI, the stakes are higher than bad spreadsheet outputs. We're talking about who gets referred for specialist care, who gets flagged as high-risk, whose symptoms get taken seriously, and whose get dismissed.
Algorithmic bias is not fundamentally an AI problem—it's a human one. AI systems encode the decisions we've already made, the data we've already collected, and the priorities we've already established. The algorithm doesn't create inequity; it operationalizes inequity at scale.
This is why technical fixes alone aren't enough. Addressing algorithmic bias requires confronting the structural inequities in healthcare that generated the biased data in the first place.
The Obermeyer Study: A Case Study in Systemic Bias
In 2019, a team led by Ziad Obermeyer at UC Berkeley published what would become the landmark paper on algorithmic bias in healthcare. The study examined a widely used commercial algorithm that helped hospitals identify patients who would benefit from "high-risk care management" programs—extra nursing support, closer monitoring, more primary care visits.
Algorithms of this type were applied to roughly 200 million people in the United States each year. The tool seemed to work well by standard metrics. But when Obermeyer's team broke the results down by race, they found something disturbing.
At any given risk score, Black patients were significantly sicker than white patients with the same score. The algorithm systematically underestimated the health needs of Black patients.
The impact: Fixing this disparity would increase the percentage of Black patients receiving additional help from 17.7% to 46.5%—nearly tripling access to care management programs.
How Did This Happen?
The algorithm's designers made a seemingly reasonable choice: they used healthcare costs as a proxy for healthcare needs. Patients who cost more to treat were presumably sicker and needed more intervention. This made intuitive sense and was easy to measure.
But here's the problem: healthcare costs don't just reflect health needs. They also reflect access to care. Black patients in America, due to systemic barriers—insurance disparities, geographic access, implicit bias in clinical encounters, historical mistrust of medical institutions—tend to receive less healthcare than white patients at equivalent levels of illness.
So the algorithm learned: Black patients = lower costs = lower risk scores = less need for intervention. The model wasn't racist in any intentional sense. It simply learned the pattern in the data. But that pattern encoded decades of healthcare inequity.
When the researchers reformulated the algorithm to predict actual health needs rather than costs, racial bias in outcomes dropped by 84%. The algorithm manufacturer worked with the researchers to implement changes. This demonstrates that bias, once identified, can often be mitigated—but it requires actively looking for it.
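To make the mechanism concrete, here is a minimal sketch using entirely synthetic data (not the commercial algorithm or its real inputs): when observed cost is the prediction target, a group whose care is under-documented is systematically under-flagged, even though its underlying illness burden is identical.

```python
# Minimal synthetic illustration (not the commercial algorithm): how choosing
# "cost" vs. a direct health measure as the target changes who gets flagged
# for care management. All numbers and variable names are invented.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# True illness burden (e.g., count of active chronic conditions); the same
# distribution in both groups in this toy setup.
group = rng.choice(["A", "B"], size=n)          # group "B" faces access barriers
illness = rng.poisson(lam=3.0, size=n)

# Observed cost reflects illness *and* access to care: group B generates less
# cost at the same level of illness (insurance, geography, clinical bias, etc.).
access = np.where(group == "B", 0.7, 1.0)
cost = illness * access * 1_000 + rng.normal(0, 500, size=n)

def flagged_share(score, top_frac=0.03):
    """Share of each group among the top `top_frac` highest scores."""
    cutoff = np.quantile(score, 1 - top_frac)
    flagged = score >= cutoff
    return {g: round(float(np.mean(group[flagged] == g)), 3) for g in ("A", "B")}

print("Flagging on cost:   ", flagged_share(cost))     # group B under-selected
print("Flagging on illness:", flagged_share(illness))  # roughly proportional
```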
Where Bias Enters AI Systems
The Obermeyer study illustrates one pathway for bias, but there are many others. Understanding where bias enters helps you spot it—and advocate for mitigation.
1. Training Data Composition
AI models learn from data, and that data comes from somewhere. In healthcare:
- Over half of published clinical AI models use data from just two countries—the United States and China
- Many datasets overrepresent non-Hispanic white patients relative to actual population demographics
- Social determinants of health data are largely absent—only about 3% of clinical documentation contains any SDOH information
- Rare diseases and conditions affecting smaller populations are underrepresented, leading to worse model performance for these groups
When an algorithm is trained on imbalanced data, it performs worse for underrepresented groups. This isn't theoretical—it's been demonstrated across multiple domains.
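One practical first step is simply to audit who is in the training set. The sketch below assumes a hypothetical cohort file and column name, and the reference shares are placeholders; the point is the comparison, not the numbers.

```python
# Sketch of a simple representation audit: compare the demographic makeup of a
# training set against a reference population. The file name, column name, and
# reference shares are placeholders, not real statistics.
import pandas as pd

train = pd.read_csv("training_cohort.csv")      # hypothetical extract

reference_shares = {                             # placeholder census-style shares
    "White": 0.58, "Black": 0.14, "Hispanic": 0.19, "Asian": 0.06, "Other": 0.03,
}

observed = train["race_ethnicity"].value_counts(normalize=True)
audit = pd.DataFrame({
    "observed_share": observed,
    "reference_share": pd.Series(reference_shares),
})
# A ratio below 1 means the group is underrepresented relative to the reference.
audit["representation_ratio"] = audit["observed_share"] / audit["reference_share"]
print(audit.sort_values("representation_ratio"))
```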
2. The Dermatology Problem
Dermatology AI offers a vivid example. Studies have consistently shown that skin lesion classification algorithms perform significantly worse on darker skin tones. Why? The training data.
An analysis of dermatology educational materials—textbooks, lecture notes, journal articles—found that only 1 in 10 images showed skin in the black-brown range on the Fitzpatrick Scale.
A 2024 Northwestern study found that dermatologists accurately diagnosed about 38% of images overall, but only 34% of images showing darker skin. AI assistance improved accuracy, but the improvement was greater for lighter skin—the disparity persisted.
When AI image generators were asked to create medical images, Adobe Firefly showed 38% dark skin representation, but ChatGPT-4o showed only 6%, Stable Diffusion 8.7%, and Midjourney just 3.9%.
The bias compounds: medical education underrepresents darker skin → clinicians are less experienced with these presentations → diagnostic algorithms are trained on biased datasets → AI perpetuates and potentially amplifies the disparity.
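The practical lesson for evaluation is that a single overall accuracy number can hide exactly this kind of gap. A minimal sketch, assuming you already have test-set labels, predictions, and Fitzpatrick-type annotations:

```python
# Sketch: report accuracy per Fitzpatrick skin-type group rather than a single
# overall number. `y_true`, `y_pred`, and `fitzpatrick` are assumed arrays from
# evaluating a lesion classifier on an annotated test set.
import numpy as np

def stratified_accuracy(y_true, y_pred, fitzpatrick):
    y_true, y_pred, fitzpatrick = map(np.asarray, (y_true, y_pred, fitzpatrick))
    correct = y_true == y_pred
    overall = float(np.mean(correct))
    per_group = {
        ftype: float(np.mean(correct[fitzpatrick == ftype]))
        for ftype in np.unique(fitzpatrick)
    }
    return overall, per_group

# Toy labels only, grouped into lighter (I-III) and darker (IV-VI) buckets.
overall, per_group = stratified_accuracy(
    y_true=[1, 0, 1, 1, 0, 1],
    y_pred=[1, 0, 1, 0, 1, 0],
    fitzpatrick=["I-III", "I-III", "I-III", "IV-VI", "IV-VI", "IV-VI"],
)
print(overall, per_group)   # the single average hides the per-group gap
```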
3. Proxy Variables and Hidden Correlations
Even when you remove explicit demographic variables, bias can sneak in through proxies. ZIP code correlates with race and socioeconomic status. Insurance type correlates with employment and income. Medication lists correlate with access to care. The algorithm doesn't need to "see" race to learn racial patterns.
This is why the "just remove race from the algorithm" approach often fails. The correlations remain embedded in other variables. True debiasing requires understanding the causal structure of the data, not just removing surface-level features.
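One rough way to test for proxies is to ask whether the supposedly race-blind features can predict race at all. The sketch below assumes a hypothetical cohort extract with placeholder column names and uses scikit-learn; it is a screening check, not a full causal analysis.

```python
# Sketch of a proxy check: if the "race-blind" features predict race well above
# chance, a model trained on them can still learn race-correlated patterns.
# File and column names are placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("cohort.csv")                      # hypothetical extract
proxy_features = ["zip_code", "insurance_type"]     # no explicit race variable

clf = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),          # encode categorical proxies
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(clf, df[proxy_features], df["race_ethnicity"],
                         cv=5, scoring="balanced_accuracy")
print("Balanced accuracy predicting race from proxies:", scores.mean())
# Scores well above chance (1 / number of groups) mean race is recoverable from
# the remaining variables, so dropping the race column alone does not make the
# model race-blind.
```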
4. Label Bias: Who Defines "Ground Truth"?
Supervised machine learning requires labeled data—examples where we know the "right answer." But in medicine, who decides what's right?
- Diagnostic labels reflect the judgments of clinicians who may themselves have implicit biases
- Treatment decisions reflect what patients could access, not necessarily what they needed
- Outcome measures may not capture what matters to all patient populations
If the "ground truth" labels are themselves biased, the model learns to replicate biased judgments with machine efficiency.
5. Feedback Loops and Amplification
Perhaps most concerning: AI systems can create feedback loops that amplify initial biases. If an algorithm labels certain patients "low risk," they receive less intervention and less follow-up. Their needs may worsen, but with fewer encounters those needs go largely unrecorded, so the data that flows back for retraining appears to confirm the algorithm's prediction.
Over time, small initial biases can compound into large disparities. The algorithm becomes a self-fulfilling prophecy.
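A toy simulation (all numbers invented) makes the dynamic visible: when allocation depends on recorded utilization, and recorded utilization depends on allocation, an initial measurement gap widens on its own.

```python
# Toy feedback-loop simulation with invented parameters. Group B's care is
# under-recorded at baseline, so it is flagged less often; unflagged patients'
# unmet need grows while their recorded data drifts downward, widening the gap.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
group = rng.choice(["A", "B"], size=n)
need = rng.gamma(shape=2.0, scale=1.0, size=n)     # true need, same distribution
access = np.where(group == "B", 0.8, 1.0)          # group B is under-documented

observed = need * access                            # what the model "sees"
for step in range(5):
    cutoff = np.quantile(observed, 0.90)            # top 10% get care management
    flagged = observed >= cutoff
    # Unflagged patients accumulate unmet need, but without extra encounters
    # their recorded utilization decays, so the data understates their need
    # more each round; group B is unflagged more often to begin with.
    need = need + np.where(flagged, 0.0, 0.05)
    observed = np.where(flagged, need, observed * 0.98)
    understatement = {g: round(float(np.mean((need - observed)[group == g])), 3)
                      for g in ("A", "B")}
    print("round", step, "avg unrecorded need:", understatement)
```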
The Gender Dimension
Bias isn't only about race. Gender disparities are pervasive in medical AI.
Consider cardiovascular disease: heart attacks present differently in women than in men. Women more often experience atypical symptoms—fatigue, nausea, back pain rather than classic crushing chest pain. Yet cardiovascular AI models are often trained predominantly on male data.
The result: AI systems that are less accurate at predicting heart disease in women, potentially contributing to the well-documented problem of women's cardiac symptoms being dismissed or misdiagnosed.
Similar patterns appear in:
- Pain assessment: Women's pain is historically undertreated; AI trained on treatment patterns may perpetuate this
- Mental health: Different presentation patterns and diagnostic rates by gender affect training data
- Drug dosing: Women were historically excluded from clinical trials; AI models may not accurately predict responses
- Autoimmune diseases: Conditions like lupus that disproportionately affect women may be underrepresented in general medical AI training sets
The historical exclusion of women from clinical trials until the 1990s means that much of the foundational medical data AI learns from was generated primarily from male subjects. We're still living with the consequences of that exclusion—now encoded in our algorithms.
Beyond Demographics: Other Axes of Bias
While race and gender receive the most attention, algorithmic bias extends across many dimensions:
Age
Elderly patients are often underrepresented in clinical trials and may present with atypical symptoms. AI models trained on younger populations may perform poorly for geriatric patients—precisely the population most likely to have complex medical needs.
Disability
Patients with disabilities may have different baseline vital signs, different communication patterns, and different care needs. If training data doesn't adequately represent these populations, AI recommendations may be inappropriate or even harmful.
Language and Literacy
Natural language processing models are predominantly trained on English text. Patients who speak other languages, or who have lower health literacy, may receive worse AI-assisted care. Clinical notes about these patients may also be systematically different in ways that affect model training.
Geographic Location
Rural patients face different disease prevalences, different access patterns, and different healthcare resources than urban patients. Models trained primarily on urban academic medical center data may not generalize well to rural settings.
Socioeconomic Status
Patients with lower socioeconomic status may have less complete medical records (due to fragmented care), different patterns of healthcare utilization, and different social determinants affecting their health. If AI learns to associate incomplete records with "lower risk," it may systematically underserve vulnerable populations.
These biases don't exist in isolation—they intersect. A Black woman, an elderly rural patient, a low-income person with a disability: these individuals exist at the intersection of multiple underrepresented categories. The compounding effect of intersectional bias can be greater than any single dimension alone.
The Regulatory Landscape
Regulators are beginning to respond. As of May 2024, the FDA had authorized 882 AI-enabled medical devices, an unprecedented surge. But the regulatory framework, largely established by the 1976 Medical Device Amendments, is struggling to keep pace.
Current Developments
- FDA: Implementing guidelines for AI safety and effectiveness, but primarily focused on high-level governance rather than technical bias detection
- WHO: Published ethics and governance guidance emphasizing fairness, equity, and explainability
- European Commission: The EU AI Act classifies medical AI as high-risk, requiring conformity assessments and bias documentation
- Health Canada: Developing frameworks for algorithmic transparency and bias reporting
The trend is toward greater scrutiny, but gaps remain. Current methods may be inadequate for detecting bias in complex AI systems, and enforcement is inconsistent.
What Clinicians Can Do
You may not build AI systems, but you use them—and your voice matters in how they're deployed. Here's how to engage with algorithmic bias as a clinician:
1. Ask Questions About the Tools You Use
- What data was this model trained on? What populations were represented?
- Has performance been validated across demographic groups?
- Are there known disparities in accuracy or outcomes?
- What's the process for reporting concerns about biased outputs?
2. Maintain Clinical Judgment
AI outputs are inputs to your decision-making, not replacements for it. When an algorithm's recommendation doesn't match your clinical intuition—especially for patients from underrepresented groups—trust your training. The model may be wrong in ways that reflect its training data limitations.
3. Document Discrepancies
If you notice an AI tool consistently performing poorly for certain patient populations, document it and report it. Systematic feedback is how these systems improve. Your observations from clinical practice are valuable data that algorithm developers often lack.
4. Advocate for Diverse Development Teams
Research shows that more diverse AI development teams build less biased systems. Only about 5% of active U.S. physicians identify as Black, and representation of underrepresented groups among AI developers is lower still. Support initiatives that increase diversity in both medicine and technology.
5. Support Inclusive Data Collection
Better AI requires better data. This means:
- Documenting social determinants of health in clinical encounters
- Supporting research that actively recruits diverse populations
- Advocating for data sharing practices that don't perpetuate existing gaps
The Philosophical Stakes
Algorithmic bias forces us to confront uncomfortable questions about medicine itself:
- If AI reflects our historical decisions, what does it reveal about the care we've provided?
- If "objective" algorithms encode subjective judgments, what counts as unbiased care?
- If we deploy AI to scale healthcare, are we scaling equity or inequity?
These aren't just technical questions with technical solutions. They're ethical questions that require ethical engagement.
Here's the hopeful framing: AI makes bias visible in ways that human decision-making doesn't. We can audit an algorithm. We can measure its disparate impact. We can track whether interventions improve equity.
Human clinicians carry implicit biases that are harder to detect and correct. If we approach AI development thoughtfully, these systems could actually reduce healthcare disparities by making bias measurable and addressable. But only if we actively design for fairness rather than assuming it will emerge automatically.
Looking Forward
The field is evolving. Promising developments include:
- Fairness-aware machine learning: Techniques that explicitly optimize for equitable performance across groups, treating fairness as a constraint in the optimization process rather than an afterthought (a minimal sketch of one such technique appears below)
- Algorithmic auditing: Systematic evaluation of AI systems for bias before and after deployment. Some organizations now require bias audits as part of their AI governance frameworks
- Datasheets for datasets: Standardized documentation of training data composition, limitations, and potential biases—similar to nutrition labels for food products
- Model cards: Documentation accompanying trained models that describes intended use cases, performance across populations, and known limitations
- Federated learning: Training models on distributed data without centralizing sensitive information, potentially enabling more diverse training sets while preserving privacy
- Synthetic data augmentation: Generating additional training examples for underrepresented groups to balance datasets—research shows this can reduce racial bias in some imaging applications
None of these are silver bullets. But together, they represent a maturing understanding that fairness must be intentionally designed, not assumed.
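As one concrete illustration, here is a minimal sketch of "reweighing" (Kamiran and Calders), a simple fairness-aware technique that assigns sample weights so that group membership and the outcome label look statistically independent in the training data. It is one option among many, not a recommendation for any particular system.

```python
# Sketch of reweighing: weight each (group, label) combination by
# expected count under independence / observed count, so the reweighted data
# shows no association between group and outcome. Inputs are assumed arrays.
import numpy as np

def reweighing_weights(group, label):
    group, label = np.asarray(group), np.asarray(label)
    n = len(group)
    weights = np.empty(n, dtype=float)
    for g in np.unique(group):
        for y in np.unique(label):
            mask = (group == g) & (label == y)
            if mask.any():
                expected = np.mean(group == g) * np.mean(label == y) * n
                weights[mask] = expected / mask.sum()
    return weights

# Most scikit-learn estimators accept these via the `sample_weight` argument,
# e.g. LogisticRegression().fit(X, y, sample_weight=reweighing_weights(group, y)).
```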
The Role of Diverse Teams
Technical solutions alone aren't enough. Research consistently shows that more diverse AI development teams build less biased systems. People with different lived experiences notice different potential failure modes. They ask different questions about who might be harmed.
This is why the composition of AI teams matters—not just for representation's sake, but for the quality of the products they build. Clinicians, patients, ethicists, and community representatives all have perspectives that pure technologists may lack.
Continuous Monitoring
Bias can emerge or worsen over time, even in systems that were initially fair. Population shifts, changes in clinical practice, and the feedback loops we discussed can all introduce or amplify disparities. This means bias detection isn't a one-time audit—it requires ongoing monitoring with stratified performance metrics.
Organizations deploying healthcare AI should establish processes for continuous bias monitoring, with clear escalation pathways when disparities are detected.
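In code, that monitoring can be as simple as recomputing a stratified metric on each month's logged predictions and flagging months where the gap between groups exceeds a tolerance set by governance. The column names, file, and threshold below are placeholders.

```python
# Sketch of ongoing stratified monitoring: compute AUC per demographic group for
# each month of logged predictions and flag months where the gap between the
# best- and worst-served groups exceeds a chosen tolerance. Column names and the
# tolerance are placeholders for whatever local governance defines.
import pandas as pd
from sklearn.metrics import roc_auc_score

TOLERANCE = 0.05    # maximum acceptable AUC gap between groups (placeholder)

def monthly_bias_report(df):
    """Expects columns: month, group, y_true (0/1 outcome), y_score (model output)."""
    rows = []
    for (month, grp), chunk in df.groupby(["month", "group"]):
        if chunk["y_true"].nunique() == 2:           # AUC needs both classes present
            rows.append({"month": month, "group": grp,
                         "auc": roc_auc_score(chunk["y_true"], chunk["y_score"])})
    report = pd.DataFrame(rows)
    gaps = report.groupby("month")["auc"].agg(lambda s: s.max() - s.min())
    return report, gaps[gaps > TOLERANCE]             # months needing escalation

# report, flagged_months = monthly_bias_report(pd.read_csv("predictions_log.csv"))
```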
Reflection Questions
- Think about AI tools you use or encounter in clinical practice. What populations might be underrepresented in their training data? How might this affect their recommendations?
- The Obermeyer study found that using healthcare costs as a proxy for health needs encoded racial bias. What other "reasonable" proxy measures might encode hidden biases?
- If AI makes bias more visible and measurable than human decision-making, does this represent an opportunity for improvement—or just a new form of the same old problem?
- How would you respond if you noticed an AI tool consistently giving different recommendations for similar patients from different demographic groups?
Learning Objectives
- Explain how AI systems inherit and potentially amplify biases from training data
- Describe the Obermeyer study and why using costs as a proxy for needs created racial bias
- Identify multiple pathways through which bias enters AI systems (data composition, proxy variables, label bias, feedback loops)
- Recognize how dermatology AI performance disparities illustrate the training data problem
- Apply a framework for questioning AI tools about their training data and validation across populations
- Articulate the philosophical argument that algorithmic bias reflects—and makes visible—human bias