Running AI Models on Your Own Computer
Technical guidance for running medical AI models locally—for privacy, offline access, and understanding how these systems work under the hood.
No current open-source medical model is validated for clinical decision-making. This guide is for educational and development purposes. Nothing here constitutes medical, legal, or compliance advice. Consult appropriate professionals before deploying AI systems in clinical settings.
Why Local Models Matter for Healthcare
Cloud-based AI services like ChatGPT and Claude offer powerful capabilities, but they present challenges for healthcare applications. Data sent to these services may be retained for training and lacks HIPAA safeguards without specific enterprise agreements, and the services require constant internet connectivity.
Local models flip this equation. When you run an AI model on your own hardware, patient data never leaves your machine. There's no network transmission, no third-party storage, no Business Associate Agreement required for internal experimentation.
Ideal Use Cases for Local Models
Prototyping
Test clinical applications before investing in HIPAA-compliant cloud infrastructure.
Education
Understand how these systems behave without compliance overhead.
Offline Access
Work in settings with unreliable connectivity.
Research
Full control over model parameters and outputs.
Cost Management
High-volume experimentation without API fees.
Privacy
Data never leaves your machine.
Local models running on consumer hardware are smaller and less capable than frontier models like GPT-4 or Claude. A 7B parameter model on your laptop won't match a frontier model running on data-center hardware. But for many tasks, such as answering medical questions, summarizing notes, and generating draft content, smaller models can be surprisingly effective.
Medical-Specific Models
Several open-source models have been specifically trained or fine-tuned for healthcare applications. These offer better baseline performance on medical tasks than general-purpose models of similar size. However, "better baseline performance" doesn't mean "ready for clinical use"—all require validation for any specific application.
MedGemma
Released by Google DeepMind in May 2025, MedGemma represents the current state-of-the-art for open medical models. Built on Google's Gemma 3 architecture, these models underwent continued pre-training on diverse medical data while preserving general capabilities.
| Variant | Parameters | Modalities | Key Training Data |
|---|---|---|---|
| MedGemma 4B Multimodal | 4 billion | Text + Images | Radiology, dermatology, ophthalmology, histopathology |
| MedGemma 27B Text | 27 billion | Text only | Medical literature, clinical text |
| MedGemma 27B Multimodal | 27 billion | Text + Images + EHR | All above plus FHIR-based EHR data |
Performance Claims:
- MedGemma 4B scores 64.4% on MedQA, ranking among the best sub-8B open models
- In a radiologist evaluation, 81% of its chest X-ray reports were judged of sufficient quality to result in patient management similar to the original radiologist reports
- MedGemma 27B Text scores 87.7% on MedQA—within 3 points of much larger proprietary models
Image Capabilities: The multimodal variants can process medical images including chest X-rays, dermatology photographs, fundus images, and histopathology slides.
Google explicitly states MedGemma is not clinical-grade. From their documentation: "MedGemma is not intended to be used without appropriate validation, adaptation and/or making meaningful modification by developers for their specific use case."
Early testing has revealed significant gaps. One clinician found MedGemma generated a "normal" interpretation for a chest X-ray with clear tuberculosis findings.
Hardware Requirements:
- MedGemma 4B: Runs on single consumer GPU (8GB+ VRAM) or Apple Silicon Macs with 16GB+ RAM
- MedGemma 27B: Requires substantial hardware—64GB+ RAM or professional GPUs with 24GB+ VRAM
Access: Available on Hugging Face (google/medgemma-4b-it for the instruction-tuned 4B variant). Requires agreeing to Google's terms of use.
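If you plan to run the weights outside a packaged runtime, one common route is the Hugging Face CLI; a minimal sketch, assuming you have a Hugging Face account and have already accepted the terms on the model page:
# Install the Hugging Face Hub CLI
pip install -U "huggingface_hub[cli]"
# Authenticate so the gated repository is accessible
huggingface-cli login
# Download the instruction-tuned 4B weights to the local cache
huggingface-cli download google/medgemma-4b-it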
Meditron
Developed by researchers at EPFL and Yale School of Medicine, Meditron adapts Meta's Llama 2 architecture through continued pretraining on curated medical data. The project emphasizes open access and low-resource settings.
Training Corpus (GAP-Replay, 48.1 billion tokens):
- Clinical guidelines: 46,000 guidelines from hospitals and international organizations (including ICRC)
- PubMed abstracts: 16.1 million abstracts from medical literature
- Medical papers: Full text from 5 million publicly available papers
- Replay data: 400 million tokens of general-domain content to prevent catastrophic forgetting
Model Variants:
- Meditron-7B: Smaller variant suitable for laptops with 16GB+ RAM at Q4 quantization
- Meditron-70B: Full-size variant that outperforms GPT-3.5 and Flan-PaLM on multiple medical reasoning benchmarks. Requires serious hardware: roughly 40GB+ of RAM or VRAM even at Q4 quantization.
Access: GitHub (epfLLM/meditron), Hugging Face. Available in Ollama as meditron:7b.
BioMistral
Built on the Mistral architecture and pre-trained on PubMed Central, BioMistral offers strong biomedical question-answering performance in a 7B parameter package. Its standout feature is multilingual evaluation—the team tested performance across eight languages.
Best Use Cases:
- Biomedical literature synthesis and analysis
- Research question-answering
- Applications requiring non-English language support
- Settings where smaller model size is important
Access: Hugging Face (BioMistral/BioMistral-7B), available in some Ollama model repositories.
Other Notable Models
OpenBioLLM-70B
Claims to outperform GPT-4 on several biomedical benchmarks. Built on Meta's Llama 3 with medical fine-tuning.
MedAlpaca-7B
Fine-tuned for medical dialogue and question-answering. Good for conversational medical education. Available in Ollama.
Meerkat-7B/8B
First 7B models to exceed USMLE passing threshold. Strong diagnostic reasoning on NEJM Case Challenges.
ClinicalBERT
Smaller encoder model trained on MIMIC-III clinical notes. Excellent for classification and entity recognition.
Choosing a Model
| Scenario | Recommended Models |
|---|---|
| Limited (8-16GB RAM) | Phi-3, MedAlpaca-7B, or Meditron-7B at Q4 |
| Medical QA focus | BioMistral-7B or Meditron-7B |
| Image + text tasks | MedGemma 4B (if hardware allows) |
| Research/academic | BioMistral for literature, Meditron for guidelines |
| Maximum capability | MedGemma 27B or OpenBioLLM-70B (significant hardware required) |
Hardware Requirements and Performance
Running local models requires understanding the relationship between model size, quantization, and your available hardware. The bottleneck is almost always memory—either system RAM or GPU VRAM.
Memory Math
A model's parameters are stored as floating-point numbers. In full precision (FP16), each parameter requires 2 bytes. So a 7B parameter model needs roughly 14GB just for the weights. Add overhead for inference and you're looking at 16-20GB for comfortable operation.
Most consumer hardware can't handle this, which is where quantization comes in.
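To make the arithmetic concrete (the overhead figures are rough rules of thumb, not measurements):
- FP16: 7 billion parameters × 2 bytes ≈ 14GB of weights, or 16-20GB with inference overhead
- 4-bit quantization: 7 billion parameters × 0.5 bytes ≈ 3.5GB of weights, which fits comfortably on a 16GB laptop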
Understanding Quantization
Quantization reduces precision from 16-bit floating point to smaller representations—8-bit, 4-bit, even 2-bit. This trades some accuracy for dramatically reduced memory requirements.
| Quantization | Bits/Parameter | 7B Model Size | Quality Impact |
|---|---|---|---|
| FP16 | 16 | ~14GB | Baseline |
| Q8_0 | 8 | ~7GB | Minimal loss |
| Q6_K | 6 | ~5.4GB | Very slight loss |
| Q5_K_M | 5 | ~4.7GB | Slight loss |
| Q4_K_M | 4 | ~3.8GB | Small but acceptable loss |
| Q3_K_M | 3 | ~3.0GB | Noticeable loss |
| Q2_K | 2 | ~2.7GB | Significant loss |
For most users, Q4_K_M is the sweet spot—good quality, reasonable size. The "_K_" notation indicates "K-quant" methods that intelligently quantize different parts of the model to preserve quality.
Practical Hardware Guidelines
8GB RAM (Minimum)
3B-4B parameter models at Q4. Slow inference, limited context. Examples: Phi-3 mini, TinyLlama.
16GB RAM (Comfortable)
7B models at Q4-Q5. Reasonable speed, 2K-4K context. Examples: MedGemma 4B, BioMistral 7B.
32GB RAM (Development)
7B at Q8 or 13B at Q4-Q5. Good inference speed, larger context windows.
64GB+ RAM (Serious)
27B+ models at Q4-Q5. Multiple models loaded simultaneously.
Apple Silicon Performance
M-series Macs deserve special mention. Their unified memory architecture, in which the CPU and GPU share the same RAM pool, eliminates the VRAM bottleneck that limits discrete-GPU setups on Windows and Linux. A 16GB M1 Mac can devote most of that memory to model weights rather than being capped by a separate VRAM pool.
| Mac Configuration | Comfortable Model Size |
|---|---|
| M1/M2 8GB | 3B-4B at Q4 |
| M1/M2 16GB | 7B at Q4-Q5 |
| M1/M2/M3 24GB | 7B at Q8 or 13B at Q4 |
| M1/M2/M3 Max 32GB+ | 13B at Q5-Q8, 27B at Q4 |
| M1/M2/M3 Ultra 64GB+ | 27B-70B models |
GPU Acceleration
For NVIDIA GPUs, the key metric is VRAM:
| VRAM | Typical Cards | Model Capacity |
|---|---|---|
| 4GB | GTX 1650, RTX 3050 | 3B-4B models only |
| 8GB | RTX 3060, RTX 4060 | 7B at Q4-Q5 comfortably |
| 12GB | RTX 3060 12GB, RTX 4070 | 7B at Q8, 13B at Q4 |
| 24GB | RTX 4090, A5000 | 13B at Q8, 27B at Q4 |
Partial offloading: You don't need to fit the entire model in VRAM. Both Ollama and LM Studio support loading some layers on GPU and others in system RAM. Even 50% GPU offloading provides meaningful speedup.
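With Ollama (installation covered below), the number of offloaded layers can be pinned through the num_gpu option rather than left to automatic detection; a sketch using its local API, with 20 layers as an arbitrary example:
curl http://localhost:11434/api/generate -d '{
  "model": "meditron:7b",
  "prompt": "Summarize the mechanism of action of metformin.",
  "options": { "num_gpu": 20 }
}'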
Security Setup for HIPAA-Compliant Work
Even though local models don't transmit data externally, proper security hygiene matters. This section covers practical steps rather than comprehensive compliance frameworks.
Local deployment addresses data transmission to third parties. When you prompt ChatGPT with patient information, that data travels to external servers. Local models eliminate this—data stays on your machine. But local deployment doesn't automatically create a HIPAA-compliant system. For any work involving actual patient information, full organizational compliance infrastructure is required.
Full Disk Encryption
Non-negotiable for any machine that might touch patient data. If your laptop is stolen, encryption prevents data access without the decryption key.
macOS (FileVault)
- Open System Settings → Privacy & Security → FileVault
- Click "Turn On FileVault"
- Choose recovery key option (store securely, not on the same machine)
- Encryption proceeds in background; may take several hours
Windows (BitLocker)
- Open Control Panel → System and Security → BitLocker Drive Encryption
- Click "Turn on BitLocker" for your OS drive
- Choose how to unlock (PIN or USB key for stronger security)
- Save recovery key to secure location
- Choose encryption scope (entire drive recommended)
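To confirm encryption is actually enabled, each platform ships a status command; a quick check, assuming stock tooling:
# macOS: report FileVault status
fdesetup status
# Windows (elevated PowerShell): report BitLocker status for all volumes
manage-bde -status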
Network Isolation
Local models don't need internet access once downloaded:
- Run in airplane mode during sessions involving any patient context
- Use application-level blocking (Little Snitch on macOS, Windows Firewall)
- Consider air-gapped machines for highest sensitivity work
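A quick way to check that an Ollama install is only reachable from the local machine (it binds to 127.0.0.1:11434 by default; the OLLAMA_HOST variable controls this); the commands assume macOS or Linux:
# Confirm the API answers on the loopback address
curl http://localhost:11434/api/tags
# Confirm nothing is listening on port 11434 beyond 127.0.0.1
lsof -nP -iTCP:11434 -sTCP:LISTEN
# Keep the server explicitly bound to loopback
export OLLAMA_HOST=127.0.0.1:11434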
Data Handling Practices
- Never paste real patient identifiers into prompts
- Use synthetic data or de-identified case presentations for development
- Clear chat histories after sessions
- Disable persistent storage of conversations when possible
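As a crude last-line check (not a substitute for proper de-identification), draft prompt files can be grepped for identifier-like patterns before use; the filename and patterns below are illustrative only:
# Flag SSN-like numbers in a draft prompt
grep -nE '[0-9]{3}-[0-9]{2}-[0-9]{4}' case_note.txt
# Flag date-like strings that could be dates of birth
grep -nE '[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}' case_note.txt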
Installation: Ollama
Ollama is the simplest path to running local models. It handles model downloading, quantization management, and provides both a command-line interface and local API. Think of it like Docker for language models.
macOS Installation
Option 1: Direct download (recommended)
- Visit ollama.com/download
- Download the macOS installer (.dmg file, ~50MB)
- Open the .dmg and drag Ollama to Applications
- Launch Ollama from Applications folder
- A llama icon appears in your menu bar when running
Option 2: Homebrew
brew install ollama
brew services start ollama
Windows Installation
- Download the Windows installer from ollama.com/download
- Run the .exe installer
- Follow prompts; Ollama installs as a background service
- The llama icon appears in system tray when running
GPU Setup for NVIDIA: Before installing Ollama, ensure you have recent NVIDIA drivers from nvidia.com/drivers. Ollama will automatically detect and use your GPU.
Linux Installation
curl -fsSL https://ollama.com/install.sh | sh
Verifying Installation
ollama --version
ollama list
Downloading Models
Browse available models at ollama.com/library.
# Small, fast model for testing setup
ollama pull phi3
# Medical-focused model
ollama pull meditron:7b
# Specific quantization variant
ollama pull llama3.1:8b-q4_k_m
Running Models
# Interactive chat
ollama run phi3
# Single prompt (no interactive session)
ollama run phi3 "What are the most common causes of pediatric fever?"
Type /bye to exit interactive mode.
Model Parameters
Generation parameters are set inside an interactive session with the /set command (they can also be supplied through the API's options field or baked into a Modelfile, as shown below). Inside ollama run phi3:
# Set temperature (0.0 = deterministic, 1.0 = creative)
/set parameter temperature 0.7
# Set context window size in tokens
/set parameter num_ctx 4096
# Limit the number of tokens to generate
/set parameter num_predict 500
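For settings you want to persist across sessions, the same parameters can be baked into a named model with a Modelfile; the phi3-lowtemp name below is just an example:
# Modelfile
FROM phi3
PARAMETER temperature 0.2
PARAMETER num_ctx 4096
PARAMETER num_predict 500
Then build and run it:
ollama create phi3-lowtemp -f Modelfile
ollama run phi3-lowtemp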
Useful Ollama Commands
| Command | Description |
|---|---|
| ollama list | List downloaded models with sizes |
| ollama show meditron:7b | Show model details |
| ollama rm phi3 | Remove a model |
| ollama pull phi3 | Pull latest version of a model |
| ollama ps | Show running models |
Local API
Ollama automatically serves a local REST API at http://localhost:11434:
curl http://localhost:11434/api/generate -d '{
"model": "phi3",
"prompt": "What are common causes of pediatric fever?"
}'
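Ollama also exposes an OpenAI-compatible endpoint under /v1, so client code written against the OpenAI API can be pointed at the local server; a sketch (the prompt is illustrative):
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi3",
    "messages": [{"role": "user", "content": "List red-flag features in pediatric fever."}]
  }'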
Installation: LM Studio
LM Studio provides a graphical interface for running local models, making it more approachable for users who prefer visual tools over command lines.
Installation
- Visit lmstudio.ai
- Download for your operating system
- macOS: Drag to Applications folder
- Windows: Run the installer
- Linux: Make the AppImage executable and run
Downloading Models
- Launch LM Studio
- Click "Discover" in the left navigation
- Search for a model (try "phi3" or "meditron")
- Click Download on your chosen variant
Running Models
- Click "Chat" in left navigation
- Click the dropdown at top to select your downloaded model
- Click "Load" to load the model into memory
- Start typing in the chat interface
GPU Offloading
- Go to model settings (gear icon)
- Adjust "GPU Layers" slider
- More layers = faster but more VRAM required
- Start with 50% and adjust based on performance
Comparing Ollama and LM Studio
| Aspect | Ollama | LM Studio |
|---|---|---|
| Interface | Command line + API | Graphical + API |
| Learning curve | Steeper initially | Gentler start |
| Resource usage | Lighter | Heavier |
| Model discovery | Manual download or pull | Built-in browser |
| Scripting/automation | Excellent | Limited |
| Best for | Developers, automation | Exploration, non-technical users |
Many users install both: LM Studio for discovery and quick testing, Ollama for integration and automation.
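LM Studio's API (the API column above) is served by its built-in local server, which also speaks the OpenAI-compatible protocol; the port (typically 1234) and the model name below are assumptions to verify against the server tab in your installation:
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-3",
    "messages": [{"role": "user", "content": "Summarize common causes of pediatric fever."}]
  }'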
Critical Limitations to Understand
No current open-source medical model is validated for clinical decision-making. MedGemma's documentation explicitly states it requires "appropriate validation, adaptation and/or meaningful modification" before any clinical use. Meditron's team "recommends against using Meditron in medical applications without extensive use-case alignment."
Real-World Failure Modes
1. False Precision
A model might confidently state "the sensitivity of rapid strep testing is 86%" when the actual figure varies by test manufacturer, technique, and population. The confident delivery masks significant uncertainty.
2. Pattern Matching Errors
Presented with "12-year-old with sore throat and fever," a model may generate a reasonable differential. But it can't actually examine the patient, notice the sandpaper rash suggesting scarlet fever, or appreciate that the child looks toxic versus well-appearing.
3. Outdated Information
Training data has a cutoff date. Guidelines change, drug interactions are discovered, new variants emerge. A model has no way to know that recommendations shifted six months ago.
4. Plausible but Wrong
One early tester found MedGemma generated a "normal" interpretation for a chest X-ray with obvious tuberculosis findings. The output was grammatically perfect, appropriately formatted, and completely wrong.
Benchmark Scores Don't Equal Clinical Competence
A model scoring 87% on MedQA (multiple-choice medical questions) isn't "87% as good as a doctor." MedQA measures pattern recognition and fact recall. Success does not require:
- Integration of physical examination findings
- Patient communication and history-taking
- Weighing benefits and harms in individual contexts
- Recognizing when a question is unanswerable
- Knowing the limits of one's own knowledge
- Managing uncertainty over time
The Hallucination Problem
All language models sometimes generate confident, fluent, and completely wrong outputs. This is an intrinsic architectural limitation, not a bug to be fixed.
In medical contexts, hallucination could mean:
- Inventing drug interactions that don't exist
- Citing non-existent studies with realistic-sounding titles
- Missing critical differential diagnoses while confidently presenting an incomplete list
- Providing incorrect dosing information
What Local Models Can Reasonably Do
Appropriate Uses
- Draft content for human review
- Answer general medical questions (consumer health level)
- Summarization with verification
- Development and prototyping
- Education and understanding AI behavior
Inappropriate Uses
- Patient-specific diagnostic recommendations
- Replacing clinical decision-making workflows
- Operating autonomously in patient-facing applications
- Handling real PHI without compliance infrastructure
- Informing care without independent verification
Resources
Model Repositories
- Hugging Face: google/medgemma-4b-it, BioMistral/BioMistral-7B
- GitHub: epfLLM/meditron
- Ollama library: ollama.com/library
Technical Documentation
- Ollama: ollama.com
- LM Studio: lmstudio.ai
Medical AI Evaluation
- Open Medical-LLM Leaderboard (Hugging Face)
- MedLLMs Practical Guide