Running AI Models on Your Own Computer
Technical guidance for running medical AI models locally—for privacy, offline access, and understanding how these systems work under the hood.
No current open-source medical model is validated for clinical decision-making. This guide is for educational and development purposes. Nothing here constitutes medical, legal, or compliance advice. Consult appropriate professionals before deploying AI systems in clinical settings.
Why Local Models Matter for Healthcare
Cloud-based AI services like ChatGPT and Claude offer powerful capabilities, but they present challenges for healthcare applications. Data sent to these services may be retained for training and lacks HIPAA safeguards without specific enterprise agreements, and the services require constant internet connectivity.
Local models flip this equation. When you run an AI model on your own hardware, patient data never leaves your machine. There's no network transmission, no third-party storage, no Business Associate Agreement required for internal experimentation.
Ideal Use Cases for Local Models
Prototyping
Test clinical applications before investing in HIPAA-compliant cloud infrastructure.
Education
Understand how these systems behave without compliance overhead.
Offline Access
Work in settings with unreliable connectivity.
Research
Full control over model parameters and outputs.
Cost Management
High-volume experimentation without API fees.
Privacy
Data never leaves your machine.
Local models running on consumer hardware are smaller and less capable than frontier models like GPT-4 or Claude. A 7B parameter model on your laptop won't match a frontier model running on data-center hardware. But for many tasks, such as answering medical questions, summarizing notes, and generating draft content, smaller models can be surprisingly effective.
Medical-Specific Models
Several open-source models have been specifically trained or fine-tuned for healthcare applications. These offer better baseline performance on medical tasks than general-purpose models of similar size. However, "better baseline performance" doesn't mean "ready for clinical use"—all require validation for any specific application.
MedGemma
Released by Google DeepMind in May 2025, MedGemma represents the current state-of-the-art for open medical models. Built on Google's Gemma 3 architecture, these models underwent continued pre-training on diverse medical data while preserving general capabilities.
| Variant | Parameters | Modalities | Key Training Data |
|---|---|---|---|
| MedGemma 4B Multimodal | 4 billion | Text + Images | Radiology, dermatology, ophthalmology, histopathology |
| MedGemma 27B Text | 27 billion | Text only | Medical literature, clinical text |
| MedGemma 27B Multimodal | 27 billion | Text + Images + EHR | All above plus FHIR-based EHR data |
Performance Claims:
- MedGemma 4B scores 64.4% on MedQA, ranking among the best sub-8B open models
- In a radiologist evaluation, 81% of its chest X-ray reports were judged of sufficient quality to result in patient management similar to the original radiologist reports
- MedGemma 27B Text scores 87.7% on MedQA—within 3 points of much larger proprietary models
Image Capabilities: The multimodal variants can process medical images including chest X-rays, dermatology photographs, fundus images, and histopathology slides.
Google explicitly states MedGemma is not clinical-grade. From their documentation: "MedGemma is not intended to be used without appropriate validation, adaptation and/or making meaningful modification by developers for their specific use case."
Early testing has revealed significant gaps. One clinician found MedGemma generated a "normal" interpretation for a chest X-ray with clear tuberculosis findings.
Hardware Requirements:
- MedGemma 4B: Runs on single consumer GPU (8GB+ VRAM) or Apple Silicon Macs with 16GB+ RAM
- MedGemma 27B: Requires substantial hardware—64GB+ RAM or professional GPUs with 24GB+ VRAM
Access: Available on Hugging Face (google/medgemma-4b-it for the instruction-tuned 4B variant). Requires agreeing to Google's terms of use.
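If you plan to run the weights outside a packaged runtime, one common route is the Hugging Face CLI; a minimal sketch, assuming you have a Hugging Face account and have already accepted the terms on the model page:
# Install the Hugging Face Hub CLI
pip install -U "huggingface_hub[cli]"
# Authenticate so the gated repository is accessible
huggingface-cli login
# Download the instruction-tuned 4B weights to the local cache
huggingface-cli download google/medgemma-4b-it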
Meditron
Developed by researchers at EPFL and Yale School of Medicine, Meditron adapts Meta's Llama 2 architecture through continued pretraining on curated medical data. The project emphasizes open access and low-resource settings.
Training Corpus (GAP-Replay, 48.1 billion tokens):
- Clinical guidelines: 46,000 guidelines from hospitals and international organizations (including ICRC)
- PubMed abstracts: 16.1 million abstracts from medical literature
- Medical papers: Full text from 5 million publicly available papers
- Replay data: 400 million tokens of general-domain content to prevent catastrophic forgetting
Model Variants:
- Meditron-7B: Smaller variant suitable for laptops with 16GB+ RAM at Q4 quantization
- Meditron-70B: Full-size variant that outperforms GPT-3.5 and Flan-PaLM on multiple medical reasoning benchmarks. Requires serious hardware: roughly 40GB+ of RAM or VRAM even at Q4 quantization.
Access: GitHub (epfLLM/meditron), Hugging Face. Available in Ollama as meditron:7b.
BioMistral
Built on the Mistral architecture and pre-trained on PubMed Central, BioMistral offers strong biomedical question-answering performance in a 7B parameter package. Its standout feature is multilingual evaluation—the team tested performance across eight languages.
Best Use Cases:
- Biomedical literature synthesis and analysis
- Research question-answering
- Applications requiring non-English language support
- Settings where smaller model size is important
Access: Hugging Face (BioMistral/BioMistral-7B), available in some Ollama model repositories.
Other Notable Models
OpenBioLLM-70B
Claims to outperform GPT-4 on several biomedical benchmarks. Built on Meta's Llama 3 with medical fine-tuning.
MedAlpaca-7B
Fine-tuned for medical dialogue and question-answering. Good for conversational medical education. Available in Ollama.
Meerkat-7B/8B
First 7B models to exceed USMLE passing threshold. Strong diagnostic reasoning on NEJM Case Challenges.
ClinicalBERT
Smaller encoder model trained on MIMIC-III clinical notes. Excellent for classification and entity recognition.
Choosing a Model
| Scenario | Recommended Models |
|---|---|
| Limited (8-16GB RAM) | Phi-3, MedAlpaca-7B, or Meditron-7B at Q4 |
| Medical QA focus | BioMistral-7B or Meditron-7B |
| Image + text tasks | MedGemma 4B (if hardware allows) |
| Research/academic | BioMistral for literature, Meditron for guidelines |
| Maximum capability | MedGemma 27B or OpenBioLLM-70B (significant hardware required) |
Hardware Requirements and Performance
Running local models requires understanding the relationship between model size, quantization, and your available hardware. The bottleneck is almost always memory—either system RAM or GPU VRAM.
Memory Math
A model's parameters are stored as floating-point numbers. In full precision (FP16), each parameter requires 2 bytes. So a 7B parameter model needs roughly 14GB just for the weights. Add overhead for inference and you're looking at 16-20GB for comfortable operation.
Most consumer hardware can't handle this, which is where quantization comes in.
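To make the arithmetic concrete (the overhead figures are rough rules of thumb, not measurements):
- FP16: 7 billion parameters × 2 bytes ≈ 14GB of weights, or 16-20GB with inference overhead
- 4-bit quantization: 7 billion parameters × 0.5 bytes ≈ 3.5GB of weights, which fits comfortably on a 16GB laptop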
Understanding Quantization
Quantization reduces precision from 16-bit floating point to smaller representations—8-bit, 4-bit, even 2-bit. This trades some accuracy for dramatically reduced memory requirements.
| Quantization | Bits/Parameter | 7B Model Size | Quality Impact |
|---|---|---|---|
| FP16 | 16 | ~14GB | Baseline |
| Q8_0 | 8 | ~7GB | Minimal loss |
| Q6_K | 6 | ~5.4GB | Very slight loss |
| Q5_K_M | 5 | ~4.7GB | Slight loss |
| Q4_K_M | 4 | ~3.8GB | Small but acceptable loss |
| Q3_K_M | 3 | ~3.0GB | Noticeable loss |
| Q2_K | 2 | ~2.7GB | Significant loss |
For most users, Q4_K_M is the sweet spot—good quality, reasonable size. The "_K_" notation indicates "K-quant" methods that intelligently quantize different parts of the model to preserve quality.
Practical Hardware Guidelines
8GB RAM (Minimum)
3B-4B parameter models at Q4. Slow inference, limited context. Examples: Phi-3 mini, TinyLlama.
16GB RAM (Comfortable)
7B models at Q4-Q5. Reasonable speed, 2K-4K context. Examples: MedGemma 4B, BioMistral 7B.
32GB RAM (Development)
7B at Q8 or 13B at Q4-Q5. Good inference speed, larger context windows.
64GB+ RAM (Serious)
27B+ models at Q4-Q5. Multiple models loaded simultaneously.
Apple Silicon Performance
M-series Macs deserve special mention. Their unified memory architecture, in which the CPU and GPU share the same RAM pool, eliminates the VRAM bottleneck that limits discrete-GPU setups on Windows and Linux. A 16GB M1 Mac can devote most of that memory to model weights rather than being capped by a separate VRAM pool.
| Mac Configuration | Comfortable Model Size |
|---|---|
| M1/M2 8GB | 3B-4B at Q4 |
| M1/M2 16GB | 7B at Q4-Q5 |
| M1/M2/M3 24GB | 7B at Q8 or 13B at Q4 |
| M1/M2/M3 Max 32GB+ | 13B at Q5-Q8, 27B at Q4 |
| M1/M2/M3 Ultra 64GB+ | 27B-70B models |
GPU Acceleration
For NVIDIA GPUs, the key metric is VRAM:
| VRAM | Typical Cards | Model Capacity |
|---|---|---|
| 4GB | GTX 1650, RTX 3050 | 3B-4B models only |
| 8GB | RTX 3060, RTX 4060 | 7B at Q4-Q5 comfortably |
| 12GB | RTX 3060 12GB, RTX 4070 | 7B at Q8, 13B at Q4 |
| 24GB | RTX 4090, A5000 | 13B at Q8, 27B at Q4 |
Partial offloading: You don't need to fit the entire model in VRAM. Both Ollama and LM Studio support loading some layers on GPU and others in system RAM. Even 50% GPU offloading provides meaningful speedup.
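With Ollama (installation covered below), the number of offloaded layers can be pinned through the num_gpu option rather than left to automatic detection; a sketch using its local API, with 20 layers as an arbitrary example:
curl http://localhost:11434/api/generate -d '{
  "model": "meditron:7b",
  "prompt": "Summarize the mechanism of action of metformin.",
  "options": { "num_gpu": 20 }
}'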
Security Setup for HIPAA-Compliant Work
Even though local models don't transmit data externally, proper security hygiene matters. This section covers practical steps rather than comprehensive compliance frameworks.
Local deployment addresses data transmission to third parties. When you prompt ChatGPT with patient information, that data travels to external servers. Local models eliminate this—data stays on your machine. But local deployment doesn't automatically create a HIPAA-compliant system. For any work involving actual patient information, full organizational compliance infrastructure is required.
Full Disk Encryption
Non-negotiable for any machine that might touch patient data. If your laptop is stolen, encryption prevents data access without the decryption key.
macOS (FileVault)
- Open System Settings → Privacy & Security → FileVault
- Click "Turn On FileVault"
- Choose recovery key option (store securely, not on the same machine)
- Encryption proceeds in background; may take several hours
Windows (BitLocker)
- Open Control Panel → System and Security → BitLocker Drive Encryption
- Click "Turn on BitLocker" for your OS drive
- Choose how to unlock (PIN or USB key for stronger security)
- Save recovery key to secure location
- Choose encryption scope (entire drive recommended)
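To confirm encryption is actually enabled, each platform ships a status command; a quick check, assuming stock tooling:
# macOS: report FileVault status
fdesetup status
# Windows (elevated PowerShell): report BitLocker status for all volumes
manage-bde -status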
Network Isolation
Local models don't need internet access once downloaded:
- Run in airplane mode during sessions involving any patient context
- Use application-level blocking (Little Snitch on macOS, Windows Firewall)
- Consider air-gapped machines for highest sensitivity work
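A quick way to check that an Ollama install is only reachable from the local machine (it binds to 127.0.0.1:11434 by default; the OLLAMA_HOST variable controls this); the commands assume macOS or Linux:
# Confirm the API answers on the loopback address
curl http://localhost:11434/api/tags
# Confirm nothing is listening on port 11434 beyond 127.0.0.1
lsof -nP -iTCP:11434 -sTCP:LISTEN
# Keep the server explicitly bound to loopback
export OLLAMA_HOST=127.0.0.1:11434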
Data Handling Practices
- Never paste real patient identifiers into prompts
- Use synthetic data or de-identified case presentations for development
- Clear chat histories after sessions
- Disable persistent storage of conversations when possible
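As a crude last-line check (not a substitute for proper de-identification), draft prompt files can be grepped for identifier-like patterns before use; the filename and patterns below are illustrative only:
# Flag SSN-like numbers in a draft prompt
grep -nE '[0-9]{3}-[0-9]{2}-[0-9]{4}' case_note.txt
# Flag date-like strings that could be dates of birth
grep -nE '[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}' case_note.txt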
Installation: Ollama
Ollama is the simplest path to running local models. It handles model downloading, quantization management, and provides both a command-line interface and local API. Think of it like Docker for language models.
macOS Installation
Option 1: Direct download (recommended)
- Visit ollama.com/download
- Download the macOS installer (.dmg file, ~50MB)
- Open the .dmg and drag Ollama to Applications
- Launch Ollama from Applications folder
- A llama icon appears in your menu bar when running
Option 2: Homebrew
brew install ollama
brew services start ollama
Windows Installation
- Download the Windows installer from ollama.com/download
- Run the .exe installer
- Follow prompts; Ollama installs as a background service
- The llama icon appears in system tray when running
GPU Setup for NVIDIA: Before installing Ollama, ensure you have recent NVIDIA drivers from nvidia.com/drivers. Ollama will automatically detect and use your GPU.
Linux Installation
curl -fsSL https://ollama.com/install.sh | sh
Verifying Installation
ollama --version
ollama list
Downloading Models
Browse available models at ollama.com/library.
# Small, fast model for testing setup
ollama pull phi3
# Medical-focused model
ollama pull meditron:7b
# Specific quantization variant
ollama pull llama3.1:8b-q4_k_m
Running Models
# Interactive chat
ollama run phi3
# Single prompt (no interactive session)
ollama run phi3 "What are the most common causes of pediatric fever?"
Type /bye to exit interactive mode.
Model Parameters
Generation parameters are set inside an interactive session with the /set command (they can also be supplied through the API's options field or baked into a Modelfile, as shown below). Inside ollama run phi3:
# Set temperature (0.0 = deterministic, 1.0 = creative)
/set parameter temperature 0.7
# Set context window size in tokens
/set parameter num_ctx 4096
# Limit the number of tokens to generate
/set parameter num_predict 500
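For settings you want to persist across sessions, the same parameters can be baked into a named model with a Modelfile; the phi3-lowtemp name below is just an example:
# Modelfile
FROM phi3
PARAMETER temperature 0.2
PARAMETER num_ctx 4096
PARAMETER num_predict 500
Then build and run it:
ollama create phi3-lowtemp -f Modelfile
ollama run phi3-lowtemp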
Useful Ollama Commands
| Command | Description |
|---|---|
| ollama list | List downloaded models with sizes |
| ollama show meditron:7b | Show model details |
| ollama rm phi3 | Remove a model |
| ollama pull phi3 | Pull latest version of a model |
| ollama ps | Show running models |
Local API
Ollama automatically serves a local REST API at http://localhost:11434:
curl http://localhost:11434/api/generate -d '{
"model": "phi3",
"prompt": "What are common causes of pediatric fever?"
}'
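Ollama also exposes an OpenAI-compatible endpoint under /v1, so client code written against the OpenAI API can be pointed at the local server; a sketch (the prompt is illustrative):
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi3",
    "messages": [{"role": "user", "content": "List red-flag features in pediatric fever."}]
  }'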
Installation: LM Studio
LM Studio provides a graphical interface for running local models, making it more approachable for users who prefer visual tools over command lines.
Installation
- Visit lmstudio.ai
- Download for your operating system
- macOS: Drag to Applications folder
- Windows: Run the installer
- Linux: Make the AppImage executable and run
Downloading Models
- Launch LM Studio
- Click "Discover" in the left navigation
- Search for a model (try "phi3" or "meditron")
- Click Download on your chosen variant
Running Models
- Click "Chat" in left navigation
- Click the dropdown at top to select your downloaded model
- Click "Load" to load the model into memory
- Start typing in the chat interface
GPU Offloading
- Go to model settings (gear icon)
- Adjust "GPU Layers" slider
- More layers = faster but more VRAM required
- Start with 50% and adjust based on performance
Comparing Ollama and LM Studio
| Aspect | Ollama | LM Studio |
|---|---|---|
| Interface | Command line + API | Graphical + API |
| Learning curve | Steeper initially | Gentler start |
| Resource usage | Lighter | Heavier |
| Model discovery | Manual download or pull | Built-in browser |
| Scripting/automation | Excellent | Limited |
| Best for | Developers, automation | Exploration, non-technical users |
Many users install both: LM Studio for discovery and quick testing, Ollama for integration and automation.
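LM Studio's API (the API column above) is served by its built-in local server, which also speaks the OpenAI-compatible protocol; the port (typically 1234) and the model name below are assumptions to verify against the server tab in your installation:
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-3",
    "messages": [{"role": "user", "content": "Summarize common causes of pediatric fever."}]
  }'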
Critical Limitations to Understand
No current open-source medical model is validated for clinical decision-making. MedGemma's documentation explicitly states it requires "appropriate validation, adaptation and/or meaningful modification" before any clinical use. Meditron's team "recommends against using Meditron in medical applications without extensive use-case alignment."
Real-World Failure Modes
1. False Precision
A model might confidently state "the sensitivity of rapid strep testing is 86%" when the actual figure varies by test manufacturer, technique, and population. The confident delivery masks significant uncertainty.
2. Pattern Matching Errors
Presented with "12-year-old with sore throat and fever," a model may generate a reasonable differential. But it can't actually examine the patient, notice the sandpaper rash suggesting scarlet fever, or appreciate that the child looks toxic versus well-appearing.
3. Outdated Information
Training data has a cutoff date. Guidelines change, drug interactions are discovered, new variants emerge. A model has no way to know that recommendations shifted six months ago.
4. Plausible but Wrong
One early tester found MedGemma generated a "normal" interpretation for a chest X-ray with obvious tuberculosis findings. The output was grammatically perfect, appropriately formatted, and completely wrong.
Benchmark Scores Don't Equal Clinical Competence
A model scoring 87% on MedQA (multiple-choice medical questions) isn't "87% as good as a doctor." MedQA measures pattern recognition and fact recall. Success does not require:
- Integration of physical examination findings
- Patient communication and history-taking
- Weighing benefits and harms in individual contexts
- Recognizing when a question is unanswerable
- Knowing the limits of one's own knowledge
- Managing uncertainty over time
The Hallucination Problem
All language models sometimes generate confident, fluent, and completely wrong outputs. This is an intrinsic architectural limitation, not a bug to be fixed.
In medical contexts, hallucination could mean:
- Inventing drug interactions that don't exist
- Citing non-existent studies with realistic-sounding titles
- Missing critical differential diagnoses while confidently presenting an incomplete list
- Providing incorrect dosing information
What Local Models Can Reasonably Do
Appropriate Uses
- Draft content for human review
- Answer general medical questions (consumer health level)
- Summarization with verification
- Development and prototyping
- Education and understanding AI behavior
Inappropriate Uses
- Patient-specific diagnostic recommendations
- Replacing clinical decision-making workflows
- Operating autonomously in patient-facing applications
- Handling real PHI without compliance infrastructure
- Informing care without independent verification
Resources
Model Repositories
- Hugging Face: google/medgemma-4b-it, BioMistral/BioMistral-7B
- GitHub: epfLLM/meditron
- Ollama library: ollama.com/library
Technical Documentation
- Ollama: ollama.com
- LM Studio: lmstudio.ai
Medical AI Evaluation
- Open Medical-LLM Leaderboard (Hugging Face)
- MedLLMs Practical Guide