
Running AI Models on Your Own Computer

Technical guidance for running medical AI models locally—for privacy, offline access, and understanding how these systems work under the hood.

Important Disclaimer

No current open-source medical model is validated for clinical decision-making. This guide is for educational and development purposes. Nothing here constitutes medical, legal, or compliance advice. Consult appropriate professionals before deploying AI systems in clinical settings.

Why Local Models Matter for Healthcare

Cloud-based AI services like ChatGPT and Claude offer powerful capabilities, but they present challenges for healthcare applications. Data sent to these services may be retained for training and isn't covered by HIPAA safeguards without specific enterprise agreements, and the services themselves require constant internet connectivity.

Local models flip this equation. When you run an AI model on your own hardware, patient data never leaves your machine. There's no network transmission, no third-party storage, no Business Associate Agreement required for internal experimentation.

Ideal Use Cases for Local Models

Prototyping

Test clinical applications before investing in HIPAA-compliant cloud infrastructure.

Education

Understand how these systems behave without compliance overhead.

Offline Access

Work in settings with unreliable connectivity.

Research

Full control over model parameters and outputs.

Cost Management

High-volume experimentation without API fees.

Privacy

Data never leaves your machine.

The Tradeoff

Local models running on consumer hardware are smaller and less capable than frontier models like GPT-4 or Claude. A 7B parameter model on your laptop won't match a frontier model with hundreds of billions or trillions of parameters running in a data center. But for many tasks—answering medical questions, summarizing notes, generating draft content—smaller models can be surprisingly effective.


Medical-Specific Models

Several open-source models have been specifically trained or fine-tuned for healthcare applications. These offer better baseline performance on medical tasks than general-purpose models of similar size. However, "better baseline performance" doesn't mean "ready for clinical use"—all require validation for any specific application.

MedGemma

Released by Google DeepMind in May 2025, MedGemma represents the current state-of-the-art for open medical models. Built on Google's Gemma 3 architecture, these models underwent continued pre-training on diverse medical data while preserving general capabilities.

| Variant | Parameters | Modalities | Key Training Data |
|---|---|---|---|
| MedGemma 4B Multimodal | 4 billion | Text + Images | Radiology, dermatology, ophthalmology, histopathology |
| MedGemma 27B Text | 27 billion | Text only | Medical literature, clinical text |
| MedGemma 27B Multimodal | 27 billion | Text + Images + EHR | All above plus FHIR-based EHR data |

Performance Claims:

Image Capabilities: The multimodal variants can process medical images including chest X-rays, dermatology photographs, fundus images, and histopathology slides.

Important Limitations

Google explicitly states MedGemma is not clinical-grade. From their documentation: "MedGemma is not intended to be used without appropriate validation, adaptation and/or making meaningful modification by developers for their specific use case."

Early testing has revealed significant gaps. One clinician found MedGemma generated a "normal" interpretation for a chest X-ray with clear tuberculosis findings.

Hardware Requirements:

Access: Available on Hugging Face (google/medgemma-4b-it for the instruction-tuned 4B variant). Requires agreeing to Google's terms of use.
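
A minimal download sketch using the Hugging Face CLI, assuming you have accepted the model terms on the model page and created an access token (the local directory name is just an example):

# Install the CLI, authenticate, then fetch the instruction-tuned 4B weights
pip install -U "huggingface_hub[cli]"
huggingface-cli login
huggingface-cli download google/medgemma-4b-it --local-dir medgemma-4b-it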

Meditron

Developed by researchers at EPFL and Yale School of Medicine, Meditron adapts Meta's Llama 2 architecture through continued pretraining on curated medical data. The project emphasizes open access and low-resource settings.

Training Corpus (GAP-Replay, 48.1 billion tokens): clinical practice guidelines, PubMed abstracts, and full-text medical papers (the "GAP" components), plus a replay of general-domain data to preserve general capabilities.

Model Variants: Meditron-7B and Meditron-70B.

Access: GitHub (epfLLM/meditron), Hugging Face. Available in Ollama as meditron:7b.

BioMistral

Built on the Mistral architecture and pre-trained on PubMed Central, BioMistral offers strong biomedical question-answering performance in a 7B parameter package. Its standout feature is multilingual evaluation—the team tested performance across eight languages.

Best Use Cases: biomedical question answering, literature-oriented tasks grounded in PubMed-style text, and multilingual medical applications.

Access: Hugging Face (BioMistral/BioMistral-7B), available in some Ollama model repositories.

Other Notable Models

OpenBioLLM-70B

Claims to outperform GPT-4 on several biomedical benchmarks. Built on Meta's Llama 3 with medical fine-tuning.

MedAlpaca-7B

Fine-tuned for medical dialogue and question-answering. Good for conversational medical education. Available in Ollama.

Meerkat-7B/8B

First 7B models to exceed USMLE passing threshold. Strong diagnostic reasoning on NEJM Case Challenges.

ClinicalBERT

Smaller encoder model trained on MIMIC-III clinical notes. Excellent for classification and entity recognition.

Choosing a Model

| Scenario | Recommended Models |
|---|---|
| Limited hardware (8-16GB RAM) | Phi-3, MedAlpaca-7B, or Meditron-7B at Q4 |
| Medical QA focus | BioMistral-7B or Meditron-7B |
| Image + text tasks | MedGemma 4B (if hardware allows) |
| Research/academic | BioMistral for literature, Meditron for guidelines |
| Maximum capability | MedGemma 27B or OpenBioLLM-70B (significant hardware required) |

Hardware Requirements and Performance

Running local models requires understanding the relationship between model size, quantization, and your available hardware. The bottleneck is almost always memory—either system RAM or GPU VRAM.

Memory Math

A model's parameters are stored as floating-point numbers. At 16-bit precision (FP16), each parameter requires 2 bytes, so a 7B parameter model needs roughly 14GB just for the weights. Add overhead for the KV cache and other inference buffers and you're looking at 16-20GB for comfortable operation.

Most consumer hardware can't handle this, which is where quantization comes in.
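
A rough back-of-the-envelope check of that math, using a ~20% overhead factor as an assumption for the KV cache and runtime buffers:

# Estimate memory: parameters * bits-per-parameter / 8 bytes, plus ~20% overhead
PARAMS=7e9   # 7B parameters
BITS=16      # FP16; try 4 to approximate Q4 quantization
awk -v p="$PARAMS" -v b="$BITS" 'BEGIN { printf "%.1f GB\n", p * b / 8 / 1e9 * 1.2 }'
# FP16 -> ~16.8 GB; 4-bit -> ~4.2 GB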

Understanding Quantization

Quantization reduces precision from 16-bit floating point to smaller representations—8-bit, 4-bit, even 2-bit. This trades some accuracy for dramatically reduced memory requirements.

| Quantization | Bits/Parameter | 7B Model Size | Quality Impact |
|---|---|---|---|
| FP16 | 16 | ~14GB | Baseline |
| Q8_0 | 8 | ~7GB | Minimal loss |
| Q6_K | 6 | ~5.4GB | Very slight loss |
| Q5_K_M | 5 | ~4.7GB | Slight loss |
| Q4_K_M | 4 | ~3.8GB | Small but acceptable |
| Q3_K_M | 3 | ~3.0GB | Noticeable loss |
| Q2_K | 2 | ~2.7GB | Significant loss |

For most users, Q4_K_M is the sweet spot—good quality, reasonable size. The "_K_" notation indicates "K-quant" methods that intelligently quantize different parts of the model to preserve quality.
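
To check which quantization a model you already have actually uses, Ollama (installed later in this guide) prints it alongside parameter count and context length; the model name here is just an example:

# Show metadata for a local model, including its quantization level
ollama show phi3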

Practical Hardware Guidelines


8GB RAM (Minimum)

3B-4B parameter models at Q4. Slow inference, limited context. Examples: Phi-3 mini, TinyLlama.


16GB RAM (Comfortable)

7B models at Q4-Q5. Reasonable speed, 2K-4K context. Examples: MedGemma 4B, BioMistral 7B.


32GB RAM (Development)

7B at Q8 or 13B at Q4-Q5. Good inference speed, larger context windows.


64GB+ RAM (Serious)

27B+ models at Q4-Q5. Multiple models loaded simultaneously.

Apple Silicon Performance

M-series Macs deserve special mention. Their unified memory architecture—where CPU and GPU share the same RAM pool—eliminates the VRAM bottleneck that limits Windows/Linux GPU usage. A 16GB M1 Mac can devote most of that memory to model weights, since macOS reserves only a portion of unified memory for the system.

| Mac Configuration | Comfortable Model Size |
|---|---|
| M1/M2 8GB | 3B-4B at Q4 |
| M1/M2 16GB | 7B at Q4-Q5 |
| M1/M2/M3 24GB | 7B at Q8 or 13B at Q4 |
| M1/M2/M3 Max 32GB+ | 13B at Q5-Q8, 27B at Q4 |
| M1/M2/M3 Ultra 64GB+ | 27B-70B models |

GPU Acceleration

For NVIDIA GPUs, the key metric is VRAM:

| VRAM | Typical Cards | Model Capacity |
|---|---|---|
| 4GB | GTX 1650, RTX 3050 | 3B-4B models only |
| 8GB | RTX 3060, RTX 4060 | 7B at Q4-Q5 comfortably |
| 12GB | RTX 3060 12GB, RTX 4070 | 7B at Q8, 13B at Q4 |
| 24GB | RTX 4090, A5000 | 13B at Q8, 27B at Q4 |

Partial offloading: You don't need to fit the entire model in VRAM. Both Ollama and LM Studio support loading some layers on GPU and others in system RAM. Even 50% GPU offloading provides meaningful speedup.
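
As a concrete sketch with Ollama (covered below), the GPU/CPU layer split can be set per request through the num_gpu option; the layer count here is illustrative and depends on your card's VRAM:

# Keep ~16 transformer layers on the GPU and run the rest from system RAM
curl http://localhost:11434/api/generate -d '{
  "model": "phi3",
  "prompt": "In one sentence, what is quantization?",
  "options": { "num_gpu": 16 }
}'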


Security Setup for HIPAA-Compliant Work

Even though local models don't transmit data externally, proper security hygiene matters. This section covers practical steps rather than comprehensive compliance frameworks.

What Local Models Solve

Local deployment addresses data transmission to third parties. When you prompt ChatGPT with patient information, that data travels to external servers. Local models eliminate this—data stays on your machine. But local deployment doesn't automatically create a HIPAA-compliant system. For any work involving actual patient information, full organizational compliance infrastructure is required.

Full Disk Encryption

Non-negotiable for any machine that might touch patient data. If your laptop is stolen, encryption prevents data access without the decryption key.

macOS (FileVault)

  1. Open System Settings → Privacy & Security → FileVault
  2. Click "Turn On FileVault"
  3. Choose recovery key option (store securely, not on the same machine)
  4. Encryption proceeds in the background and may take several hours; you can check progress as shown below
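
To check progress and final status from Terminal:

# Reports encryption progress, and "FileVault is On." once complete
fdesetup status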

Windows (BitLocker)

  1. Open Control Panel → System and Security → BitLocker Drive Encryption
  2. Click "Turn on BitLocker" for your OS drive
  3. Choose how to unlock (PIN or USB key for stronger security)
  4. Save recovery key to secure location
  5. Choose encryption scope (entire drive recommended)

Network Isolation

Local models don't need internet access once they're downloaded; only pulling new models or software updates requires connectivity.
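
A minimal sketch of that setup with Ollama, assuming the models you need are already pulled: keep the server bound to the loopback interface (Ollama's default) and confirm nothing is listening on a public address.

# Bind the Ollama server to loopback only (this is also the default)
export OLLAMA_HOST=127.0.0.1:11434
ollama serve &

# Verify the listener is on 127.0.0.1, not 0.0.0.0 or a LAN address
lsof -iTCP:11434 -sTCP:LISTEN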

Data Handling Practices

Even with a fully local model, prompts and outputs can persist on disk: chat interfaces typically save conversation history, and anything pasted into a prompt may end up in local logs or saved chats. Prefer synthetic or properly de-identified data for experimentation, and know where your tools store history so it can be purged if needed.


Installation: Ollama

Ollama is the simplest path to running local models. It handles model downloading, quantization management, and provides both a command-line interface and local API. Think of it like Docker for language models.

macOS Installation

Option 1: Direct download (recommended)

  1. Visit ollama.com/download
  2. Download the macOS installer (.dmg file, ~50MB)
  3. Open the .dmg and drag Ollama to Applications
  4. Launch Ollama from Applications folder
  5. A llama icon appears in your menu bar when running

Option 2: Homebrew

brew install ollama
brew services start ollama

Windows Installation

  1. Download the Windows installer from ollama.com/download
  2. Run the .exe installer
  3. Follow prompts; Ollama installs as a background service
  4. The llama icon appears in system tray when running

GPU Setup for NVIDIA: Before installing Ollama, ensure you have recent NVIDIA drivers from nvidia.com/drivers. Ollama will automatically detect and use your GPU.
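
A quick check that the GPU is visible and actually being used, assuming the NVIDIA driver utilities are installed:

# Confirm the driver sees the card
nvidia-smi

# Run a model briefly, then check whether Ollama placed it on the GPU
ollama run phi3 "hello" > /dev/null
ollama ps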

Linux Installation

curl -fsSL https://ollama.com/install.sh | sh

Verifying Installation

ollama --version
ollama list

Downloading Models

Browse available models at ollama.com/library.

# Small, fast model for testing setup
ollama pull phi3

# Medical-focused model
ollama pull meditron:7b

# Specific quantization variant (exact tags are listed on ollama.com/library)
ollama pull llama3.1:8b-instruct-q4_K_M

Running Models

# Interactive chat
ollama run phi3

# Single prompt (no interactive session)
ollama run phi3 "What are the most common causes of pediatric fever?"

Type /bye to exit interactive mode.
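
Because summarization is one of the use cases above, here is a minimal way to feed a file's contents into a prompt; sample_note.txt is a hypothetical, synthetic (non-PHI) note:

# Summarize a local text file by embedding it in the prompt
ollama run phi3 "Summarize the following clinical note in three bullet points: $(cat sample_note.txt)"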

Model Parameters

The ollama run command doesn't accept per-request sampling flags; parameters are set inside an interactive session with /set, or through the API's options field.

# Inside an interactive session (after `ollama run phi3`):

# Set temperature (0.0 = deterministic, higher = more varied)
/set parameter temperature 0.7

# Set context window size (in tokens)
/set parameter num_ctx 4096

# Set maximum number of tokens to generate
/set parameter num_predict 500

Useful Ollama Commands

| Command | Description |
|---|---|
| ollama list | List downloaded models with sizes |
| ollama show meditron:7b | Show model details |
| ollama rm phi3 | Remove a model |
| ollama pull phi3 | Pull latest version of a model |
| ollama ps | Show running models |

Local API

Ollama automatically serves a local REST API at http://localhost:11434:

curl http://localhost:11434/api/generate -d '{
  "model": "phi3",
  "prompt": "What are common causes of pediatric fever?"
}'
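
Ollama also exposes an OpenAI-compatible endpoint under /v1, so existing OpenAI-style clients and SDKs can point at the local server; no real API key is required:

# Same model through the OpenAI-compatible chat endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi3",
    "messages": [{"role": "user", "content": "What are common causes of pediatric fever?"}]
  }'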

Installation: LM Studio

LM Studio provides a graphical interface for running local models, making it more approachable for users who prefer visual tools over command lines.

Installation

  1. Visit lmstudio.ai
  2. Download for your operating system
  3. macOS: Drag to Applications folder
  4. Windows: Run the installer
  5. Linux: Make the AppImage executable and run

Downloading Models

  1. Launch LM Studio
  2. Click "Discover" in the left navigation
  3. Search for a model (try "phi3" or "meditron")
  4. Click Download on your chosen variant

Running Models

  1. Click "Chat" in left navigation
  2. Click the dropdown at top to select your downloaded model
  3. Click "Load" to load the model into memory
  4. Start typing in the chat interface
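
Once a model is loaded, LM Studio can also expose it through a local OpenAI-compatible server, enabled from its server/developer tab. A minimal request, assuming the default port 1234 and a placeholder model name that should match whatever you have loaded:

# Query LM Studio's local server (default port 1234; model name is a placeholder)
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-3-mini-4k-instruct",
    "messages": [{"role": "user", "content": "Define sensitivity and specificity."}]
  }'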

GPU Offloading

  1. Go to model settings (gear icon)
  2. Adjust "GPU Layers" slider
  3. More layers = faster but more VRAM required
  4. Start with 50% and adjust based on performance

Comparing Ollama and LM Studio

| Aspect | Ollama | LM Studio |
|---|---|---|
| Interface | Command line + API | Graphical + API |
| Learning curve | Steeper initially | Gentler start |
| Resource usage | Lighter | Heavier |
| Model discovery | Manual download or pull | Built-in browser |
| Scripting/automation | Excellent | Limited |
| Best for | Developers, automation | Exploration, non-technical users |

Many users install both: LM Studio for discovery and quick testing, Ollama for integration and automation.


Critical Limitations to Understand

These Models Are Not Clinical Grade

No current open-source medical model is validated for clinical decision-making. MedGemma's documentation explicitly states it requires "appropriate validation, adaptation and/or meaningful modification" before any clinical use. Meditron's team "recommends against using Meditron in medical applications without extensive use-case alignment."

Real-World Failure Modes

1. False Precision

A model might confidently state "the sensitivity of rapid strep testing is 86%" when the actual figure varies by test manufacturer, technique, and population. The confident delivery masks significant uncertainty.

2. Pattern Matching Errors

Presented with "12-year-old with sore throat and fever," a model may generate a reasonable differential. But it can't actually examine the patient, notice the sandpaper rash suggesting scarlet fever, or appreciate that the child looks toxic versus well-appearing.

3. Outdated Information

Training data has a cutoff date. Guidelines change, drug interactions are discovered, new variants emerge. A model has no way to know that recommendations shifted six months ago.

4. Plausible but Wrong

One early tester found MedGemma generated a "normal" interpretation for a chest X-ray with obvious tuberculosis findings. The output was grammatically perfect, appropriately formatted, and completely wrong.

Benchmark Scores Don't Equal Clinical Competence

A model scoring 87% on MedQA (multiple-choice medical questions) isn't "87% as good as a doctor." MedQA measures pattern recognition and fact recall. Success does not require examining a patient, weighing ambiguous or conflicting findings, communicating under uncertainty, or taking responsibility for the outcome.

The Hallucination Problem

All language models sometimes generate confident, fluent, and completely wrong outputs. This is an intrinsic architectural limitation, not a bug to be fixed.

In medical contexts, hallucination could mean an invented drug dose, a fabricated study citation, a contraindication that doesn't exist, or a guideline recommendation that was never issued.

What Local Models Can Reasonably Do

Appropriate Uses

  • Draft content for human review
  • Answer general medical questions (consumer health level)
  • Summarization with verification
  • Development and prototyping
  • Education and understanding AI behavior

Inappropriate Uses

  • Patient-specific diagnostic recommendations
  • Replacing clinical decision-making workflows
  • Operating autonomously in patient-facing applications
  • Handling real PHI without compliance infrastructure
  • Informing care without independent verification

Resources

Model Repositories

Technical Documentation

Medical AI Evaluation

Community