Running AI Locally
Cloud AI is convenient, but sometimes you want AI that runs entirely on your own hardware. Privacy, cost, offline access—there are many reasons to go local.
Why Run AI Locally?
1. Privacy
Your data never leaves your machine.
- Sensitive documents stay secure
- No third-party data sharing
- Comply with strict data policies
2. Cost Control
No per-token fees.
- Pay once for hardware
- Unlimited usage
- Predictable costs
3. Offline Access
Works without internet.
- Remote locations
- Airplane mode
- Unreliable connections
4. Customization
Full control over the setup.
- Choose your model
- Fine-tune freely
- Modify as needed
5. Speed (Sometimes)
No network latency.
- Faster for small models
- Consistent response times
Hardware Requirements
CPU-Only
- For: Testing, light use
- Works: Llama 3 8B (slow), Phi-3 (usable)
- Reality: Painfully slow for serious use
Gaming GPU (RTX 3080/4090)
- VRAM: 10-24GB
- Works: 7B-13B models run well; 30B models are tight
- Cost: $500-2000 for the GPU
- Reality: Sweet spot for hobbyists
Multiple Consumer GPUs
- VRAM: 24-48GB combined
- Works: Up to ~70B models (quantized)
- Complexity: Needs software that supports multi-GPU setups
- Reality: More hassle than a single powerful GPU
Professional GPUs (A100, H100)
- VRAM: 40-80GB
- Works: Any open model
- Cost: $10,000-30,000 to buy, or $2-5/hour to rent
- Reality: For serious production use
Apple Silicon (M1/M2/M3)
- Unified Memory: 16-192GB
- Works: Good balance of model size and speed
- Reality: Surprisingly capable, with a good developer experience
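Before picking a model size, it helps to check how much memory you actually have to work with. A quick sketch for the two most common setups (output formats vary by version):
# NVIDIA GPUs: total and free VRAM
nvidia-smi --query-gpu=memory.total,memory.free --format=csv
# Apple Silicon: unified memory appears on the "Memory" line
system_profiler SPHardwareDataType | grep "Memory:"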
Getting Started: Ollama
Ollama is the easiest way to run local AI.
Installation
# macOS/Linux
curl -fsSL https://ollama.ai/install.sh | sh
# Windows
# Download from ollama.ai
Running Models
# Download and run Llama 3
ollama run llama3
# Run Mistral
ollama run mistral
# Run a specific size
ollama run llama3:70b
API Access
Ollama runs a local server:
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "Why is the sky blue?"
}'
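Note that the generate endpoint streams its response as a series of JSON objects by default. For a single JSON reply, or for multi-turn conversations, the chat endpoint with streaming turned off works too (same stock install, same default port):
curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "stream": false,
  "messages": [
    {"role": "user", "content": "Why is the sky blue?"}
  ]
}'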
Other Local Tools
LM Studio
- GUI for running models
- Easy model downloading
- Good for beginners
llama.cpp
- Efficient inference engine
- Supports quantization
- Command line focused
Text Generation Web UI
- Feature-rich web interface
- Many model formats
- Lots of settings
vLLM
- Production-focused
- High throughput
- OpenAI-compatible API
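Because vLLM exposes an OpenAI-compatible API, existing client code can be pointed at it by swapping the base URL. A minimal sketch, assuming a recent vLLM install; the model ID is just an example:
# Start an OpenAI-compatible server (listens on port 8000 by default)
vllm serve meta-llama/Meta-Llama-3-8B-Instruct
# Query it like any OpenAI-style chat completions endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'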
Quantization: Fit Bigger Models
Quantization stores model weights at lower numeric precision, trading a small amount of quality for large memory savings:
| Quantization | Memory per Billion Params | Quality Impact |
|---|---|---|
| FP16 (full) | ~2GB | Baseline |
| 8-bit | ~1GB | Minimal |
| 4-bit (Q4) | ~0.5GB | Small |
| 2-bit | ~0.25GB | Noticeable |
Practical impact:
- 70B model at FP16: ~140GB
- 70B model at Q4: ~35GB (runs on high-end consumer hardware!)
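You rarely quantize a model yourself; you download a pre-quantized build. In Ollama, quantization levels are separate tags on the same model. The tags below are examples and may change, so check the model's page on ollama.ai:
# Rough math: ~0.5GB per billion parameters at Q4, plus a little overhead for the context
ollama run llama3:8b-instruct-q4_0     # roughly 5GB, fits most gaming GPUs
ollama run llama3:70b-instruct-q4_0    # roughly 40GB, needs workstation-class hardware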
Performance Tips
Memory Optimization
- Close other apps
- Use quantized models
- Reduce context length
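Context length matters because the KV cache grows with it. In Ollama you can cap it per request with the num_ctx option (the prompt here is a placeholder):
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Summarize the following notes: ...",
  "options": {"num_ctx": 2048}
}'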
Speed Optimization
- GPU offloading: put as many layers on the GPU as VRAM allows
- Batch requests when possible
- Use speculative decoding
Model Selection
- Start with 7B models
- Move to 13B-30B if hardware allows
- 70B only if you have serious hardware
Realistic Expectations
What Works Well
- 7B models on gaming hardware
- Single-user applications
- Privacy-sensitive tasks
- Offline development
What's Challenging
- Multiple concurrent users
- Matching GPT-4 quality
- Video/audio models locally
- Training (vs inference)
Cost Comparison
Scenario: 100,000 queries/month
Cloud (GPT-4o mini):
- ~$50-100/month
Local (RTX 4090 + electricity):
- Hardware: $2000 upfront
- Electricity: ~$20/month
- Break-even: roughly 2 years at the high end of the cloud estimate (see the arithmetic below)
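The break-even estimate is just hardware cost divided by monthly savings (cloud bill minus electricity). Checking the numbers above with shell arithmetic:
# $2000 of hardware divided by what you save each month vs the cloud
echo $(( 2000 / (100 - 20) ))   # 25 months, using the high end of the cloud estimate
echo $(( 2000 / (50 - 20) ))    # 66 months, using the low end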
Decision factors:
- Privacy requirements
- Usage volume
- Latency needs
- Model capability needs
The Bottom Line
Local AI is increasingly practical:
- Tools are much easier to use than they were a year ago
- Consumer hardware runs 7B models well
- Quality gap with cloud is narrowing
Good for:
- Privacy-sensitive applications
- High-volume cost optimization
- Offline scenarios
- Learning and experimentation
Not ideal for:
- Workloads that need the absolute best model quality
- Very limited hardware budgets
- Video/audio processing
Next up: The Future of AI — Where this is all heading