
Running AI Locally

AI that runs on your own hardware



Cloud AI is convenient, but sometimes you want AI that runs entirely on your own hardware. Privacy, cost, offline access—there are many reasons to go local.

Why Run AI Locally?

1. Privacy

Your data never leaves your machine.

  • Sensitive documents stay secure
  • No third-party data sharing
  • Comply with strict data policies

2. Cost Control

No per-token fees.

  • Pay once for hardware
  • Unlimited usage
  • Predictable costs

3. Offline Access

Works without internet.

  • Remote locations
  • Airplane mode
  • Unreliable connections

4. Customization

Full control over the setup.

  • Choose your model
  • Fine-tune freely
  • Modify as needed

5. Speed (Sometimes)

No network latency.

  • Faster for small models
  • Consistent response times

Hardware Requirements

CPU-Only

  • For: Testing, light use
  • Works: Llama 3 8B (slow), Phi-3 (usable)
  • Reality: Painfully slow for serious use

Gaming GPU (RTX 3080/4090)

  • VRAM: 10-24GB
  • Works: 7B-13B models well, 30B models tight
  • Cost: $500-2000 for the GPU
  • Reality: Sweet spot for hobbyists

Multiple Consumer GPUs

  • VRAM: 24-48GB combined
  • Works: Up to ~70B models
  • Complexity: Need software that supports multi-GPU
  • Reality: More hassle than a single powerful GPU

Professional GPUs (A100, H100)

  • VRAM: 40-80GB
  • Works: Any open model
  • Cost: $10,000-30,000 (or rent for $2-5/hour)
  • Reality: For serious production use

Apple Silicon (M1/M2/M3)

  • Unified Memory: 16-192GB
  • Works: Good balance of model size and speed
  • Reality: Surprisingly capable, good developer experience
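
Not sure which tier your machine falls into? On NVIDIA hardware, total VRAM is easy to read from nvidia-smi. A minimal Python sketch, assuming nvidia-smi is on your PATH (on Apple Silicon, check the unified memory figure in About This Mac instead):

import subprocess

# Print each GPU's name and total VRAM using nvidia-smi (NVIDIA GPUs only).
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
for line in result.stdout.strip().splitlines():
    name, vram = [part.strip() for part in line.split(",")]
    print(f"{name}: {vram}")  # e.g. "NVIDIA GeForce RTX 4090: 24564 MiB"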

Getting Started: Ollama

Ollama is the easiest way to run local AI.

Installation

# macOS/Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Windows
# Download from ollama.ai

Running Models

# Download and run Llama 3
ollama run llama3

# Run Mistral
ollama run mistral

# Run a specific size
ollama run llama3:70b

API Access

Ollama runs a local server:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?"
}'
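
The same endpoint is straightforward to call from a script. A small Python sketch using the requests library; the stream flag (to get a single JSON object back instead of streamed chunks) and the "response" field name follow Ollama's generate API, so verify them against the version you have installed:

import requests

# Ask the local Ollama server for a single, non-streamed completion.
reply = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Why is the sky blue?",
        "stream": False,  # one JSON object instead of a stream of chunks
    },
    timeout=120,
)
reply.raise_for_status()
print(reply.json()["response"])  # the generated text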

Other Local Tools

LM Studio

  • GUI for running models
  • Easy model downloading
  • Good for beginners

llama.cpp

  • Efficient inference engine
  • Supports quantization
  • Command line focused

Text Generation Web UI

  • Feature-rich web interface
  • Many model formats
  • Lots of settings

vLLM

  • Production-focused
  • High throughput
  • OpenAI-compatible API (see the sketch below)
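
Because vLLM (and several other local servers) exposes an OpenAI-compatible API, existing OpenAI client code can often be pointed at the local server just by changing the base URL. A sketch using the openai Python package; the port 8000 default and the model name are assumptions about how the vLLM server was launched:

from openai import OpenAI

# Point the standard OpenAI client at a local vLLM server instead of the cloud.
# Most local servers ignore the API key, but the client requires some value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # whichever model you launched vLLM with
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(completion.choices[0].message.content)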

Quantization: Fit Bigger Models

Quantization reduces the numerical precision of a model's weights to save memory:

Quantization    Memory per Billion Params    Quality Impact
FP16 (full)     ~2GB                         Baseline
8-bit           ~1GB                         Minimal
4-bit (Q4)      ~0.5GB                       Small
2-bit           ~0.25GB                      Noticeable

Practical impact:

  • 70B model at FP16: ~140GB
  • 70B model at Q4: ~35GB (runs on high-end consumer hardware!)
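
A quick way to check whether a model will fit is to multiply its parameter count by the per-billion figures in the table above. A rough Python sketch (weights only; actual usage runs somewhat higher because of the KV cache and runtime overhead):

# Approximate memory per billion parameters, from the table above (GB).
GB_PER_BILLION = {"FP16": 2.0, "8-bit": 1.0, "4-bit": 0.5, "2-bit": 0.25}

def weights_gb(params_billions: float, quant: str) -> float:
    """Memory for the weights alone; leave headroom for the KV cache."""
    return params_billions * GB_PER_BILLION[quant]

for quant in GB_PER_BILLION:
    print(f"70B at {quant}: ~{weights_gb(70, quant):.1f} GB")
# FP16 ~140 GB, 8-bit ~70 GB, 4-bit ~35 GB, 2-bit ~17.5 GB

Comparing these numbers with the VRAM tiers above shows why 7B-13B models fit comfortably on a single gaming GPU at 4-bit, while a 4-bit 70B model needs the multi-GPU or high-memory Apple Silicon class of hardware.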

Performance Tips

Memory Optimization

  • Close other apps
  • Use quantized models
  • Reduce context length

Speed Optimization

  • GPU offloading (as much as fits; see the sketch after this list)
  • Batch requests when possible
  • Use speculative decoding
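
Both the context-length and GPU-offloading tips map to per-request options in Ollama. A hedged Python sketch; the option names num_ctx (context window size) and num_gpu (layers to offload) come from Ollama's Modelfile parameters and may differ in other runtimes:

import requests

# Request a completion with a smaller context window and as many layers
# on the GPU as possible. These option names are Ollama-specific.
reply = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Summarize quantization in one sentence.",
        "stream": False,
        "options": {
            "num_ctx": 2048,  # smaller context window, less memory
            "num_gpu": 99,    # offload up to this many layers to the GPU
        },
    },
    timeout=120,
)
print(reply.json()["response"])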

Model Selection

  • Start with 7B models
  • Move to 13B-30B if hardware allows
  • 70B only if you have serious hardware

Realistic Expectations

What Works Well

  • 7B models on gaming hardware
  • Single-user applications
  • Privacy-sensitive tasks
  • Offline development

What's Challenging

  • Multiple concurrent users
  • Matching GPT-4 quality
  • Video/audio models locally
  • Training (vs inference)

Cost Comparison

Scenario: 100,000 queries/month

Cloud (GPT-4o mini):

  • ~$50-100/month

Local (RTX 4090 + electricity):

  • Hardware: $2000 upfront
  • Electricity: ~$20/month
  • Break-even: ~2 years at the high end of the cloud estimate, longer at the low end
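
The break-even figure is simple arithmetic on the numbers above: divide the upfront hardware cost by the monthly saving. A small Python sketch using the figures above (your API pricing and electricity rates will differ):

# Break-even for local vs cloud, using the figures above.
hardware_cost = 2000          # RTX 4090 build, upfront ($)
electricity_per_month = 20    # local running cost ($/month)

for cloud_per_month in (50, 100):  # GPT-4o mini estimate range ($/month)
    monthly_saving = cloud_per_month - electricity_per_month
    months = hardware_cost / monthly_saving
    print(f"Cloud at ${cloud_per_month}/mo: break-even in ~{months:.0f} months")
# ~67 months at the low end, ~25 months (about 2 years) at the high end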

Decision factors:

  • Privacy requirements
  • Usage volume
  • Latency needs
  • Model capability needs

The Bottom Line

Local AI is increasingly practical:

  • Tools are much easier to use than a year ago
  • Consumer hardware runs 7B models well
  • Quality gap with cloud is narrowing

Good for:

  • Privacy-sensitive applications
  • High-volume cost optimization
  • Offline scenarios
  • Learning and experimentation

Not ideal for:

  • Tasks that need the absolute best quality
  • Limited hardware budget
  • Video/audio processing

Next up: The Future of AI — Where this is all heading
