Running AI Locally
Cloud AI is convenient, but sometimes you want AI that runs entirely on your own hardware. Privacy, cost, offline access—there are many reasons to go local.
Why Run AI Locally?
1. Privacy
Your data never leaves your machine.
- Sensitive documents stay secure
- No third-party data sharing
- Comply with strict data policies
2. Cost Control
No per-token fees.
- Pay once for hardware
- Unlimited usage
- Predictable costs
3. Offline Access
Works without internet.
- Remote locations
- Airplane mode
- Unreliable connections
4. Customization
Full control over the setup.
- Choose your model
- Fine-tune freely
- Modify as needed
5. Speed (Sometimes)
No network latency.
- Faster for small models
- Consistent response times
Hardware Requirements
CPU-Only
- For: Testing, light use
- Works: Llama 3 8B (slow), Phi-3 (usable)
- Reality: Painfully slow for serious use
Gaming GPU (RTX 3080/4090)
- VRAM: 10-24GB
- Works: 7B-13B models run well; 30B models are tight
- Cost: $500-2000 for the GPU
- Reality: Sweet spot for hobbyists
Multiple Consumer GPUs
- VRAM: 24-48GB combined
- Works: Up to ~70B models (quantized)
- Complexity: Needs software that supports multi-GPU setups
- Reality: More hassle than a single powerful GPU
Professional GPUs (A100, H100)
- VRAM: 40-80GB
- Works: Any open model
- Cost: $10,000-30,000 to buy, or $2-5/hour to rent
- Reality: For serious production use
Apple Silicon (M1/M2/M3)
- Unified Memory: 16-192GB
- Works: Good balance of model size and speed
- Reality: Surprisingly capable, with a good developer experience
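Before picking a model size, it helps to check how much memory you actually have to work with. A quick sketch for the two most common setups (output formats vary by version):
# NVIDIA GPUs: total and free VRAM
nvidia-smi --query-gpu=memory.total,memory.free --format=csv
# Apple Silicon: unified memory appears on the "Memory" line
system_profiler SPHardwareDataType | grep "Memory:"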
Getting Started: Ollama
Ollama is the easiest way to run local AI.
Installation
# macOS/Linux
curl -fsSL https://ollama.ai/install.sh | sh
# Windows
# Download from ollama.ai
Running Models
# Download and run Llama 3
ollama run llama3
# Run Mistral
ollama run mistral
# Run a specific size
ollama run llama3:70b
API Access
Ollama runs a local server:
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "Why is the sky blue?"
}'
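Note that the generate endpoint streams its response as a series of JSON objects by default. For a single JSON reply, or for multi-turn conversations, the chat endpoint with streaming turned off works too (same stock install, same default port):
curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "stream": false,
  "messages": [
    {"role": "user", "content": "Why is the sky blue?"}
  ]
}'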
Other Local Tools
LM Studio
- GUI for running models
- Easy model downloading
- Good for beginners
llama.cpp
- Efficient inference engine
- Supports quantization
- Command line focused
Text Generation Web UI
- Feature-rich web interface
- Many model formats
- Lots of settings
vLLM
- Production-focused
- High throughput
- OpenAI-compatible API
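Because vLLM exposes an OpenAI-compatible API, existing client code can be pointed at it by swapping the base URL. A minimal sketch, assuming a recent vLLM install; the model ID is just an example:
# Start an OpenAI-compatible server (listens on port 8000 by default)
vllm serve meta-llama/Meta-Llama-3-8B-Instruct
# Query it like any OpenAI-style chat completions endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'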
Quantization: Fit Bigger Models
Quantization stores model weights at lower numeric precision, trading a small amount of quality for large memory savings:
| Quantization | Memory per Billion Params | Quality Impact |
|---|---|---|
| FP16 (full) | ~2GB | Baseline |
| 8-bit | ~1GB | Minimal |
| 4-bit (Q4) | ~0.5GB | Small |
| 2-bit | ~0.25GB | Noticeable |
Practical impact:
- 70B model at FP16: ~140GB
- 70B model at Q4: ~35GB (runs on high-end consumer hardware!)
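You rarely quantize a model yourself; you download a pre-quantized build. In Ollama, quantization levels are separate tags on the same model. The tags below are examples and may change, so check the model's page on ollama.ai:
# Rough math: ~0.5GB per billion parameters at Q4, plus a little overhead for the context
ollama run llama3:8b-instruct-q4_0     # roughly 5GB, fits most gaming GPUs
ollama run llama3:70b-instruct-q4_0    # roughly 40GB, needs workstation-class hardware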
Performance Tips
Memory Optimization
- Close other apps
- Use quantized models
- Reduce context length
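Context length matters because the KV cache grows with it. In Ollama you can cap it per request with the num_ctx option (the prompt here is a placeholder):
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Summarize the following notes: ...",
  "options": {"num_ctx": 2048}
}'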
Speed Optimization
- GPU offloading: put as many layers on the GPU as VRAM allows
- Batch requests when possible
- Use speculative decoding
Model Selection
- Start with 7B models
- Move to 13B-30B if hardware allows
- 70B only if you have serious hardware
Realistic Expectations
What Works Well
- 7B models on gaming hardware
- Single-user applications
- Privacy-sensitive tasks
- Offline development
What's Challenging
- Multiple concurrent users
- Matching GPT-4 quality
- Video/audio models locally
- Training (vs inference)
Cost Comparison
Scenario: 100,000 queries/month
Cloud (GPT-4o mini):
- ~$50-100/month
Local (RTX 4090 + electricity):
- Hardware: $2000 upfront
- Electricity: ~$20/month
- Break-even: roughly 2 years at the high end of the cloud estimate (see the arithmetic below)
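The break-even estimate is just hardware cost divided by monthly savings (cloud bill minus electricity). Checking the numbers above with shell arithmetic:
# $2000 of hardware divided by what you save each month vs the cloud
echo $(( 2000 / (100 - 20) ))   # 25 months, using the high end of the cloud estimate
echo $(( 2000 / (50 - 20) ))    # 66 months, using the low end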
Decision factors:
- Privacy requirements
- Usage volume
- Latency needs
- Model capability needs
The Bottom Line
Local AI is increasingly practical:
- Tools are much easier to use than they were a year ago
- Consumer hardware runs 7B models well
- Quality gap with cloud is narrowing
Good for:
- Privacy-sensitive applications
- High-volume cost optimization
- Offline scenarios
- Learning and experimentation
Not ideal for:
- Workloads that need the absolute best model quality
- Very limited hardware budgets
- Video/audio processing
Next up: The Future of AI — Where this is all heading