Everything you need to know about running LLMs locally—from hardware requirements to software tools to use cases for private, offline AI.
Why Run LLMs Locally?
While cloud APIs are convenient, there are compelling reasons to run AI models on your own hardware: privacy, cost control, offline capability, and customization freedom.
Benefits and Trade-offs
Why Local?
| Benefit | Description |
|---|---|
| Privacy | Data never leaves your machine |
| Cost | No per-token fees |
| Offline | Works without internet |
| Speed | No network latency |
| Control | Full customization |
| Learning | Understand how it works |
Trade-offs
| Consideration | Cloud | Local |
|---|---|---|
| Model quality | Access to the best models | Smaller but increasingly capable |
| Setup complexity | None | Some required |
| Hardware cost | None | Significant upfront |
| Ongoing cost | Per-use | Electricity only |
| Updates | Automatic | Manual |
Hardware Requirements
By Model Size
| Model Size | Minimum GPU | Recommended |
|---|---|---|
| 7B (Q4) | 6GB VRAM | 8GB VRAM |
| 13B (Q4) | 10GB VRAM | 16GB VRAM |
| 33B (Q4) | 20GB VRAM | 24GB VRAM |
| 70B (Q4) | 40GB VRAM | 48GB+ VRAM |
| 180B+ | 80GB+ | Multiple GPUs |
GPU Options
| GPU | VRAM | Models Supported | Approx. Price |
|---|---|---|---|
| RTX 3060 | 12GB | 7B-13B | $300 |
| RTX 3090 | 24GB | Up to 33B | $800 |
| RTX 4070 | 12GB | 7B-13B (fast) | $550 |
| RTX 4080 | 16GB | Up to 13B | $1000 |
| RTX 4090 | 24GB | Up to 33B (fast) | $1600 |
| A100 | 40/80GB | All sizes | $10K+ |
Apple Silicon
| Chip | Unified Memory | Models |
|---|---|---|
| M1 | 8-16GB | 7B |
| M1 Pro/Max | 16-64GB | 7B-33B |
| M2 Pro/Max | 16-96GB | 7B-70B |
| M3 Pro/Max | 18-128GB | 7B-70B |
| M4 Pro/Max | 24-128GB | 7B-70B+ |
CPU-Only (Slower)
For running models without a GPU:
- 32GB+ RAM recommended
- llama.cpp in CPU-only mode (see the sketch after this list)
- Expect 5-20 tokens/second
- Still usable for many tasks
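A minimal sketch of CPU-only inference, assuming the llama-cpp-python bindings are installed (pip install llama-cpp-python); the model path is a placeholder for any GGUF file you have downloaded:

```python
# CPU-only inference via the llama-cpp-python bindings.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3.1-8b-q4.gguf",  # placeholder: any local GGUF file
    n_gpu_layers=0,   # keep every layer on the CPU
    n_threads=8,      # set to your physical core count
    n_ctx=4096,       # context window size
)

out = llm("Explain what a GGUF file is in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```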
Software Tools
Ollama (Recommended Start)
```bash
# Install (the -fsSL flags follow redirects and fail cleanly on errors)
curl -fsSL https://ollama.ai/install.sh | sh

# Run a model
ollama run llama3.1

# Start the API server
ollama serve
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Hello, world!"
}'
```
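From Python, the same /api/generate endpoint can be called with the requests library; a minimal sketch, with streaming disabled so the reply arrives as one JSON object:

```python
# Minimal Python client for Ollama's /api/generate endpoint (default port 11434).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",
        "prompt": "Hello, world!",
        "stream": False,  # one JSON object instead of a stream of chunks
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```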
Pros: Easiest to use, just works
Cons: Fewer configuration options
llama.cpp
```bash
# Clone and build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_CUDA=1   # for NVIDIA GPUs

# Run a model as a local server
./llama-server -m models/llama-3.1-8b-q4.gguf

# Use the API
curl http://localhost:8080/completion -d '{"prompt": "Hello"}'
```
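The server's /completion endpoint can also be called from Python; a small sketch using the requests library, with n_predict capping the number of generated tokens:

```python
# Query a running llama-server instance (default port 8080).
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "Hello",
        "n_predict": 64,     # maximum tokens to generate
        "temperature": 0.7,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["content"])
```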
Pros: Maximum performance, highly optimized
Cons: More technical setup
LM Studio
- GUI application
- Download models easily
- One-click inference
- OpenAI-compatible API (example below)
- Windows, Mac, Linux
Best for: Non-technical users
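Because the server speaks the OpenAI API, the standard openai Python client works by pointing base_url at it; the port (1234) and model name below are assumptions, so copy the values LM Studio shows in its server panel:

```python
# Talk to LM Studio's local server through the openai client (pip install openai).
from openai import OpenAI

# Port and API key are placeholders; a local server usually ignores the key.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

reply = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # placeholder: use the identifier shown in LM Studio
    messages=[{"role": "user", "content": "Summarize why local LLMs are useful."}],
)
print(reply.choices[0].message.content)
```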
vLLM
```bash
pip install vllm

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct
```
Pros: Production-grade, high throughput
Cons: Requires more setup
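The throughput advantage is easiest to see with vLLM's offline batch interface, sketched below; it assumes a GPU with enough VRAM and that you have access to the Llama weights on Hugging Face:

```python
# Batch generation with vLLM's offline interface.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=64)

prompts = [
    "Explain quantization in one sentence.",
    "Name three uses for a local LLM.",
]
# vLLM batches and schedules these prompts together for higher throughput.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```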
Model Selection
Best Open Models for Local
| Model | Size | Quality | Use Case |
|---|---|---|---|
| Llama 3.1 8B | 8B | Excellent | General chat |
| Llama 3.1 70B | 70B | Best open | When you have hardware |
| Mistral 7B | 7B | Very good | Efficiency |
| Phi-3 Mini | 3.8B | Great for size | Mobile/edge |
| Gemma 2 9B | 9B | Excellent | Google alternative |
| Qwen 2.5 | Various | Multilingual | Multiple languages |
| CodeLlama | Various | Code focus | Programming |
| DeepSeek Coder | Various | Code | Coding tasks |
Quantization Levels
| Quantization | Size Reduction (vs. FP32) | Quality Loss |
|---|---|---|
| FP16 | 2x | None |
| Q8_0 | 4x | Minimal |
| Q6_K | 5x | Very small |
| Q5_K_M | 6x | Small |
| Q4_K_M | 8x | Noticeable |
| Q3_K_M | 10x | More noticeable |
| Q2_K | 12x | Significant |
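A useful back-of-the-envelope check before downloading: file size is roughly parameter count times bits per weight divided by eight, plus runtime overhead for the KV cache. The bits-per-weight figures in this sketch are approximations, not exact GGUF specifications:

```python
# Rough on-disk size estimate for quantized models (approximate bits per weight).
BITS_PER_WEIGHT = {
    "FP16": 16.0, "Q8_0": 8.5, "Q6_K": 6.6,
    "Q5_K_M": 5.7, "Q4_K_M": 4.8, "Q3_K_M": 3.9, "Q2_K": 3.4,
}

def approx_size_gb(params_billions: float, quant: str) -> float:
    """Approximate file size in GB, ignoring runtime overhead such as the KV cache."""
    return params_billions * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

# e.g. an 8B model at Q4_K_M lands around 4.8 GB on disk
print(f"{approx_size_gb(8, 'Q4_K_M'):.1f} GB")
```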
Where to Get Models
| Source | Description |
|---|---|
| Hugging Face | Primary source |
| Ollama Library | Pre-configured for Ollama |
| TheBloke | Quantized versions |
| LM Studio | Built-in downloads |
Use Cases
When Local Makes Sense
| Use Case | Why Local |
|---|---|
| Personal assistant | Privacy |
| Code completion | IDE integration |
| Document analysis | Sensitive data |
| Offline work | Internet not required |
| High volume | Cost savings |
| Experimentation | No rate limits |
When Cloud is Better
| Scenario | Why Cloud |
|---|---|
| Best quality needed | GPT-4, Claude 3 |
| Occasional use | Hardware not justified |
| Large context | Limited local memory |
| Multi-user | Server economics |
Performance Optimization
Tips
| Optimization | Impact |
|---|---|
| Use quantized models | Major memory savings |
| Batch requests | Higher throughput |
| Flash attention | Faster inference |
| GPU layer offloading | Better utilization |
| Temperature/top_p tuning | Quality vs speed |
Expected Performance
| Setup | 7B Q4 | 13B Q4 |
|---|---|---|
| RTX 3060 (12GB) | 40 t/s | 25 t/s |
| RTX 4090 (24GB) | 100 t/s | 70 t/s |
| M2 Pro (32GB) | 30 t/s | 20 t/s |
| M3 Max (128GB) | 50 t/s | 40 t/s |
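These figures are indicative; you can measure your own setup easily, since Ollama's non-streaming response reports eval_count (tokens generated) and eval_duration (in nanoseconds):

```python
# Measure generation speed on your own hardware via Ollama's API.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1", "prompt": "Write a haiku about GPUs.", "stream": False},
    timeout=600,
).json()

tokens = resp["eval_count"]            # tokens generated
seconds = resp["eval_duration"] / 1e9  # eval_duration is reported in nanoseconds
print(f"{tokens / seconds:.1f} tokens/sec")
```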
Getting Started
Quick Start Path
| Step | Action |
|---|---|
| 1 | Check your hardware (GPU, RAM) |
| 2 | Install Ollama |
| 3 | Run ollama run llama3.1 |
| 4 | Try different models |
| 5 | Integrate with your tools |
Integration Options
| Integration | Method |
|---|---|
| IDE | Continue.dev, Codeium |
| Terminal | Shell scripts, CLI |
| Applications | OpenAI-compatible API |
| Custom code | Python libraries (sketch below) |
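For the custom-code route, one option is the ollama Python package (pip install ollama), which wraps the local API; a minimal sketch:

```python
# Custom-code integration using the ollama Python package against a local server.
import ollama

response = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "Give me three uses for a local LLM."}],
)
print(response["message"]["content"])
```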
Future Trends
What's Coming
- More efficient models: Same quality, smaller size
- Better Apple Silicon: Optimized for M-series
- NPU acceleration: Windows AI PC support
- On-device fine-tuning: Personalization
- Hardware improvements: Consumer AI chips
"Local LLMs are becoming increasingly viable. With the right hardware and a quality open-source model, you can have a capable AI assistant that's completely private and costs nothing to run."