Updated March 2026

Run AI.
Locally.

The complete guide to running large language models on your own hardware. Private, fast, offline — no API keys required.

Your hardware. Your data. Your rules.
Open-source LLMs have closed the gap with proprietary models. In 2026, running locally isn't a compromise — it's a practical choice many developers rely on daily.
🔒

Complete Privacy

Prompts, files, and conversations never leave your device. No third-party servers, no data collection, no terms of service to worry about. Ideal for sensitive financial reports, legal documents, and customer records.

💰

Zero Recurring Costs

Eliminate per-token API fees. After the initial hardware investment, every inference is free. Heavy users can see payback within weeks compared with cloud API pricing.

✈️

Offline Operation

Work without internet connectivity. On planes, in restricted networks, or in air-gapped environments. Your AI assistant is always available.

🔧

Full Customization

Fine-tune models with domain-specific data. Adjust parameters, quantization levels, and context windows. No vendor lock-in, no rate limits, no usage tracking.

⚡

Low Latency

No network round-trips. Local inference on good hardware delivers near-instant responses. Especially noticeable with smaller models on Apple Silicon or modern GPUs.

📜

GDPR Compliance

Keep all data processing within your own infrastructure. Self-hosted models can simplify regulatory compliance for EU businesses and privacy-conscious organizations.

Which LLM should you run?
There is no single "best" model — the right choice depends on your task, your hardware, and your privacy needs. Here are the top open-weight models as of March 2026, organized by what they do best.
| Model | Maker | Parameters | Best For | Context | VRAM (Q4) | License |
|---|---|---|---|---|---|---|
| Qwen 3.5 | Alibaba | 122B (10B active) | General, Coding | 128K | ~35 GB | Apache 2.0 |
| DeepSeek V3.2 | DeepSeek | 671B (37B active) | Coding, Reasoning | 128K | ~350 GB | MIT / DeepSeek |
| Llama 4 Scout | Meta | 109B (17B active) | General, Long Context | 10M | ~30 GB | Llama License |
| Mistral Large 3 | Mistral AI | 675B (41B active) | Reasoning, Multilingual | 128K | ~350 GB | Apache 2.0 |
| GLM-5 Reasoning | Z AI | Undisclosed | Reasoning, Coding | 203K | Varies | Open License |
| Kimi K2.5 | Moonshot AI | MoE architecture | Coding, Agent Tasks | 256K | Varies | Open License |
| Gemma 3 | Google | 1B – 27B | On-Device, Multimodal | 128K | 1–16 GB | Gemma License |
| Phi-4 | Microsoft | 14B | Reasoning, Efficiency | 16K | ~8 GB | MIT |
| MiMo-V2-Flash | Xiaomi | MoE (smaller) | Coding, Speed | 128K | ~20 GB | Open License |
| Devstral-24B | Mistral | 24B | Coding, Agent Tasks | 128K | ~14 GB | Apache 2.0 |

Gemma 3 — 12B

Google's lightweight series optimized for on-device usage. Multimodal capabilities from 4B upward. Great first model for experimentation. Runs comfortably on 8 GB VRAM.

General · Multimodal · Efficient

Phi-4 — 14B

Microsoft's small language model that punches far above its weight on reasoning benchmarks. Exceptional for its size. Ideal when you want quality from minimal hardware.

Reasoning · Compact

Qwen 3 — 8B

Alibaba's compact model often recommended as the default for Ollama setups. Good balance of performance and resource usage. Works well for chat, summarization, and light coding tasks.

Coding · Chat

Qwen 3.5 — 122B MoE

Only 10B parameters are active per token, so inference stays fast despite the total size — and at 4-bit quantization the full model still fits in a MacBook with 64 GB of unified memory. Currently one of the strongest open MoE models available for local deployment.

General · Coding · Reasoning

Devstral-24B

Mistral's dedicated coding model. Fits in 32 GB machines with quantization. A proven choice for OpenClaw agent setups and code-focused workflows.

Coding · Agents

DeepSeek R1 Distill 70B

A distilled version of DeepSeek's reasoning model. Specialized in chain-of-thought reasoning and step-by-step problem solving. Needs a 48+ GB VRAM setup for comfortable inference.

Reasoning · Math

DeepSeek V3.2 — 671B

The model that shook the AI world. 671B total parameters with only 37B active per token thanks to MoE. Requires multiple high-end GPUs or 350+ GB of VRAM even when quantized; most users access it via API instead.

Coding · Reasoning · Heavy

Mistral Large 3 — 675B

Europe's counterweight to US and Chinese frontier models. 41B active parameters, 128K context, supports 80+ languages. Apache 2.0 license. Requires enterprise-grade GPU infrastructure.

Reasoning · Multilingual · Heavy

Hunter Alpha — 1T

The largest AI model ever released for free. 1 trillion parameters with a 1 million token context window. Available on OpenRouter since March 2026. Not practical for local deployment on consumer hardware.

Frontier · Cloud Only
The tools that make it work.
From one-command installers to bare-metal inference engines, these are the frameworks powering local LLM deployment in 2026.
🦙

Ollama

The Developer's Default

CLI-first tool that runs, manages, and serves models with minimal friction. Pull a model, run it, script it into a pipeline. Built on llama.cpp under the hood. Ollama's background daemon keeps models warm in VRAM between requests.

Ease of Use
★★★★
Customization
★★★
Performance
★★★★
Ecosystem
★★★★★
# Install and run in 2 commands
curl -fsSL https://ollama.com/install.sh | sh
ollama run qwen3:8b
🖥️

LM Studio

The Visual Approach

Desktop app with a polished GUI for browsing, downloading, and chatting with models. Connects directly to Hugging Face for 100,000+ models. No terminal knowledge needed. Also offers an OpenAI-compatible local server and CLI.

Ease of Use
★★★★★
Customization
★★½
Performance
★★★★
Ecosystem
★★★½
# Also has CLI support
lms load qwen3:8b
lms chat
⚙️

llama.cpp

The Engine Under Everything

The foundational C/C++ inference engine that Ollama and LM Studio are built on. Maximum performance, full control over quantization, memory allocation, and hardware optimization. Supports CPU, CUDA, Metal, Vulkan, and SYCL.

Ease of Use
★★
Customization
★★★★★
Performance
★★★★★
Ecosystem
★★★
# Direct model download & serve
llama-server -m model.gguf
llama-cli -m model.gguf -p "Hello"
🚀

vLLM

Production-Grade Serving

High-throughput inference server optimized for production workloads. Features PagedAttention for efficient memory management, continuous batching, and support for multiple concurrent users. The go-to for serving models at scale.

Ease of Use
★★½
Customization
★★★★
Performance
★★★★★
Ecosystem
★★★★
vllm serve Qwen/Qwen3-8B \
  --port 8090 \
  --gpu-memory-utilization 0.95
🤗

HuggingFace TGI

Text Generation Inference

HuggingFace's production-ready inference server. Deep integration with the HuggingFace ecosystem, optimized for transformer models, and supports features like speculative decoding and quantization out of the box.

Ease of Use
★★★
Customization
★★★★
Performance
★★★★½
Ecosystem
★★★★
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-4-Scout
🍎

MLX

Apple Silicon Native

Apple's machine learning framework optimized for their unified memory architecture. If you're on a Mac with Apple Silicon, MLX can leverage the full unified RAM as VRAM, giving you access to much larger models than discrete GPUs allow.

Ease of Use
★★★
Customization
★★★½
Performance
★★★★
Ecosystem
★★½
pip install mlx-lm
mlx_lm.generate --model mlx-community/Qwen3-8B --prompt "Hello"
What do you actually need?
The key bottleneck is VRAM (or unified memory on Macs). A rule of thumb: at 4-bit quantization, each billion parameters needs about 0.5 GB of memory. Context windows add more on top of that.
🪶

Starter

~$0 – $500
  • 8 GB RAM / VRAM minimum
  • Models: 7–8B parameters (Qwen 3 8B, Gemma 3 4B, Phi-4)
  • Works on integrated GPU or CPU-only
  • Expect 5–15 tokens/sec on CPU
  • Good for chat, summarization, light coding
  • Example: Any modern laptop, Raspberry Pi 5 (limited)
🔥

Enthusiast

~$2,000 – $8,000
  • 48–128 GB unified / multi-GPU VRAM
  • Models: 70B–122B (Qwen 3.5, DeepSeek R1 70B)
  • RTX 5090 / dual 3090s / Mac Studio (64–128 GB)
  • Run frontier-class MoE models locally
  • Dual model setups possible (e.g., coder + reasoner)
  • "Zero cloud" configuration for all tasks
🏢

Enterprise

$8,000+
  • 192 GB+ unified / multi-GPU clusters
  • Models: 671B+ (DeepSeek V3.2, Mistral Large 3)
  • Mac Studio 192–512 GB / H100/H200 multi-GPU rigs
  • Multi-agent enterprise setups
  • Cloud-grade quality without the cloud
  • AMD MI300X (192 GB) for single-card large models

Understanding Quantization

| Level | Description | Memory |
|---|---|---|
| 📦 FP16 | Full precision | 2 bytes/param |
| 📉 Q8 | 8-bit quantized | 1 byte/param |
| Q4_K_M | Default in Ollama | 0.5 bytes/param |
| ⚠️ Q2 / Q3 | Aggressive | Noticeable quality loss |

Quantization reduces model precision to shrink memory requirements. Q4_K_M is the most common default — it cuts memory by 75% versus full precision with minimal quality loss. For a 70B model, FP16 needs ~140 GB while Q4 needs only ~35 GB. Add 10–20% on top of the base model size for the KV cache, activations, and framework overhead.
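This arithmetic is easy to script. A minimal sketch using the rules of thumb above (0.5 bytes/param at Q4, 10–20% runtime overhead — exact figures vary by quantization variant and framework):

```python
def vram_estimate_gb(params_billion, bytes_per_param=0.5, overhead=0.15):
    """Rough VRAM needed for model weights plus runtime overhead.

    bytes_per_param: 2.0 for FP16, 1.0 for Q8, 0.5 for Q4 (rule of thumb).
    overhead: extra fraction for KV cache, activations, and framework.
    """
    base = params_billion * bytes_per_param  # 1B params at 1 byte ≈ 1 GB
    return base * (1 + overhead)

# 70B at Q4: ~35 GB of weights, ~40 GB with 15% overhead
print(round(vram_estimate_gb(70), 1))        # 40.2
# Same model at FP16: ~140 GB of weights, ~161 GB total
print(round(vram_estimate_gb(70, 2.0), 1))   # 161.0
```

Plug in your GPU's VRAM to see which parameter counts are realistic before downloading anything.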

Make your local LLM available anywhere.
Running models locally doesn't mean they have to stay on one machine. From mesh VPNs to reverse proxies, here's how to securely expose your local inference to the world — or just to your phone.
Agent Framework

OpenClaw

Formerly Clawd Bot — an agent framework that orchestrates AI workflows. Connects to Ollama, LM Studio, or vLLM locally, or to cloud LLMs via OAuth. Integrates with Telegram, WhatsApp, Discord, and more. Over 160K GitHub stars.

Strengths
Full agent capabilities
Multi-channel messaging
Custom Skills system
Tradeoffs
Needs 32K+ context
Small models struggle
Complex configuration
Mesh VPN

Tailscale

Creates encrypted peer-to-peer connections between your devices using WireGuard. Install on your LLM server and your phone/laptop — instant private access. Tailscale Funnel can also expose services publicly.

Strengths
End-to-end encryption
NAT traversal built-in
Free for personal use
Tradeoffs
Control server is closed
Per-user pricing at scale
Funnel still in beta
Reverse Tunnel

Cloudflare Tunnel

Routes traffic through Cloudflare's edge network via a local daemon. No inbound ports needed. Integrates with Cloudflare's Zero Trust platform for authentication and access controls. Free for up to 50 users.

Strengths
Free tier is generous
DDoS protection included
Custom domains
Tradeoffs
Cloudflare dependency
More initial setup
HTTP-focused
Dev Tunnel

ngrok

The classic developer tunneling tool. One command to get a public HTTPS URL pointing at your local Ollama or LM Studio server. Built-in request inspection and replay for debugging. Ideal for quick testing.

Strengths
Fastest to set up
HTTP inspector built-in
Great for webhooks
Tradeoffs
Free tier is limited
Closed source
No UDP support
Web Interface

Open-WebUI

Self-hosted ChatGPT-style interface that connects to Ollama or any OpenAI-compatible API. Multi-user support, conversation history, RAG integration, and model switching. Access from any browser on your network.

Strengths
Beautiful chat interface
Multi-user accounts
RAG & file upload
Tradeoffs
Extra service to maintain
Needs tunnel for WAN
Resource overhead
Self-Hosted Tunnel

FRP / Pangolin

Fully self-hosted reverse proxy solutions. FRP has 100K+ GitHub stars and supports HTTP, TCP, and UDP. Pangolin adds a dashboard UI, identity management, and automatic SSL certificates. Zero vendor dependency.

Strengths
Complete ownership
No vendor lock-in
Full protocol support
Tradeoffs
Needs a public VPS
Manual TLS setup
Maintenance burden

Architecture Overview

📱
Client
Phone / Laptop
🌐
Tunnel
Tailscale / CF / ngrok
🖥️
Web UI
Open-WebUI / OpenClaw
🦙
Ollama / vLLM
Inference Server
🧠
LLM
Qwen / Llama / etc.
Keep your local setup safe.
Open-weight models mean attackers can see the weights and probe vulnerabilities without rate limits. Running locally is inherently more private, but exposing your setup remotely requires careful hardening.
🔑

Always Use Authentication

Never expose your Ollama or LM Studio API without an auth layer. Use API keys, reverse proxy auth, or Tailscale's built-in ACLs. A bare API on a public IP will be found by scanners within hours.

🔒

Encrypt All Traffic

Use HTTPS everywhere. Tunneling solutions like Tailscale and Cloudflare handle this automatically. For manual setups, use Let's Encrypt certificates with a reverse proxy like Caddy or Nginx.

🛡️

Validate Inputs

Prompt injection is easier against open-weight models because attackers can study the model directly. Implement input validation, length limits, and content filtering before prompts reach your model.

📊

Rate Limiting

Even for personal use, add rate limits. A runaway script or compromised client can saturate your GPU with requests. Use Nginx rate limiting or Cloudflare's built-in controls.
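Nginx or Cloudflare is the right place to enforce this, but the underlying idea is simple. A token-bucket limiter in miniature (illustrative sketch, not a hardened implementation — in real use you would pass `time.monotonic()` as the clock):

```python
class TokenBucket:
    """Allow bursts up to `capacity` requests, refilling `rate` per second."""
    def __init__(self, rate, capacity, now=0.0):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, now

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=2, capacity=5)        # 2 req/s, bursts of 5
print([bucket.allow(0.0) for _ in range(6)])    # five allowed, sixth rejected
print(bucket.allow(0.5))                        # True: 0.5 s refills one token
```

Nginx's `limit_req` module implements the same leaky/token-bucket idea at the proxy layer, which is where it belongs for a GPU-backed service.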

🔍

Monitor & Log

Keep logs of API requests, token usage, and unusual patterns. Tools like Grafana and Prometheus can track inference load, memory usage, and detect anomalies in access patterns.

🧱

Use Larger Models for Sensitive Tasks

Aggressively quantized or very small models are more susceptible to prompt injection and jailbreaking. For anything exposed to the internet, prefer 24B+ models that can better follow safety instructions.

🔄

Bind to Localhost by Default

Configure Ollama and LM Studio to listen on 127.0.0.1 only. Use a reverse proxy or tunnel for remote access rather than binding directly to 0.0.0.0. This is your first line of defense.
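To sanity-check a binding, probe the port. A standard-library sketch — the throwaway server below stands in for your real service (e.g. Ollama's default 127.0.0.1:11434); a service bound to loopback this way is invisible to the rest of the network:

```python
import socket

def is_listening(host, port, timeout=1.0):
    """True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demo: a throwaway server bound to loopback only.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))        # port 0 = pick any free port
srv.listen(1)
port = srv.getsockname()[1]
print(is_listening("127.0.0.1", port))   # True on loopback
srv.close()
```

Run the probe against your machine's LAN IP as well — if that also returns True for your inference port, you are bound to 0.0.0.0 and should tighten the config.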

📦

Isolate with Containers

Run your inference server in Docker containers. This limits the blast radius if something goes wrong and makes it easier to manage dependencies, updates, and resource allocation.

Get running in 5 minutes.
Three paths depending on your comfort level. Each gets you from zero to chatting with a local LLM as fast as possible.

🟢 Beginner — Ollama

1
Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
2
Pull and run a model
ollama run qwen3:8b
3
Start chatting! That's it.
>>> What is quantum computing?
4
Use the API (optional)
curl http://localhost:11434/api/chat -d '{"model":"qwen3:8b","messages":[{"role":"user","content":"hello"}]}'
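The same request from Python, standard library only. The endpoint and payload shape follow Ollama's /api/chat; `"stream": False` asks for a single JSON object instead of streamed chunks:

```python
import json
from urllib import request

def build_payload(prompt, model="qwen3:8b"):
    """Request body for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,   # one JSON response instead of streamed chunks
    }

def ollama_chat(prompt, model="qwen3:8b", host="http://localhost:11434"):
    """Send one chat turn to a local Ollama server, return the reply text."""
    req = request.Request(
        f"{host}/api/chat",
        data=json.dumps(build_payload(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

# Needs a running server: print(ollama_chat("hello"))
```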

🔵 Visual — LM Studio

1
Download LM Studio from lmstudio.ai
2
Search for a model (e.g., "qwen3 8b")
3
Click Download, then open the Chat tab
4
Enable Local Server for API access
http://localhost:1234/v1/chat/completions

🟠 Advanced — llama.cpp

1
Clone and build
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp && cmake -B build && cmake --build build -j
2
Download a GGUF model from Hugging Face
3
Run the server
./build/bin/llama-server -m model.gguf --port 8080 -ngl 99
4
Open the built-in web UI at localhost:8080
Key terms explained.
Quick reference for the terminology you'll encounter when working with local LLMs.
GGUF
The standard file format for quantized models used by llama.cpp, Ollama, and LM Studio. Replaced the older GGML format. Contains model weights, tokenizer, and metadata in a single file.
Quantization
Reducing the numerical precision of model weights (e.g., from 16-bit to 4-bit) to shrink memory usage. Q4_K_M is the most popular balance of quality and size. Lower bits = smaller but less accurate.
MoE (Mixture of Experts)
Architecture where only a subset of parameters are "active" per token. A 671B MoE model with 37B active parameters runs more like a 37B dense model in practice, but has the knowledge capacity of the full 671B.
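The routing idea can be sketched in a few lines: a gate scores every expert per token, only the top-k actually run, and their outputs are blended by the renormalized gate weights. (Toy illustration with made-up scalar "experts"; real routers are learned layers operating on vectors.)

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def moe_forward(token, experts, gate_scores, k=2):
    """Run only the top-k experts and blend their outputs by gate weight."""
    top = sorted(range(len(experts)), key=lambda i: gate_scores[i])[-k:]
    weights = softmax([gate_scores[i] for i in top])  # renormalize over top-k
    return sum(w * experts[i](token) for i, w in zip(top, weights))

# Toy "experts": each is a simple scalar function of the token value.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x ** 2, lambda x: -x]
scores = [0.1, 3.0, 2.5, 0.2]       # the router favours experts 1 and 2
out = moe_forward(3.0, experts, scores, k=2)
print(round(out, 3))  # ≈ 7.133, a weighted blend of 2*3=6 and 3**2=9
```

The other experts contribute nothing for this token, which is why compute scales with active parameters while memory still holds all of them.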
VRAM
Video RAM on a GPU. The primary bottleneck for local LLM inference. Model weights must fit in VRAM for GPU-accelerated inference. On Macs, "unified memory" serves as both RAM and VRAM.
Context Window
The maximum number of tokens a model can process at once, including both input and output. Larger context windows need more memory via the KV cache. A 128K context window can handle ~100 pages of text.
KV Cache
Key-Value cache stores computed attention states during inference. Grows linearly with context length and adds to VRAM usage beyond the model weights themselves. Often the hidden memory cost people forget.
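The cache size follows directly from the architecture. A rough estimator — the example numbers are illustrative of an 8B-class model with grouped-query attention (which keeps the KV-head count well below the attention-head count):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_len, bytes_per_value=2):
    """Memory for the KV cache across all layers at a given context length."""
    # Factor of 2: one tensor for Keys, one for Values.
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_value

# Illustrative config: 32 layers, 8 KV heads, head_dim 128, FP16 cache.
gib = kv_cache_bytes(32, 8, 128, 8192) / 2**30
print(gib)   # 1.0 -> a full 8K context costs ~1 GiB beyond the weights
```

Scale `context_len` to 128K and the same model's cache alone reaches ~16 GiB — which is why long-context runs blow past weight-only VRAM estimates.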
Tokens/sec (t/s)
The speed at which a model generates text. ~30 t/s feels conversational. Above 50 t/s feels instant. Below 10 t/s feels slow. Speed depends on model size, hardware, and quantization level.
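If you benchmark with Ollama, its non-streamed responses report `eval_count` (generated tokens) and `eval_duration` (nanoseconds), so t/s falls out directly. (Field names per Ollama's API; verify against your version.)

```python
def tokens_per_sec(eval_count, eval_duration_ns):
    """Generation speed from Ollama's reported counters."""
    return eval_count / (eval_duration_ns / 1e9)

# 300 tokens generated in 10 seconds of eval time:
print(tokens_per_sec(300, 10_000_000_000))  # 30.0 -> comfortably conversational
```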
OpenAI-Compatible API
A REST API that follows OpenAI's /v1/chat/completions format. Both Ollama and LM Studio expose this, so you can swap your local model into any app built for the OpenAI API by changing the base URL.
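Swapping a local model into OpenAI-style code mostly means changing the base URL. A standard-library sketch (port 11434 is Ollama's default; LM Studio serves on 1234 — adjust to your setup):

```python
import json
from urllib import request

def endpoint(base_url):
    """OpenAI-style chat endpoint for a given base URL."""
    return f"{base_url}/chat/completions"

def chat(prompt, base_url="http://localhost:11434/v1", model="qwen3:8b"):
    """Call an OpenAI-compatible chat endpoint on a local server."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = request.Request(
        endpoint(base_url), data=body,
        headers={"Content-Type": "application/json",
                 # Local servers usually ignore the key, but clients send one.
                 "Authorization": "Bearer local"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Needs a running server: print(chat("hello"))
```

Any app built against the OpenAI client libraries can usually be redirected the same way by setting its base URL to the local server.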
RAG (Retrieval-Augmented Generation)
A pattern where the LLM retrieves relevant documents before generating a response. Lets you "chat with your files" without fine-tuning the model. Open-WebUI and many frameworks support this natively.
GPU Offloading (-ngl)
The number of model layers loaded onto the GPU. More layers on GPU = faster inference. If a model doesn't fully fit in VRAM, partial offloading runs some layers on CPU (slower but functional).