The complete guide to running large language models on your own hardware. Private, fast, offline — no API keys required.
Prompts, files, and conversations never leave your device. No third-party servers, no data collection, no terms of service to worry about. Ideal for sensitive financial reports, legal documents, and customer records.
Eliminate per-token API fees. After the initial hardware investment, every inference is free. For heavy users, the hardware can pay for itself within weeks compared with cloud API pricing.
Work without internet connectivity. On planes, in restricted networks, or in air-gapped environments. Your AI assistant is always available.
Fine-tune models with domain-specific data. Adjust parameters, quantization levels, and context windows. No vendor lock-in, no rate limits, no usage tracking.
No network round-trips. Local inference on good hardware delivers near-instant responses. Especially noticeable with smaller models on Apple Silicon or modern GPUs.
Keep all data processing within your own infrastructure. Self-hosted models simplify regulatory compliance for EU businesses and privacy-conscious organizations.
| Model | Parameters | Best For | Context | VRAM (Q4) | License |
|---|---|---|---|---|---|
| Qwen 3.5 (Alibaba) | 122B (10B active) | General, Coding | 128K | ~35 GB | Apache 2.0 |
| DeepSeek V3.2 (DeepSeek) | 671B (37B active) | Coding, Reasoning | 128K | ~350 GB | MIT / DeepSeek |
| Llama 4 Scout (Meta) | 109B (17B active) | General, Long Context | 10M | ~30 GB | Llama License |
| Mistral Large 3 (Mistral AI) | 675B (41B active) | Reasoning, Multilingual | 128K | ~350 GB | Apache 2.0 |
| GLM-5 (Z AI) | Undisclosed | Reasoning, Coding | 203K | Varies | Open License |
| Kimi K2.5 (Moonshot AI) | MoE architecture | Coding, Agent Tasks | 256K | Varies | Open License |
| Gemma 3 (Google) | 1B – 27B | On-Device, Multimodal | 128K | 1–16 GB | Gemma License |
| Phi-4 (Microsoft) | 14B | Reasoning, Efficiency | 16K | ~8 GB | MIT |
| MiMo-V2-Flash (Xiaomi) | MoE (smaller) | Coding, Speed | 128K | ~20 GB | Open License |
| Devstral-24B (Mistral) | 24B | Coding, Agent Tasks | 128K | ~14 GB | Apache 2.0 |
Google's lightweight series optimized for on-device usage. Multimodal capabilities from 4B upward. Great first model for experimentation. Runs comfortably on 8 GB VRAM.
Microsoft's small language model that punches far above its weight on reasoning benchmarks. Exceptional for its size. Ideal when you want quality from minimal hardware.
Alibaba's compact model often recommended as the default for Ollama setups. Good balance of performance and resource usage. Works well for chat, summarization, and light coding tasks.
Only 10B parameters active per token, so it runs on a MacBook with 64 GB RAM despite its total size. Currently one of the strongest open MoE models available for local deployment.
Mistral's dedicated coding model. Fits in 32 GB machines with quantization. A proven choice for OpenClaw agent setups and code-focused workflows.
A distilled version of DeepSeek's reasoning model. Specialized in chain-of-thought reasoning and step-by-step problem solving. Needs a 48+ GB VRAM setup for comfortable inference.
The model that shook the AI world. 671B total parameters with only 37B active per token thanks to MoE. Even quantized, it requires multiple high-end GPUs or 350+ GB of VRAM, so most users access it via API instead.
Europe's counterweight to US and Chinese frontier models. 41B active parameters, 128K context, supports 80+ languages. Apache 2.0 license. Requires enterprise-grade GPU infrastructure.
The largest AI model ever released for free. 1 trillion parameters with a 1 million token context window. Available on OpenRouter since March 2026. Not practical for local deployment on consumer hardware.
CLI-first tool that runs, manages, and serves models with minimal friction. Pull a model, run it, script it into a pipeline. Built on llama.cpp under the hood. Ollama's background daemon keeps models warm in VRAM between requests.
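Because it is CLI-first, Ollama slots directly into shell pipelines. A minimal sketch, assuming a model such as `qwen3:8b` has already been pulled (exact stdin handling can vary by Ollama version):

```shell
# Non-interactive use: piped stdin is appended to the prompt argument
cat notes.txt | ollama run qwen3:8b "Summarize the following notes:"
```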
Desktop app with a polished GUI for browsing, downloading, and chatting with models. Connects directly to Hugging Face for 100,000+ models. No terminal knowledge needed. Also offers an OpenAI-compatible local server and CLI.
The foundational C/C++ inference engine that Ollama and LM Studio are built on. Maximum performance, full control over quantization, memory allocation, and hardware optimization. Supports CPU, CUDA, Metal, Vulkan, and SYCL.
High-throughput inference server optimized for production workloads. Features PagedAttention for efficient memory management, continuous batching, and support for multiple concurrent users. The go-to for serving models at scale.
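A minimal sketch of serving a model with vLLM's OpenAI-compatible server; the model ID is illustrative, and any Hugging Face model your hardware can hold works the same way:

```shell
# Start an OpenAI-compatible server on port 8000 (model ID is an example)
pip install vllm
vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000

# Query it like any OpenAI endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "messages": [{"role": "user", "content": "hello"}]}'
```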
HuggingFace's production-ready inference server. Deep integration with the HuggingFace ecosystem, optimized for transformer models, and supports features like speculative decoding and quantization out of the box.
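One common way to launch TGI is via its Docker image; the model ID below is an example, and the shared-memory flag follows the project's documented usage:

```shell
# Serve a Hugging Face model with TGI on port 8080 (model ID is illustrative)
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id microsoft/phi-4
```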
Apple's machine learning framework optimized for their unified memory architecture. If you're on a Mac with Apple Silicon, MLX can leverage the full unified RAM as VRAM, giving you access to much larger models than discrete GPUs allow.
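In practice, most people use MLX for LLMs through the `mlx-lm` package, which ships a small CLI. A sketch, where the quantized model ID from the mlx-community hub is illustrative:

```shell
# Generate text on Apple Silicon via mlx-lm (model ID is an example)
pip install mlx-lm
mlx_lm.generate --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
  --prompt "Explain unified memory in one sentence"
```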
Quantization reduces model precision to shrink memory requirements. Q4_K_M is the most common default: it cuts memory by 75% versus full precision with minimal quality degradation. For a 70B model, FP16 needs ~140 GB while Q4 needs only ~35 GB. On top of the base model size, always add 10–20% for the KV cache, activations, and framework overhead.
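The arithmetic above can be sketched in a few lines of shell; the byte counts are rough rules of thumb, not exact allocator behavior:

```shell
# VRAM estimate: params x bytes/param, plus ~15% for KV cache and activations
awk 'BEGIN {
  params = 70e9                                  # 70B parameters
  printf "FP16: %.0f GB\n", params * 2.0 / 1e9   # 2 bytes per parameter
  printf "Q4 + 15%% overhead: %.1f GB\n", params * 0.5 * 1.15 / 1e9
}'
```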
Formerly Clawd Bot — an agent framework that orchestrates AI workflows. Connects to Ollama, LM Studio, or vLLM locally, or to cloud LLMs via OAuth. Integrates with Telegram, WhatsApp, Discord, and more. Over 160K GitHub stars.
Creates encrypted peer-to-peer connections between your devices using WireGuard. Install on your LLM server and your phone/laptop — instant private access. Tailscale Funnel can also expose services publicly.
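On the machine running your LLM server, the setup is a couple of commands; the `serve`/`funnel` syntax below matches recent Tailscale releases and has changed across versions:

```shell
tailscale up                 # join your tailnet
tailscale serve 11434        # HTTPS access to Ollama from your tailnet
tailscale funnel 11434       # optional: expose it publicly via Funnel
```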
Routes traffic through Cloudflare's edge network via a local daemon. No inbound ports needed. Integrates with Cloudflare's Zero Trust platform for authentication and access controls. Free for up to 50 users.
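For ad-hoc use, cloudflared's quick-tunnel mode needs no account configuration; it prints a random trycloudflare.com URL pointing at your local server:

```shell
# Quick tunnel to a local Ollama instance; URL is printed on startup
cloudflared tunnel --url http://localhost:11434
```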
The classic developer tunneling tool. One command to get a public HTTPS URL pointing at your local Ollama or LM Studio server. Built-in request inspection and replay for debugging. Ideal for quick testing.
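A one-liner sketch for Ollama; the Host-header rewrite is needed because Ollama rejects requests whose Host header doesn't look local:

```shell
# Public HTTPS URL for the local Ollama API
ngrok http 11434 --host-header="localhost:11434"
```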
Self-hosted ChatGPT-style interface that connects to Ollama or any OpenAI-compatible API. Multi-user support, conversation history, RAG integration, and model switching. Access from any browser on your network.
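A common way to run it, assuming Docker and an Ollama instance on the host machine (the host-gateway mapping lets the container reach host-side Ollama):

```shell
# Open WebUI on http://localhost:3000, talking to Ollama on the host
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui ghcr.io/open-webui/open-webui:main
```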
Fully self-hosted reverse proxy solutions. FRP has 100K+ GitHub stars and supports HTTP, TCP, and UDP. Pangolin adds a dashboard UI, identity management, and automatic SSL certificates. Zero vendor dependency.
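A minimal frp client config sketch, forwarding local Ollama through a VPS you control; recent frp versions (0.52+) use TOML, and the addresses and ports here are illustrative:

```toml
# frpc.toml: forward local Ollama to port 21434 on your own VPS
serverAddr = "your-vps.example.com"
serverPort = 7000

[[proxies]]
name = "ollama"
type = "tcp"
localIP = "127.0.0.1"
localPort = 11434
remotePort = 21434
```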
Never expose your Ollama or LM Studio API without an auth layer. Use API keys, reverse proxy auth, or Tailscale's built-in ACLs. A bare API on a public IP will be found by scanners within hours.
Use HTTPS everywhere. Tunneling solutions like Tailscale and Cloudflare handle this automatically. For manual setups, use Let's Encrypt certificates with a reverse proxy like Caddy or Nginx.
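Combining the two points above, a Caddyfile sketch that fronts Ollama with automatic Let's Encrypt TLS plus basic auth; the domain and password hash are placeholders (generate a real hash with `caddy hash-password`), and recent Caddy versions spell the directive `basic_auth`:

```
llm.example.com {
    basic_auth {
        admin $2a$14$...
    }
    reverse_proxy 127.0.0.1:11434
}
```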
Prompt injection is easier against open-weight models because attackers can study the model directly. Implement input validation, length limits, and content filtering before prompts reach your model.
Even for personal use, add rate limits. A runaway script or compromised client can saturate your GPU with requests. Use Nginx rate limiting or Cloudflare's built-in controls.
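An Nginx sketch limiting each client IP to 5 requests per second with a small burst allowance; the zone name and rates are illustrative:

```nginx
# Goes in the http block: one shared zone keyed by client IP
limit_req_zone $binary_remote_addr zone=llm:10m rate=5r/s;

server {
    listen 443 ssl;
    location / {
        limit_req zone=llm burst=10 nodelay;
        proxy_pass http://127.0.0.1:11434;
    }
}
```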
Keep logs of API requests, token usage, and unusual patterns. Tools like Grafana and Prometheus can track inference load, memory usage, and detect anomalies in access patterns.
Aggressively quantized or very small models are more susceptible to prompt injection and jailbreaking. For anything exposed to the internet, prefer 24B+ models that can better follow safety instructions.
Configure Ollama and LM Studio to listen on 127.0.0.1 only. Use a reverse proxy or tunnel for remote access rather than binding directly to 0.0.0.0. This is your first line of defense.
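Ollama reads its bind address from the `OLLAMA_HOST` environment variable; the loopback value below is its default, stated explicitly. LM Studio's equivalent toggle lives in its server settings UI.

```shell
# Keep the Ollama API on the loopback interface only (also the default)
export OLLAMA_HOST=127.0.0.1:11434
```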
Run your inference server in Docker containers. This limits the blast radius if something goes wrong and makes it easier to manage dependencies, updates, and resource allocation.
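A sketch using the official Ollama image, with the API published only on the loopback interface so it stays behind your proxy or tunnel:

```shell
# Containerized Ollama; models persist in the named volume
docker run -d --name ollama \
  -p 127.0.0.1:11434:11434 \
  -v ollama:/root/.ollama \
  ollama/ollama
```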
curl -fsSL https://ollama.com/install.sh | sh
ollama run qwen3:8b
>>> What is quantum computing?
curl http://localhost:11434/api/chat -d '{"model":"qwen3:8b","messages":[{"role":"user","content":"hello"}]}'
curl http://localhost:1234/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"qwen3-8b","messages":[{"role":"user","content":"hello"}]}'
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build && cmake --build build --config Release
./build/bin/llama-server -m model.gguf --port 8080 -ngl 99