The complete guide to running large language models on your own hardware. Private, fast, offline — no API keys required.
Prompts, files, and conversations never leave your device. No third-party servers, no data collection, no terms of service to worry about. Ideal for sensitive financial reports, legal documents, and customer records.
Eliminate per-token API fees. After the initial hardware investment, every inference is free. For heavy users, the hardware can pay for itself within weeks compared with cloud API pricing.
Work without internet connectivity. On planes, in restricted networks, or in air-gapped environments. Your AI assistant is always available.
Fine-tune models with domain-specific data. Adjust parameters, quantization levels, and context windows. No vendor lock-in, no rate limits, no usage tracking.
No network round-trips. Local inference on good hardware delivers near-instant responses. Especially noticeable with smaller models on Apple Silicon or modern GPUs.
Keep all data processing within your own infrastructure. Self-hosted models simplify regulatory compliance for EU businesses and privacy-conscious organizations.
| Model | Parameters | Best For | Context | VRAM (Q4) | License |
|---|---|---|---|---|---|
| Qwen 3.5 (Alibaba) | 122B (10B active) | General, Coding | 128K | ~35 GB | Apache 2.0 |
| DeepSeek V3.2 (DeepSeek) | 671B (37B active) | Coding, Reasoning | 128K | ~350 GB | MIT / DeepSeek |
| Llama 4 Scout (Meta) | 109B (17B active) | General, Long Context | 10M | ~30 GB | Llama License |
| Mistral Large 3 (Mistral AI) | 675B (41B active) | Reasoning, Multilingual | 128K | ~350 GB | Apache 2.0 |
| GLM-5 (Z AI) | Undisclosed | Reasoning, Coding | 203K | Varies | Open License |
| Kimi K2.5 (Moonshot AI) | MoE architecture | Coding, Agent Tasks | 256K | Varies | Open License |
| Gemma 3 (Google) | 1B – 27B | On-Device, Multimodal | 128K | 1–16 GB | Gemma License |
| Phi-4 (Microsoft) | 14B | Reasoning, Efficiency | 16K | ~8 GB | MIT |
| MiMo-V2-Flash (Xiaomi) | MoE (smaller) | Coding, Speed | 128K | ~20 GB | Open License |
| Devstral-24B (Mistral) | 24B | Coding, Agent Tasks | 128K | ~14 GB | Apache 2.0 |
Google's lightweight series optimized for on-device usage. Multimodal capabilities from 4B upward. Great first model for experimentation. Runs comfortably on 8 GB VRAM.
Microsoft's small language model that punches far above its weight on reasoning benchmarks. Exceptional for its size. Ideal when you want quality from minimal hardware.
Alibaba's compact model often recommended as the default for Ollama setups. Good balance of performance and resource usage. Works well for chat, summarization, and light coding tasks.
Only 10B parameters active per token, so it runs on a MacBook with 64 GB RAM despite its total size. Currently one of the strongest open MoE models available for local deployment.
Mistral's dedicated coding model. Fits in 32 GB machines with quantization. A proven choice for OpenClaw agent setups and code-focused workflows.
A distilled version of DeepSeek's reasoning model. Specialized in chain-of-thought reasoning and step-by-step problem solving. Needs a 48+ GB VRAM setup for comfortable inference.
The model that shook the AI world. 671B total parameters with only 37B active per token thanks to MoE. Even quantized, it requires multiple high-end GPUs or 350+ GB of VRAM, so most users access it via API instead.
Europe's counterweight to US and Chinese frontier models. 41B active parameters, 128K context, supports 80+ languages. Apache 2.0 license. Requires enterprise-grade GPU infrastructure.
The largest AI model ever released for free. 1 trillion parameters with a 1 million token context window. Available on OpenRouter since March 2026. Not practical for local deployment on consumer hardware.
CLI-first tool that runs, manages, and serves models with minimal friction. Pull a model, run it, script it into a pipeline. Built on llama.cpp under the hood. Ollama's background daemon keeps models warm in VRAM between requests.
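Because it is CLI-first, Ollama slots directly into shell pipelines. A minimal sketch, assuming a model such as `qwen3:8b` has already been pulled (exact stdin handling can vary by Ollama version):

```shell
# Non-interactive use: piped stdin is appended to the prompt argument
cat notes.txt | ollama run qwen3:8b "Summarize the following notes:"
```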
Desktop app with a polished GUI for browsing, downloading, and chatting with models. Connects directly to Hugging Face for 100,000+ models. No terminal knowledge needed. Also offers an OpenAI-compatible local server and CLI.
The foundational C/C++ inference engine that Ollama and LM Studio are built on. Maximum performance, full control over quantization, memory allocation, and hardware optimization. Supports CPU, CUDA, Metal, Vulkan, and SYCL.
High-throughput inference server optimized for production workloads. Features PagedAttention for efficient memory management, continuous batching, and support for multiple concurrent users. The go-to for serving models at scale.
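A minimal sketch of serving a model with vLLM's OpenAI-compatible server; the model ID is illustrative, and any Hugging Face model your hardware can hold works the same way:

```shell
# Start an OpenAI-compatible server on port 8000 (model ID is an example)
pip install vllm
vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000

# Query it like any OpenAI endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "messages": [{"role": "user", "content": "hello"}]}'
```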
HuggingFace's production-ready inference server. Deep integration with the HuggingFace ecosystem, optimized for transformer models, and supports features like speculative decoding and quantization out of the box.
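One common way to launch TGI is via its Docker image; the model ID below is an example, and the shared-memory flag follows the project's documented usage:

```shell
# Serve a Hugging Face model with TGI on port 8080 (model ID is illustrative)
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id microsoft/phi-4
```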
Apple's machine learning framework optimized for their unified memory architecture. If you're on a Mac with Apple Silicon, MLX can leverage the full unified RAM as VRAM, giving you access to much larger models than discrete GPUs allow.
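In practice, most people use MLX for LLMs through the `mlx-lm` package, which ships a small CLI. A sketch, where the quantized model ID from the mlx-community hub is illustrative:

```shell
# Generate text on Apple Silicon via mlx-lm (model ID is an example)
pip install mlx-lm
mlx_lm.generate --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
  --prompt "Explain unified memory in one sentence"
```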
Quantization reduces model precision to shrink memory requirements. Q4_K_M is the most common default: it cuts memory by 75% versus full precision with minimal quality degradation. For a 70B model, FP16 needs ~140 GB while Q4 needs only ~35 GB. On top of the base model size, always add 10–20% for the KV cache, activations, and framework overhead.
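The arithmetic above can be sketched in a few lines of shell; the byte counts are rough rules of thumb, not exact allocator behavior:

```shell
# VRAM estimate: params x bytes/param, plus ~15% for KV cache and activations
awk 'BEGIN {
  params = 70e9                                  # 70B parameters
  printf "FP16: %.0f GB\n", params * 2.0 / 1e9   # 2 bytes per parameter
  printf "Q4 + 15%% overhead: %.1f GB\n", params * 0.5 * 1.15 / 1e9
}'
```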
Formerly Clawd Bot — an agent framework that orchestrates AI workflows. Connects to Ollama, LM Studio, or vLLM locally, or to cloud LLMs via OAuth. Integrates with Telegram, WhatsApp, Discord, and more. Over 160K GitHub stars.
Creates encrypted peer-to-peer connections between your devices using WireGuard. Install on your LLM server and your phone/laptop — instant private access. Tailscale Funnel can also expose services publicly.
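On the machine running your LLM server, the setup is a couple of commands; the `serve`/`funnel` syntax below matches recent Tailscale releases and has changed across versions:

```shell
tailscale up                 # join your tailnet
tailscale serve 11434        # HTTPS access to Ollama from your tailnet
tailscale funnel 11434       # optional: expose it publicly via Funnel
```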
Routes traffic through Cloudflare's edge network via a local daemon. No inbound ports needed. Integrates with Cloudflare's Zero Trust platform for authentication and access controls. Free for up to 50 users.
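For ad-hoc use, cloudflared's quick-tunnel mode needs no account configuration; it prints a random trycloudflare.com URL pointing at your local server:

```shell
# Quick tunnel to a local Ollama instance; URL is printed on startup
cloudflared tunnel --url http://localhost:11434
```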
The classic developer tunneling tool. One command to get a public HTTPS URL pointing at your local Ollama or LM Studio server. Built-in request inspection and replay for debugging. Ideal for quick testing.
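A one-liner sketch for Ollama; the Host-header rewrite is needed because Ollama rejects requests whose Host header doesn't look local:

```shell
# Public HTTPS URL for the local Ollama API
ngrok http 11434 --host-header="localhost:11434"
```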
Self-hosted ChatGPT-style interface that connects to Ollama or any OpenAI-compatible API. Multi-user support, conversation history, RAG integration, and model switching. Access from any browser on your network.
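A common way to run it, assuming Docker and an Ollama instance on the host machine (the host-gateway mapping lets the container reach host-side Ollama):

```shell
# Open WebUI on http://localhost:3000, talking to Ollama on the host
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui ghcr.io/open-webui/open-webui:main
```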
Fully self-hosted reverse proxy solutions. FRP has 100K+ GitHub stars and supports HTTP, TCP, and UDP. Pangolin adds a dashboard UI, identity management, and automatic SSL certificates. Zero vendor dependency.
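A minimal frp client config sketch, forwarding local Ollama through a VPS you control; recent frp versions (0.52+) use TOML, and the addresses and ports here are illustrative:

```toml
# frpc.toml: forward local Ollama to port 21434 on your own VPS
serverAddr = "your-vps.example.com"
serverPort = 7000

[[proxies]]
name = "ollama"
type = "tcp"
localIP = "127.0.0.1"
localPort = 11434
remotePort = 21434
```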
Never expose your Ollama or LM Studio API without an auth layer. Use API keys, reverse proxy auth, or Tailscale's built-in ACLs. A bare API on a public IP will be found by scanners within hours.
Use HTTPS everywhere. Tunneling solutions like Tailscale and Cloudflare handle this automatically. For manual setups, use Let's Encrypt certificates with a reverse proxy like Caddy or Nginx.
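Combining the two points above, a Caddyfile sketch that fronts Ollama with automatic Let's Encrypt TLS plus basic auth; the domain and password hash are placeholders (generate a real hash with `caddy hash-password`), and recent Caddy versions spell the directive `basic_auth`:

```
llm.example.com {
    basic_auth {
        admin $2a$14$...
    }
    reverse_proxy 127.0.0.1:11434
}
```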
Prompt injection is easier against open-weight models because attackers can study the model directly. Implement input validation, length limits, and content filtering before prompts reach your model.
Even for personal use, add rate limits. A runaway script or compromised client can saturate your GPU with requests. Use Nginx rate limiting or Cloudflare's built-in controls.
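An Nginx sketch limiting each client IP to 5 requests per second with a small burst allowance; the zone name and rates are illustrative:

```nginx
# Goes in the http block: one shared zone keyed by client IP
limit_req_zone $binary_remote_addr zone=llm:10m rate=5r/s;

server {
    listen 443 ssl;
    location / {
        limit_req zone=llm burst=10 nodelay;
        proxy_pass http://127.0.0.1:11434;
    }
}
```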
Keep logs of API requests, token usage, and unusual patterns. Tools like Grafana and Prometheus can track inference load, memory usage, and detect anomalies in access patterns.
Aggressively quantized or very small models are more susceptible to prompt injection and jailbreaking. For anything exposed to the internet, prefer 24B+ models that can better follow safety instructions.
Configure Ollama and LM Studio to listen on 127.0.0.1 only. Use a reverse proxy or tunnel for remote access rather than binding directly to 0.0.0.0. This is your first line of defense.
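Ollama reads its bind address from the `OLLAMA_HOST` environment variable; the loopback value below is its default, stated explicitly. LM Studio's equivalent toggle lives in its server settings UI.

```shell
# Keep the Ollama API on the loopback interface only (also the default)
export OLLAMA_HOST=127.0.0.1:11434
```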
Run your inference server in Docker containers. This limits the blast radius if something goes wrong and makes it easier to manage dependencies, updates, and resource allocation.
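A sketch using the official Ollama image, with the API published only on the loopback interface so it stays behind your proxy or tunnel:

```shell
# Containerized Ollama; models persist in the named volume
docker run -d --name ollama \
  -p 127.0.0.1:11434:11434 \
  -v ollama:/root/.ollama \
  ollama/ollama
```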
curl -fsSL https://ollama.com/install.sh | sh
ollama run qwen3:8b
>>> What is quantum computing?
curl http://localhost:11434/api/chat -d '{"model":"qwen3:8b","messages":[{"role":"user","content":"hello"}]}'
curl http://localhost:1234/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"qwen3-8b","messages":[{"role":"user","content":"hello"}]}'
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build && cmake --build build --config Release
./build/bin/llama-server -m model.gguf --port 8080 -ngl 99