Why run OpenClaw with a local LLM (and when not to)?
In our companion guide we showed how to self-host OpenClaw on a ₹99 VPS using the Claude or OpenAI API. That setup works for 80% of users — cheap, easy, fast.
But there's another mode: running the LLM brain locally too. This means OpenClaw isn't just self-hosted — *nothing* about your AI assistant leaves your server. No API calls to Anthropic. No tokens billed by OpenAI. Your conversations, files, and tool use all stay on hardware you control.
Pick local LLM mode if:
- •You handle sensitive data (medical, legal, financial) and can't send it to third-party APIs
- •You want to experiment with custom fine-tuned models
- •You're rate-limited or banned from a major API provider
- •You want predictable cost — pay only for GPU compute, not per-token
Stick with API mode if:
- •You want the best raw model quality (frontier models like Claude Sonnet 4.5 and GPT-4o still outperform open-source models on complex reasoning)
- •You don't need GPU-grade compute 24/7
- •You'd rather not babysit GPU server costs
For most personal use, API mode is more practical. Local LLM mode is for privacy maximalists and ML enthusiasts.
---
What hardware do you actually need?
The model size determines the GPU requirements. Here's the practical map for 2026:
| Model | Params | Min VRAM (4-bit) | Min VRAM (8-bit) | Quality | Best GPU |
|---|---|---|---|---|---|
| Llama 3.2 3B | 3B | 4 GB | 6 GB | Good for simple tasks | RTX 3060 (12GB) |
| Llama 3.1 8B | 8B | 8 GB | 12 GB | Solid all-rounder | RTX 3090 (24GB) |
| Mistral Small 22B | 22B | 16 GB | 24 GB | Strong reasoning | RTX 3090 / 4090 |
| Llama 3.3 70B | 70B | 40 GB | 70 GB | GPT-4 class | A100 80GB |
| Qwen 2.5 32B | 32B | 24 GB | 40 GB | Excellent code + math | RTX 4090 / A100 |
Sweet spot for OpenClaw assistant workloads: Llama 3.1 8B or Mistral Small 22B on a single RTX 3090. Strong enough to handle agent reasoning, browser automation planning, multi-step tool use. Fast enough that the assistant feels responsive (~30-60 tokens/sec).
On AIC Cloud, RTX 3090 (24 GB VRAM) is available from ₹27.74/hour, billed per minute. If you only run the agent during your working hours (~8 hrs/day, ~240 hrs/month), that's roughly ₹6,658/month — vs an always-on RTX 3090 setup at ~₹19,973/month.
For 24×7 always-on, consider dedicated GPU hosting (see plans on /cloud-gpu) — typically more cost-effective than hourly above 12 hours/day average.
---
Step 1: Provision a GPU instance
Sign up at aiccloud.in, top up via UPI (₹500 covers about 40 hours of RTX 3090 time — plenty to test). Then:
1. Go to Cloud GPU in your dashboard
2. Pick RTX 3090 (24 GB VRAM) for a good all-rounder, or RTX 4090 if you want extra headroom
3. Choose Ubuntu 22.04 + CUDA 12.x as the template
4. Click Deploy — instance is ready in ~60 seconds with NVIDIA drivers pre-installed
Get the SSH credentials from your dashboard and connect:
ssh root@YOUR_GPU_INSTANCE_IP
Confirm the GPU is visible:
nvidia-smi
You should see your RTX 3090 listed with 24 GB VRAM available.
---
Step 2: Install Ollama (the easiest local LLM runtime)
Ollama is the simplest way to serve local LLMs. It handles downloading, quantising, and serving models with a single command.
curl -fsSL https://ollama.com/install.sh | sh
systemctl enable --now ollama
Pull a model — let's start with Llama 3.1 8B (sweet spot for agent workloads):
ollama pull llama3.1:8b
Test that it works:
ollama run llama3.1:8b "Hello, are you working?"
You should see a streaming response in ~1-2 seconds.
Ollama exposes an OpenAI-compatible API on http://localhost:11434/v1 — this is what makes integration with OpenClaw trivial.
---
Step 3: Install OpenClaw
Same flow as the API-mode guide:
apt update && apt install -y git docker.io
systemctl enable --now docker
docker pull openclaw/openclaw:latest
docker run -d --name openclaw --restart unless-stopped \
--network host \
-v ~/.openclaw:/root/.openclaw \
openclaw/openclaw:latest
We use --network host so OpenClaw can reach Ollama on localhost:11434 without exposing the LLM port externally.
---
Step 4: Point OpenClaw at your local LLM
Edit ~/.openclaw/config.yaml to point at Ollama's OpenAI-compatible endpoint:
llm:
provider: openai # Ollama speaks OpenAI's API protocol
model: llama3.1:8b # or mistral, qwen2.5, etc.
api_key: ollama # any non-empty string works
base_url: http://localhost:11434/v1
max_tokens: 4096
temperature: 0.4
Restart OpenClaw:
docker restart openclaw
Now every reasoning step OpenClaw makes hits your own GPU instead of Anthropic or OpenAI. Your conversations never leave this server.
---
Step 5: Connect your messaging app
Same as API-mode setup — Telegram is fastest:
1. Open Telegram, message @BotFather, send /newbot, follow prompts
2. Copy the bot token
3. Paste into OpenClaw config under integrations.telegram.token
4. Restart OpenClaw
Send your bot a message to verify it routes through your local Llama.
---
Step 6: Tune for speed and quality
Llama 3.1 8B on RTX 3090 typically outputs 50-80 tokens/second — fast enough that the agent feels snappy. A few tweaks help:
Use quantised models
ollama pull llama3.1:8b-instruct-q5_K_M — 5-bit quantisation, fits comfortably in 12 GB VRAM, minimal quality loss vs full FP16.
Pre-warm the model
ollama run llama3.1:8b "" --keepalive 60m
This keeps the model resident in VRAM for an hour (or longer with --keepalive 24h) — avoids 5-10 second cold-start latency on first request after idle.
Match model to task
Configure OpenClaw to route simple tasks to a smaller model:
llm_routing:
default: llama3.1:8b
simple_tasks: llama3.2:3b # faster, used for routine intents
reasoning: mistral-small # used when the agent needs to plan multi-step actions
---
Step 7: Cost optimisation for 24×7 agents
Running an RTX 3090 24/7 at ₹27.74/hour ≈ ₹19,973/month. For an always-on personal assistant, that's expensive — auto-stop when idle saves dramatically.
Two ways to cut cost:
Option A — Auto-stop when idle
OpenClaw doesn't need the GPU when no one's talking to it. Use a simple cron job to stop the GPU instance when there's been no activity for X minutes, and resume on next message:
# Pseudo-script: check Ollama logs for last request, stop GPU if >30 min idle
# Wake the GPU via AIC Cloud's API when next message arrives
If you only actually use the agent ~4 hours/day spread across the day, cost drops to ~₹1,440/month.
Option B — Use a cheaper / smaller GPU
Llama 3.1 8B runs fine on RTX 3060 (12 GB, ~₹6/hour) or even shared GPU instances. Drop down if you don't need 24 GB headroom.
Option C — Dedicated server with GPU add-on
For 24×7 production agents, a dedicated server with a permanent GPU is cheaper at scale than hourly rental.
---
Cost comparison
| Setup | Monthly cost | Notes |
|---|---|---|
| API mode (Claude + ₹99 VPS) | ₹300-600 | Cheapest, best quality, but data leaves your server |
| Local LLM, 4 hr/day on RTX 3090 | ~₹1,440 | Full privacy, good quality (Llama 3.1 8B) |
| Local LLM, 24×7 RTX 3090 | ~₹8,640 | Same as above, always responsive |
| Local LLM, dedicated GPU server | varies, often cheaper at scale | Best for production-grade always-on agents |
---
Troubleshooting
Ollama crashes / OOM — Switch to a more aggressively quantised model (4-bit instead of 8-bit), or drop to a smaller model (8B instead of 22B).
Slow responses (<10 tokens/sec) — You may be running on CPU instead of GPU. Confirm with nvidia-smi that ollama shows up in the GPU process list. If not, reinstall Ollama after confirming CUDA drivers are present.
Quality issues vs Claude/GPT-4o — Expected on complex multi-step reasoning. For tasks where it matters, route those to Claude API specifically (OpenClaw supports per-task model selection). Local LLMs handle 80% of everyday agent tasks fine; the other 20% benefit from frontier models.
GPU running 24/7 burning your wallet — Implement the auto-stop pattern from Step 7, or switch to a dedicated server.
---
When to graduate to dedicated hardware
If you find yourself:
- •Running the GPU 16+ hours/day consistently
- •Wanting to host multiple OpenClaw instances on the same hardware
- •Needing to also host other AI workloads (Stable Diffusion, fine-tuning, etc.)
→ Look at AIC Cloud's dedicated server plans with GPU add-ons. Cost-effective at scale, full hardware isolation, no hourly billing volatility.
---
READY TO GET STARTED?
Deploy your first VPS for ₹99/mo
India-based servers, INR billing, no lock-in contracts. Get started in minutes.
