Why use a local LLM with OpenClaw instead of Claude or OpenAI?

Three reasons: (1) Privacy — your data never leaves your server, which matters for sensitive workflows like legal, medical, or proprietary code. (2) No per-token billing — pay for GPU compute time instead, which can be cheaper at high usage. (3) Experimentation — try custom fine-tuned models, uncensored models, or specialized models for your domain that aren't available via API.

What's the minimum GPU I need to run OpenClaw with a local LLM?

For a usable agent experience, you need at least 12 GB VRAM (RTX 3060 12GB or better) to run a quantised Llama 3.1 8B model. For better reasoning quality with Mistral Small 22B or similar, target 24 GB VRAM (RTX 3090). For frontier-quality 70B models, you need 40+ GB VRAM (A100).

Can I run OpenClaw with a local LLM 24/7 cost-effectively?

On hourly GPU at ₹27.74/hr, 24×7 RTX 3090 costs ~₹19,973/month — expensive for personal use. The cost-effective patterns are: (a) auto-stop the GPU when idle and resume on next message, (b) use a smaller / shared GPU if your model fits, (c) use a dedicated server with GPU add-on for predictable monthly pricing. For light personal use, API mode (₹300-600/month) is much cheaper.

How does local LLM quality compare to Claude or GPT-4o?

On routine tasks (write a message, summarize an email, schedule a meeting, run a command) — local Llama 3.1 8B / Mistral 22B are very close to Claude/GPT-4o. On complex multi-step reasoning, long-context understanding, or nuanced writing, frontier API models still win. A practical setup: route simple tasks to local LLM (free after GPU cost, total privacy), route hard tasks to Claude API as a fallback.

Does Ollama really work with OpenClaw out of the box?

Yes — Ollama exposes an OpenAI-compatible API on port 11434. Just point OpenClaw's LLM config at http://localhost:11434/v1 with any string as the API key, and pick a model you've pulled with 'ollama pull'. No custom integration needed.

Which model is best for OpenClaw — Llama, Mistral, Qwen, or Gemma?

For agent workloads (tool use, multi-step planning, function calling), Llama 3.1 8B is the practical default — well-supported in Ollama, fast on a single RTX 3090, strong on instruction following. Mistral Small 22B is better for reasoning if you have 24 GB+ VRAM. Qwen 2.5 32B is excellent for code-heavy automations. Test each on your actual workload — quality differences are workload-specific.

How do I stop the GPU instance to save money when not in use?

In your AIC Cloud dashboard → Cloud GPU → click 'Stop' on the instance. Billing pauses immediately (per-minute granularity). When you're ready to use OpenClaw again, click 'Start' — Ollama and OpenClaw will auto-resume via Docker's restart policy in under 60 seconds. For full automation, we have an API you can call from a cron job to stop/start based on activity.

Can OpenClaw use both a local LLM and an API LLM at the same time?

Yes — OpenClaw supports per-task model routing. Configure local Llama as the default for privacy-sensitive or routine work, and route specific tools or skills to Claude/GPT-4o API when you need higher quality. This hybrid approach gives you privacy where it matters and frontier quality where it counts.

Self-Host OpenClaw with a Local LLM on a Cloud GPU (Llama / Mistral, ₹12/hr, 2026)

Why run OpenClaw with a local LLM (and when not to)?

In our companion guide we showed how to self-host OpenClaw on a ₹99 VPS using the Claude or OpenAI API. That setup works for 80% of users — cheap, easy, fast.

But there's another mode: running the LLM brain locally too. This means OpenClaw isn't just self-hosted — *nothing* about your AI assistant leaves your server. No API calls to Anthropic. No tokens billed by OpenAI. Your conversations, files, and tool use all stay on hardware you control.

Pick local LLM mode if:

•You handle sensitive data (medical, legal, financial) and can't send it to third-party APIs
•You want to experiment with custom fine-tuned models
•You're rate-limited or banned from a major API provider
•You want predictable cost — pay only for GPU compute, not per-token

Stick with API mode if:

•You want the best raw model quality (frontier models like Claude Sonnet 4.5 and GPT-4o still outperform open-source models on complex reasoning)
•You don't need GPU-grade compute 24/7
•You'd rather not babysit GPU server costs

For most personal use, API mode is more practical. Local LLM mode is for privacy maximalists and ML enthusiasts.

---

What hardware do you actually need?

The model size determines the GPU requirements. Here's the practical map for 2026:

Model	Params	Min VRAM (4-bit)	Min VRAM (8-bit)	Quality	Best GPU
Llama 3.2 3B	3B	4 GB	6 GB	Good for simple tasks	RTX 3060 (12GB)
Llama 3.1 8B	8B	8 GB	12 GB	Solid all-rounder	RTX 3090 (24GB)
Mistral Small 22B	22B	16 GB	24 GB	Strong reasoning	RTX 3090 / 4090
Llama 3.3 70B	70B	40 GB	70 GB	GPT-4 class	A100 80GB
Qwen 2.5 32B	32B	24 GB	40 GB	Excellent code + math	RTX 4090 / A100

Sweet spot for OpenClaw assistant workloads: Llama 3.1 8B or Mistral Small 22B on a single RTX 3090. Strong enough to handle agent reasoning, browser automation planning, multi-step tool use. Fast enough that the assistant feels responsive (~30-60 tokens/sec).

On AIC Cloud, RTX 3090 (24 GB VRAM) is available from ₹27.74/hour, billed per minute. If you only run the agent during your working hours (~8 hrs/day, ~240 hrs/month), that's roughly ₹6,658/month — vs an always-on RTX 3090 setup at ~₹19,973/month.

For 24×7 always-on, consider dedicated GPU hosting (see plans on /cloud-gpu) — typically more cost-effective than hourly above 12 hours/day average.

---

Step 1: Provision a GPU instance

1. Go to Cloud GPU in your dashboard

2. Pick RTX 3090 (24 GB VRAM) for a good all-rounder, or RTX 4090 if you want extra headroom

3. Choose Ubuntu 22.04 + CUDA 12.x as the template

4. Click Deploy — instance is ready in ~60 seconds with NVIDIA drivers pre-installed

Get the SSH credentials from your dashboard and connect:

ssh root@YOUR_GPU_INSTANCE_IP

Confirm the GPU is visible:

nvidia-smi

You should see your RTX 3090 listed with 24 GB VRAM available.

---

Step 2: Install Ollama (the easiest local LLM runtime)

Ollama is the simplest way to serve local LLMs. It handles downloading, quantising, and serving models with a single command.

curl -fsSL https://ollama.com/install.sh | sh
systemctl enable --now ollama

Pull a model — let's start with Llama 3.1 8B (sweet spot for agent workloads):

ollama pull llama3.1:8b

Test that it works:

ollama run llama3.1:8b "Hello, are you working?"

You should see a streaming response in ~1-2 seconds.

Ollama exposes an OpenAI-compatible API on http://localhost:11434/v1 — this is what makes integration with OpenClaw trivial.

---

Step 3: Install OpenClaw

Same flow as the API-mode guide:

apt update && apt install -y git docker.io
systemctl enable --now docker

docker pull openclaw/openclaw:latest
docker run -d --name openclaw --restart unless-stopped \
  --network host \
  -v ~/.openclaw:/root/.openclaw \
  openclaw/openclaw:latest

We use --network host so OpenClaw can reach Ollama on localhost:11434 without exposing the LLM port externally.

---

Step 4: Point OpenClaw at your local LLM

Edit ~/.openclaw/config.yaml to point at Ollama's OpenAI-compatible endpoint:

llm:
  provider: openai             # Ollama speaks OpenAI's API protocol
  model: llama3.1:8b           # or mistral, qwen2.5, etc.
  api_key: ollama              # any non-empty string works
  base_url: http://localhost:11434/v1
  max_tokens: 4096
  temperature: 0.4

Restart OpenClaw:

docker restart openclaw

Now every reasoning step OpenClaw makes hits your own GPU instead of Anthropic or OpenAI. Your conversations never leave this server.

---

Step 5: Connect your messaging app

Same as API-mode setup — Telegram is fastest:

1. Open Telegram, message @BotFather, send /newbot, follow prompts

2. Copy the bot token

3. Paste into OpenClaw config under integrations.telegram.token

4. Restart OpenClaw

Send your bot a message to verify it routes through your local Llama.

---

Step 6: Tune for speed and quality

Llama 3.1 8B on RTX 3090 typically outputs 50-80 tokens/second — fast enough that the agent feels snappy. A few tweaks help:

Use quantised models

ollama pull llama3.1:8b-instruct-q5_K_M — 5-bit quantisation, fits comfortably in 12 GB VRAM, minimal quality loss vs full FP16.

Pre-warm the model

ollama run llama3.1:8b "" --keepalive 60m

This keeps the model resident in VRAM for an hour (or longer with --keepalive 24h) — avoids 5-10 second cold-start latency on first request after idle.

Match model to task

Configure OpenClaw to route simple tasks to a smaller model:

llm_routing:
  default: llama3.1:8b
  simple_tasks: llama3.2:3b   # faster, used for routine intents
  reasoning: mistral-small    # used when the agent needs to plan multi-step actions

---

Step 7: Cost optimisation for 24×7 agents

Running an RTX 3090 24/7 at ₹27.74/hour ≈ ₹19,973/month. For an always-on personal assistant, that's expensive — auto-stop when idle saves dramatically.

Two ways to cut cost:

Option A — Auto-stop when idle

OpenClaw doesn't need the GPU when no one's talking to it. Use a simple cron job to stop the GPU instance when there's been no activity for X minutes, and resume on next message:

# Pseudo-script: check Ollama logs for last request, stop GPU if >30 min idle
# Wake the GPU via AIC Cloud's API when next message arrives

If you only actually use the agent ~4 hours/day spread across the day, cost drops to ~₹1,440/month.

Option B — Use a cheaper / smaller GPU

Llama 3.1 8B runs fine on RTX 3060 (12 GB, ~₹6/hour) or even shared GPU instances. Drop down if you don't need 24 GB headroom.

Option C — Dedicated server with GPU add-on

For 24×7 production agents, a dedicated server with a permanent GPU is cheaper at scale than hourly rental.

---

Cost comparison

Setup	Monthly cost	Notes
API mode (Claude + ₹99 VPS)	₹300-600	Cheapest, best quality, but data leaves your server
Local LLM, 4 hr/day on RTX 3090	~₹1,440	Full privacy, good quality (Llama 3.1 8B)
Local LLM, 24×7 RTX 3090	~₹8,640	Same as above, always responsive
Local LLM, dedicated GPU server	varies, often cheaper at scale	Best for production-grade always-on agents

---

Troubleshooting

Ollama crashes / OOM — Switch to a more aggressively quantised model (4-bit instead of 8-bit), or drop to a smaller model (8B instead of 22B).

Slow responses (<10 tokens/sec) — You may be running on CPU instead of GPU. Confirm with nvidia-smi that ollama shows up in the GPU process list. If not, reinstall Ollama after confirming CUDA drivers are present.

Quality issues vs Claude/GPT-4o — Expected on complex multi-step reasoning. For tasks where it matters, route those to Claude API specifically (OpenClaw supports per-task model selection). Local LLMs handle 80% of everyday agent tasks fine; the other 20% benefit from frontier models.

GPU running 24/7 burning your wallet — Implement the auto-stop pattern from Step 7, or switch to a dedicated server.

---

When to graduate to dedicated hardware

If you find yourself:

•Running the GPU 16+ hours/day consistently
•Wanting to host multiple OpenClaw instances on the same hardware
•Needing to also host other AI workloads (Stable Diffusion, fine-tuning, etc.)

→ Look at AIC Cloud's dedicated server plans with GPU add-ons. Cost-effective at scale, full hardware isolation, no hourly billing volatility.

---

Spin up an RTX 3090 cloud GPU instance for OpenClaw →

Tags:OpenClawAI AgentsGPULocal LLMSelf-HostingLlamaTutorial

READY TO GET STARTED?

Deploy your first VPS for ₹99/mo

India-based servers, INR billing, no lock-in contracts. Get started in minutes.

View VPS Plans →