Skip to content
TUTORIALS

Self-Host OpenClaw with a Local LLM on a Cloud GPU (Llama / Mistral, ₹12/hr, 2026)

AIC Cloud Team23 May 202610 min read

Why run OpenClaw with a local LLM (and when not to)?

In our companion guide we showed how to self-host OpenClaw on a ₹99 VPS using the Claude or OpenAI API. That setup works for 80% of users — cheap, easy, fast.

But there's another mode: running the LLM brain locally too. This means OpenClaw isn't just self-hosted — *nothing* about your AI assistant leaves your server. No API calls to Anthropic. No tokens billed by OpenAI. Your conversations, files, and tool use all stay on hardware you control.

Pick local LLM mode if:

  • You handle sensitive data (medical, legal, financial) and can't send it to third-party APIs
  • You want to experiment with custom fine-tuned models
  • You're rate-limited or banned from a major API provider
  • You want predictable cost — pay only for GPU compute, not per-token

Stick with API mode if:

  • You want the best raw model quality (frontier models like Claude Sonnet 4.5 and GPT-4o still outperform open-source models on complex reasoning)
  • You don't need GPU-grade compute 24/7
  • You'd rather not babysit GPU server costs

For most personal use, API mode is more practical. Local LLM mode is for privacy maximalists and ML enthusiasts.

---

What hardware do you actually need?

The model size determines the GPU requirements. Here's the practical map for 2026:

ModelParamsMin VRAM (4-bit)Min VRAM (8-bit)QualityBest GPU
Llama 3.2 3B3B4 GB6 GBGood for simple tasksRTX 3060 (12GB)
Llama 3.1 8B8B8 GB12 GBSolid all-rounderRTX 3090 (24GB)
Mistral Small 22B22B16 GB24 GBStrong reasoningRTX 3090 / 4090
Llama 3.3 70B70B40 GB70 GBGPT-4 classA100 80GB
Qwen 2.5 32B32B24 GB40 GBExcellent code + mathRTX 4090 / A100

Sweet spot for OpenClaw assistant workloads: Llama 3.1 8B or Mistral Small 22B on a single RTX 3090. Strong enough to handle agent reasoning, browser automation planning, multi-step tool use. Fast enough that the assistant feels responsive (~30-60 tokens/sec).

On AIC Cloud, RTX 3090 (24 GB VRAM) is available from ₹27.74/hour, billed per minute. If you only run the agent during your working hours (~8 hrs/day, ~240 hrs/month), that's roughly ₹6,658/month — vs an always-on RTX 3090 setup at ~₹19,973/month.

For 24×7 always-on, consider dedicated GPU hosting (see plans on /cloud-gpu) — typically more cost-effective than hourly above 12 hours/day average.

---

Step 1: Provision a GPU instance

Sign up at aiccloud.in, top up via UPI (₹500 covers about 40 hours of RTX 3090 time — plenty to test). Then:

1. Go to Cloud GPU in your dashboard

2. Pick RTX 3090 (24 GB VRAM) for a good all-rounder, or RTX 4090 if you want extra headroom

3. Choose Ubuntu 22.04 + CUDA 12.x as the template

4. Click Deploy — instance is ready in ~60 seconds with NVIDIA drivers pre-installed

Get the SSH credentials from your dashboard and connect:

ssh root@YOUR_GPU_INSTANCE_IP

Confirm the GPU is visible:

nvidia-smi

You should see your RTX 3090 listed with 24 GB VRAM available.

---

Step 2: Install Ollama (the easiest local LLM runtime)

Ollama is the simplest way to serve local LLMs. It handles downloading, quantising, and serving models with a single command.

curl -fsSL https://ollama.com/install.sh | sh
systemctl enable --now ollama

Pull a model — let's start with Llama 3.1 8B (sweet spot for agent workloads):

ollama pull llama3.1:8b

Test that it works:

ollama run llama3.1:8b "Hello, are you working?"

You should see a streaming response in ~1-2 seconds.

Ollama exposes an OpenAI-compatible API on http://localhost:11434/v1 — this is what makes integration with OpenClaw trivial.

---

Step 3: Install OpenClaw

Same flow as the API-mode guide:

apt update && apt install -y git docker.io
systemctl enable --now docker

docker pull openclaw/openclaw:latest
docker run -d --name openclaw --restart unless-stopped \
  --network host \
  -v ~/.openclaw:/root/.openclaw \
  openclaw/openclaw:latest

We use --network host so OpenClaw can reach Ollama on localhost:11434 without exposing the LLM port externally.

---

Step 4: Point OpenClaw at your local LLM

Edit ~/.openclaw/config.yaml to point at Ollama's OpenAI-compatible endpoint:

llm:
  provider: openai             # Ollama speaks OpenAI's API protocol
  model: llama3.1:8b           # or mistral, qwen2.5, etc.
  api_key: ollama              # any non-empty string works
  base_url: http://localhost:11434/v1
  max_tokens: 4096
  temperature: 0.4

Restart OpenClaw:

docker restart openclaw

Now every reasoning step OpenClaw makes hits your own GPU instead of Anthropic or OpenAI. Your conversations never leave this server.

---

Step 5: Connect your messaging app

Same as API-mode setup — Telegram is fastest:

1. Open Telegram, message @BotFather, send /newbot, follow prompts

2. Copy the bot token

3. Paste into OpenClaw config under integrations.telegram.token

4. Restart OpenClaw

Send your bot a message to verify it routes through your local Llama.

---

Step 6: Tune for speed and quality

Llama 3.1 8B on RTX 3090 typically outputs 50-80 tokens/second — fast enough that the agent feels snappy. A few tweaks help:

Use quantised models

ollama pull llama3.1:8b-instruct-q5_K_M — 5-bit quantisation, fits comfortably in 12 GB VRAM, minimal quality loss vs full FP16.

Pre-warm the model

ollama run llama3.1:8b "" --keepalive 60m

This keeps the model resident in VRAM for an hour (or longer with --keepalive 24h) — avoids 5-10 second cold-start latency on first request after idle.

Match model to task

Configure OpenClaw to route simple tasks to a smaller model:

llm_routing:
  default: llama3.1:8b
  simple_tasks: llama3.2:3b   # faster, used for routine intents
  reasoning: mistral-small    # used when the agent needs to plan multi-step actions

---

Step 7: Cost optimisation for 24×7 agents

Running an RTX 3090 24/7 at ₹27.74/hour ≈ ₹19,973/month. For an always-on personal assistant, that's expensive — auto-stop when idle saves dramatically.

Two ways to cut cost:

Option A — Auto-stop when idle

OpenClaw doesn't need the GPU when no one's talking to it. Use a simple cron job to stop the GPU instance when there's been no activity for X minutes, and resume on next message:

# Pseudo-script: check Ollama logs for last request, stop GPU if >30 min idle
# Wake the GPU via AIC Cloud's API when next message arrives

If you only actually use the agent ~4 hours/day spread across the day, cost drops to ~₹1,440/month.

Option B — Use a cheaper / smaller GPU

Llama 3.1 8B runs fine on RTX 3060 (12 GB, ~₹6/hour) or even shared GPU instances. Drop down if you don't need 24 GB headroom.

Option C — Dedicated server with GPU add-on

For 24×7 production agents, a dedicated server with a permanent GPU is cheaper at scale than hourly rental.

---

Cost comparison

SetupMonthly costNotes
API mode (Claude + ₹99 VPS)₹300-600Cheapest, best quality, but data leaves your server
Local LLM, 4 hr/day on RTX 3090~₹1,440Full privacy, good quality (Llama 3.1 8B)
Local LLM, 24×7 RTX 3090~₹8,640Same as above, always responsive
Local LLM, dedicated GPU servervaries, often cheaper at scaleBest for production-grade always-on agents

---

Troubleshooting

Ollama crashes / OOM — Switch to a more aggressively quantised model (4-bit instead of 8-bit), or drop to a smaller model (8B instead of 22B).

Slow responses (<10 tokens/sec) — You may be running on CPU instead of GPU. Confirm with nvidia-smi that ollama shows up in the GPU process list. If not, reinstall Ollama after confirming CUDA drivers are present.

Quality issues vs Claude/GPT-4o — Expected on complex multi-step reasoning. For tasks where it matters, route those to Claude API specifically (OpenClaw supports per-task model selection). Local LLMs handle 80% of everyday agent tasks fine; the other 20% benefit from frontier models.

GPU running 24/7 burning your wallet — Implement the auto-stop pattern from Step 7, or switch to a dedicated server.

---

When to graduate to dedicated hardware

If you find yourself:

  • Running the GPU 16+ hours/day consistently
  • Wanting to host multiple OpenClaw instances on the same hardware
  • Needing to also host other AI workloads (Stable Diffusion, fine-tuning, etc.)

→ Look at AIC Cloud's dedicated server plans with GPU add-ons. Cost-effective at scale, full hardware isolation, no hourly billing volatility.

---

Spin up an RTX 3090 cloud GPU instance for OpenClaw →

Tags:OpenClawAI AgentsGPULocal LLMSelf-HostingLlamaTutorial

READY TO GET STARTED?

Deploy your first VPS for ₹99/mo

India-based servers, INR billing, no lock-in contracts. Get started in minutes.

View VPS Plans →
Back to all articles

Chat with us

We reply within minutes