Published on April 28, 2026
I use AI assistants every day for coding — Claude Code, GPT, the Gemini API. Throughout the first half of 2026, I kept accumulating three frustrations with cloud APIs.
Token pressure. Paid plans have been tightening their available limits with each renewal cycle. Every serious project eventually hit the ceiling and required either waiting for the reset or buying additional packages. Rising cost, decreasing control.
Models disappear. In an electrical transformer auditing application, gemini-1.5-pro was discontinued overnight. gemini-2.0-flash became blocked for new users. The only option left was migrating to gemini-2.5-flash while it lasts. When the pipeline depends on a cloud-hosted model, it has an expiration date.
Sensitive data leaving the house. Every photo of a transformer nameplate sent to Google's API is client data leaving the network.
The answer, in three projects built throughout the first half of 2026, was to bring the stack home: a dedicated LLM server, a client pointing at it instead of the Anthropic API, and a real application proving it works in production.
There is an honest trade-off: context costs memory. The model+context combo must fit within the available VRAM, and that equation constrains choices. But the gain — control over what runs, what leaves the machine, and how much it costs per token (zero) — makes up for it.
The three projects: ancalagon-llm (the server setup with an RTX 4070 Ti SUPER and llama.cpp tuned for MoE), local-claude (a wrapper that makes Claude Code use local models), and ocr-pipeline-local (a vision pipeline replacing the Gemini API).
[Stats strip from the original post: Qwen3.6-27B local, 100% on GPU · speculative-decoding speedup (RTX 4070 Ti SUPER) · Qwen3-Coder 30B (tuned MoE) · model fitting in 16 GB of VRAM (TQ3_4S) · automated tests in the OCR pipeline · field accuracy (real feeder) · inference backends supported · external API calls in production]
The first piece of the stack is the server. The hardware is a dual-boot PC (Windows + Ubuntu Server 24.04) with a Ryzen 7600X, 32 GB of RAM, and an RTX 4070 Ti SUPER 16 GB. The Mac connects via Tailscale at 100.64.0.10.
LM Studio was convenient, but it was leaving roughly half of the achievable tokens per second on the table. During inference, GPU utilization sat at 30–34%, drawing about 70 W out of a 285 W TGP. Two bottlenecks were identified:
- MoE expert placement. The efficient split for a MoE model is to keep only the expert tensors on the CPU and everything else on the GPU; the --n-cpu-moe flag in llama.cpp does exactly that. LM Studio doesn't expose it.
- A dense model that doesn't fit. Qwen3.6-27B in Q4_K_M doesn't fit entirely in 16 GB; requantized as TQ3_4S (the turbo-tan/llama.cpp-tq3 fork), it drops to 13 GB and fits 100% on GPU.

| Configuration | GPU util | Power | tok/s gen |
|---|---|---|---|
| LM Studio — qwen3-coder Q4_K_M | 30% | 67 W | 64.5 |
| llama.cpp upstream — coder -ncmoe 10 | 36% | 101 W | 81.5 |
| LM Studio — Qwen3.6-27B Q4_K_M | 34% | 94 W | 13.7 |
| llama.cpp TQ3 fork — Qwen3.6-27B-TQ3_4S | 96% | 292 W | 36.8 |
The server exposes three models via systemd, each in a service with Conflicts= declared — only one is up at a time, and systemd stops the previous one automatically:

- llama-coder.service: Qwen3-Coder 30B Q4_K_M (--n-cpu-moe 16, ctx 96K)
- llama-qwen36.service: Qwen3.6-27B TQ3_4S (100% on GPU, ctx 40K)
- llama-gemma4.service: Gemma 4 26B Q4_K_M (--n-cpu-moe 8, ctx 96K)

All on the same port 1234 (the same as LM Studio) — existing clients don't need to change URLs. An lmswitch wrapper alternates between services with health polling. SSH aliases on the Mac (llcoder, llq36, llgemma4, lloff) make remote control trivial.
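For illustration, here is roughly what an lmswitch-style switch has to do. This is a sketch, not the real wrapper (which is a shell script): the polling endpoint, timeout, and SSH invocation are assumptions; the service names are the ones in the diagram below.

```python
# Hypothetical sketch of an lmswitch-style model switch (the real wrapper is bash).
# Assumes SSH access to the server and the systemd user units named in the diagram.
import subprocess
import time
import urllib.request

HOST = "100.64.0.10"
SERVICES = {
    "coder": "llama-coder.service",
    "qwen36": "llama-qwen36.service",
    "gemma4": "llama-gemma4.service",
}

def switch(name: str, timeout: float = 180.0) -> None:
    # Starting one unit stops whichever was running, thanks to Conflicts=
    subprocess.run(
        ["ssh", HOST, "systemctl", "--user", "start", SERVICES[name]],
        check=True,
    )
    # Health polling: wait until the OpenAI-compatible API answers again
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            urllib.request.urlopen(f"http://{HOST}:1234/v1/models", timeout=2)
            return
        except OSError:
            time.sleep(2)
    raise TimeoutError(f"{SERVICES[name]} did not answer on :1234 within {timeout}s")

if __name__ == "__main__":
    switch("coder")
```

The Conflicts= declaration is what makes the switch a single command: systemd tears down the previous model before the new one comes up, so the client only has to wait for the port to answer again.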
Glaurung (Mac) Ancalagon (Ubuntu)
100.64.0.10 / :1234
aliases: systemd --user:
llcoder ──ssh──> llama-coder.service ──┐
llq36 ──ssh──> llama-qwen36.service ─┤ Conflicts=
llgemma4 ──ssh──> llama-gemma4.service ─┘ (only one up)
lloff ──ssh──> │
▼
:1234 (OpenAI-compat API)
▲
curl http://100.64.0.10:1234 ────────────┘
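Any OpenAI-compatible client can hit that endpoint. A minimal sketch using only the standard library (the model name is a placeholder; llama-server answers with whichever model the active service loaded):

```python
# Minimal chat call against the OpenAI-compatible endpoint on Ancalagon.
# The "model" field is a placeholder: the server uses whatever is currently loaded.
import json
import urllib.request

payload = {
    "model": "local",
    "messages": [{"role": "user", "content": "Summarize what --n-cpu-moe does in one sentence."}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://100.64.0.10:1234/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)
print(reply["choices"][0]["message"]["content"])
```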
Claude Code is an excellent coding agent, but it only talks to the Anthropic API. I wanted the agent, not the vendor lock-in. local-claude is a bash wrapper that injects environment variables (ANTHROPIC_BASE_URL, CLAUDE_CONFIG_DIR=~/.claude-local to isolate config) and routes traffic to a local or remote OpenAI-compatible server. The original claude stays untouched.
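The actual wrapper is a few lines of bash; here is a Python sketch of the same idea (the environment variable names are the ones above, everything else is an assumption about how you'd invoke it):

```python
# What local-claude boils down to: launch the stock `claude` binary with a
# redirected API base URL and an isolated config directory. Sketch only; the
# real wrapper is bash and also handles backend selection and model startup.
import os
import subprocess
import sys

env = os.environ.copy()
env["ANTHROPIC_BASE_URL"] = "http://100.64.0.10:1234"             # local OpenAI-compatible server
env["CLAUDE_CONFIG_DIR"] = os.path.expanduser("~/.claude-local")  # keep the original config untouched

subprocess.run(["claude", *sys.argv[1:]], env=env)
```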
It supports multiple backends, among them:

- local: starts llama-server automatically
- remote: connects to an already-running server (e.g., the Ancalagon service)

When a smaller model from the same family exists in the directory, the script enables speculative decoding: a small “draft” model generates candidate tokens that the large model verifies in batch. Accepted tokens are free; rejected ones are regenerated normally.
| Platform (Qwen2.5-7B Q8_0 + 0.5B Q8_0 draft) | No draft | With draft | Speedup |
|---|---|---|---|
| Apple M4 Pro (24 GB) | 29 t/s | 57 t/s | ~2× |
| RTX 4070 Ti SUPER (16 GB) | 29 t/s | 177 t/s | ~6× |
Insight: the smallest draft wins. The 3B draft is slower than the 1.5B despite a higher acceptance rate — verification overhead dominates.
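To make the mechanism concrete, here is a toy sketch of the accept/verify logic (greedy variant, with stand-in "models"). Note the caveat in the comments: in llama.cpp the target scores all draft positions in one batched forward pass, which is where the speedup actually comes from.

```python
# Toy sketch of speculative decoding's accept/verify logic (greedy variant).
# In a real implementation the target model verifies all k draft positions in
# ONE batched forward pass; calling it token by token, as below, would not be faster.
from typing import Callable, List

NextToken = Callable[[List[int]], int]   # greedy "next token" for a toy model

def speculative_step(prefix: List[int], draft: NextToken, target: NextToken, k: int = 4) -> List[int]:
    # 1) the small draft model proposes k tokens cheaply
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    # 2) the large target model verifies them; keep the longest agreeing prefix
    accepted, ctx = [], list(prefix)
    for t in proposed:
        expected = target(ctx)
        if expected != t:
            accepted.append(expected)    # first disagreement: take the target's token, stop
            return accepted
        accepted.append(t)
        ctx.append(t)
    accepted.append(target(ctx))         # all k accepted: one bonus token for free
    return accepted

# toy check: a "target" that counts up by 1, a "draft" that agrees until token 3
target = lambda ctx: (ctx[-1] + 1) if ctx else 0
draft = lambda ctx: (ctx[-1] + 1) if (not ctx or ctx[-1] < 3) else 99
print(speculative_step([0, 1], draft, target, k=4))   # [2, 3, 4]: two accepted, then the target's own token
```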
There is also an Apple Intelligence backend, but its context window is 4096 tokens. Claude Code's system prompt and tools alone already total around 27K tokens — seven times more than fits. That backend therefore runs in chat-only mode (--bare --tools ""): you can have a conversation, but the agent cannot use tools (edit files, run commands). It's a physical reminder that context = memory.
On the Mac, the srl-coder alias connects directly to whichever service is up (specstory run claude -c "local-claude --backend remote --port 1234"). Zero process management on the client: the Ancalagon service is the infrastructure, local-claude is just the bridge.
The first two pieces are development tools. The third is a concrete application: a vision pipeline replacing the Google Gemini API in an electrical transformer auditing system. Field photos of equipment nameplates go in, structured JSON with 11 fields comes out (rated power, manufacturer, serial number, manufacturing date, fuse link, primary and secondary voltages, phases, self-protection, asset tag, registration ID).
Dedicated server: another PC with an RTX 5070 12 GB, Ubuntu 24.04, CUDA 12.8. The pipeline has two main stages and two auxiliary ones:
nameplate ─▶ YOLO Stage 0 ─▶ Stage 1 (vision) ─▶ Stage 2 (reasoning) ─▶ JSON
  photo        crop the       Qwen3.5-VLM          qwen2.5:14b            ▲
 (SMB or       nameplate      9B (llama.cpp)       (Ollama)               │
  base64          │            :8090                :11434                │
  upload)         │              │                     │                  │
                  ▼              ▼                     │                  │
            YOLO Stage 3 ◀─────── audit ◀──────────────┘                  │
            (revalidates detected classes)                                │
                  │                                                       │
                  OCR × records cross-check ──────────────────────────────┘
- Stage 0 (YOLO) detects the placa_transformador bounding box, crops with 10% padding, and passes only the nameplate region to the VLM at high resolution. Fallback: resize to 1600 px if YOLO fails (sketched below).
- When the detection comes back as placa_identificacao (the pole's ID plate) rather than a transformer nameplate, mark _needs_review=True for human review.

I tested seven vision models (glm-ocr, qwen2.5vl, Qianfan-OCR, Gemma 4 E4B, among others). Qwen3.5-VLM reads well, but it's not reliable for emitting structured JSON directly — it hallucinated fields, swapped values, dropped numbers. Splitting pure OCR from structured reasoning gave 11/11 fields correct on the initial validation set (Romagnole 112.5 kVA nameplates).
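Roughly what Stage 0's crop step amounts to, as a sketch (the real detection comes from the YOLO model; the box format here is an assumption):

```python
# Sketch of Stage 0's crop logic: pad the detected nameplate box by 10% and crop,
# or fall back to a 1600 px resize of the whole photo when YOLO finds nothing.
# The (x1, y1, x2, y2) box format is an assumption about the YOLO wrapper's output.
from PIL import Image

def crop_nameplate(img: Image.Image, box=None, pad_frac=0.10, fallback_max=1600) -> Image.Image:
    if box is None:
        img = img.copy()
        img.thumbnail((fallback_max, fallback_max))   # cap the longest side at 1600 px
        return img
    x1, y1, x2, y2 = box
    pad_x = (x2 - x1) * pad_frac
    pad_y = (y2 - y1) * pad_frac
    w, h = img.size
    return img.crop((
        int(max(0, x1 - pad_x)), int(max(0, y1 - pad_y)),
        int(min(w, x2 + pad_x)), int(min(h, y2 + pad_y)),
    ))
```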
The pipeline went from zero (v1.0) to v3.1 with 11/11 fields in a single session (April 6). Over 18 days, it evolved through v3.2 → v3.7 → v4.0 → v4.1-dev, ending with 176 automated tests and ≥85.3% field accuracy on an electrical grid feeder (34 poles compared against records).
A few implementation details carry most of the robustness:

- _classify_text and _filter_ocr_for_stage2 separate nameplate text from meter text, pole engravings, and telecom labels before Stage 2 ever sees them.
- The YOLO class distinction (cls4=placa_identificacao vs cls6=placa_transformador) propagates a pole_id_context that steers downstream filters. Detecting is easy; using detection to change pipeline behavior is what generates the gain.
- A threading.Semaphore(1) in FastAPI serializes the GPU-bound stages, so the vision and reasoning models never compete for the 12 GB card.
- Transformer banks: the number of placa_transformador entries for the pole in the database is checked. If ≥ 2 → bank confirmed, and the comparison becomes OCR_pwr × N vs records (±5%).

The REST API runs on FastAPI on port 8091 with a Gemini-style envelope (contents[].parts[].inline_data), so it's a drop-in replacement for the C# code that previously called Google. The application triggers it via two new schedule types in the existing scheduler.

There's also an OCR-only endpoint, POST /api/v41/ocr/poste/{id}/texto, for cases where only the raw text matters — used for pole data extraction (separated from transformer data, with its own filters).
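A sketch of what the FastAPI side looks like: a Gemini-style envelope coming in, and the Semaphore(1) keeping one GPU stage running at a time. The field names follow the envelope quoted above; the endpoint path and the two stage helpers are placeholders, not the project's actual code.

```python
# Sketch of the service on :8091: parse a Gemini-style envelope and run the two
# stages under a Semaphore(1) so vision and reasoning never fight over the GPU.
# Endpoint path, helper names, and return shape are placeholders.
import base64
import threading

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
gpu_lock = threading.Semaphore(1)

class InlineData(BaseModel):
    mime_type: str
    data: str                        # base64-encoded image

class Part(BaseModel):
    inline_data: InlineData | None = None

class Content(BaseModel):
    parts: list[Part]

class Envelope(BaseModel):
    contents: list[Content]

def stage1_ocr(images: list[bytes]) -> str:
    """Would send the crops to the vision model on :8090; stubbed here."""
    return ""

def stage2_reasoning(raw_text: str) -> dict:
    """Would ask the reasoning model on :11434 to fill the 11 fields; stubbed here."""
    return {}

@app.post("/api/v41/ocr/placa")      # placeholder path
def run_pipeline(req: Envelope) -> dict:
    images = [
        base64.b64decode(part.inline_data.data)
        for content in req.contents
        for part in content.parts
        if part.inline_data
    ]
    with gpu_lock:                    # only one GPU-bound stage at a time, across requests
        raw_text = stage1_ocr(images)
    with gpu_lock:
        fields = stage2_reasoning(raw_text)
    return fields
```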
In the cloud, context appears free (Claude has 1 M, Gemini has 2 M). Locally, context is VRAM. The equation is simple:
VRAM = model weights + KV cache
The KV cache grows linearly with context and batch size. Each combination of model × ctx × quantization × KV quant either fits on the card or it doesn't. In practice, this is how it played out across the three projects:
| Server | VRAM | Model | Quant | Ctx | KV | Note |
|---|---|---|---|---|---|---|
| Ancalagon | 16 GB | Qwen3-Coder 30B (MoE) | Q4_K_M | 96K | q4_0 | experts on CPU (-ncmoe 16) |
| Ancalagon | 16 GB | Qwen3.6-27B | TQ3_4S | 40K | q8_0 | 100% GPU, 96% util |
| Ancalagon | 16 GB | Gemma 4 26B (MoE) | Q4_K_M | 96K | q4_0 | -ncmoe 8 |
| OCR server | 12 GB | Qwen3.5-VLM + qwen2.5 | Q4 / Q5 | n/a | — | serialized via semaphore |
| Mac (apfel) | unified | Apple Intelligence | — | 4096 | — | chat-only, no tools |
For Claude Code (system prompt ~27K + tools), 4096 tokens doesn't work. But 96K with a 30B MoE model on 16 GB of consumer-grade VRAM works — and that was unthinkable two years ago.
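A back-of-the-envelope version of that equation (the architecture numbers below are illustrative, not the exact Qwen3-Coder config; q4_0 works out to roughly 4.5 bits per element once block scales are counted):

```python
# Rough KV-cache size: 2 (K and V) x layers x KV heads x head_dim x context x bytes/element.
# Architecture numbers are illustrative, NOT the real Qwen3-Coder config.
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int, ctx: int, bytes_per_elt: float) -> float:
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elt / 2**30

params = dict(n_layers=48, n_kv_heads=4, head_dim=128, ctx=96_000)
print(f"f16 cache:  {kv_cache_gib(**params, bytes_per_elt=2.0):.1f} GiB")     # ~8.8 GiB
print(f"q4_0 cache: {kv_cache_gib(**params, bytes_per_elt=0.5625):.1f} GiB")  # ~2.5 GiB
```

Same model, same context; the KV quantization alone can decide whether the budget survives on a 16 GB card.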
A few lessons that cost real debugging time:

- KV cache quantization in LM Studio: unless the setting is exactly {checked: true, value: "q4_0"}, the field is silently ignored and the cache stays in f16. Discovered by measuring VRAM before and after — nothing in the UI signals it.
- Mismatched KV quantization: K=q8 / V=q4 → catastrophic CUDA fallback on Qwen3-Coder (3 tok/s). Match them or don't use it.
- GPU stability on the OCR server: pcie_aspm=off, the semaphore in FastAPI, a GpuHangError propagated to the client, and a drop-in to stop concurrent processes before the main service comes up.
- gemini-1.5-pro went 404 overnight. No warning. Emergency migration to gemini-2.5-flash in production. A physical reminder of why having a local fallback is worth it.
- File.Exists() is silent in C#: for an unreachable SMB path, it returns false without an exception. Result: zero images sent to Gemini and no error message. Always check network paths explicitly.
- [JsonPropertyName] (System.Text.Json) is ignored by Newtonsoft.Json. Result: every field comes back as “NI” with no error. Always confirm which serializer actually performs the deserialization.

The main gain isn't cost savings (although those are real — zero marginal cost per token). It's iterating without fear.
When every call costs, you prune experiments before they start. Locally, you can run 4 A/B configurations on the same dataset (25 poles × 4 = 100 inferences) in an afternoon without thinking twice — which is exactly what I did to validate that the baseline Qwen3.5-VLM + qwen2.5:14b beats the alternatives (Qwen3-VL-8B, qwen3:14b with /no_think, qwen3:8b).
And when the model is yours, it doesn't disappear. Today's Qwen3-Coder Q4_K_M will be running exactly the same in 2030, if I want it to. That changes the time scale of projects: model dependencies stop being operational risk.
The trade-off is real: context is finite. For tasks that need over 100K tokens (a large refactor, a gigantic codebase), Claude in the cloud still wins. But for 90% of what I do day to day — focused features, iterative debugging, nameplate OCR, structured validation — the local stack handles it. And it does so without the next plan renewal and its tighter limits coming back as a source of friction.