Published on April 28, 2026
I use AI assistants every day for coding — Claude Code, GPT, the Gemini API. Throughout the first half of 2026, I kept accumulating three frustrations with cloud APIs.
Token pressure. Paid plans have been tightening their available limits with each renewal cycle. Every serious project eventually hit the ceiling and required either waiting for the reset or buying additional packages. Rising cost, decreasing control.
Models disappear. In an electrical transformer auditing application, gemini-1.5-pro was discontinued overnight. gemini-2.0-flash became blocked for new users. The only option left was migrating to gemini-2.5-flash while it lasts. When the pipeline depends on a cloud-hosted model, it has an expiration date.
Sensitive data leaving the house. Every photo of a transformer nameplate sent to Google's API is client data leaving the network.
The answer, in three projects built throughout the first half of 2026, was to bring the stack home: a dedicated LLM server, a client pointing at it instead of the Anthropic API, and a real application proving it works in production.
There is an honest trade-off: context costs memory. The model+context combo must fit within the available VRAM, and that equation constrains choices. But the gain — control over what runs, what leaves the machine, and how much it costs per token (zero) — makes up for it.
The three projects: ancalagon-llm (the server setup with an RTX 4070 Ti SUPER and llama.cpp tuned for MoE), local-claude (a wrapper that makes Claude Code use local models), and ocr-pipeline-local (a vision pipeline replacing the Gemini API).
[Stats strip from the original post: Qwen3.6-27B local, 100% on GPU · speculative-decoding speedup (RTX 4070 Ti SUPER) · Qwen3-Coder 30B (tuned MoE) · model fitting in 16 GB of VRAM (TQ3_4S) · automated tests in the OCR pipeline · field accuracy (real feeder) · inference backends supported · external API calls in production]
The first piece of the stack is the server. The hardware is a dual-boot PC (Windows + Ubuntu Server 24.04) with a Ryzen 7600X, 32 GB of RAM, and an RTX 4070 Ti SUPER 16 GB. The Mac connects via Tailscale at 100.64.0.10.
LM Studio was convenient, but it was leaving roughly half of the achievable tokens per second on the table. During inference, GPU utilization sat at 30–34%, drawing about 70 W out of a 285 W TGP. Two bottlenecks were identified:
- MoE expert placement. The efficient split for a MoE model is to keep only the expert tensors on the CPU and everything else on the GPU; the --n-cpu-moe flag in llama.cpp does exactly that. LM Studio doesn't expose it.
- A dense model that doesn't fit. Qwen3.6-27B in Q4_K_M doesn't fit entirely in 16 GB; requantized as TQ3_4S (the turbo-tan/llama.cpp-tq3 fork), it drops to 13 GB and fits 100% on GPU.

| Configuration | GPU util | Power | tok/s gen |
|---|---|---|---|
| LM Studio — qwen3-coder Q4_K_M | 30% | 67 W | 64.5 |
| llama.cpp upstream — coder -ncmoe 10 | 36% | 101 W | 81.5 |
| LM Studio — Qwen3.6-27B Q4_K_M | 34% | 94 W | 13.7 |
| llama.cpp TQ3 fork — Qwen3.6-27B-TQ3_4S | 96% | 292 W | 36.8 |
The server exposes three models via systemd, each in a service with Conflicts= declared — only one is up at a time, and systemd stops the previous one automatically:

- llama-coder.service: Qwen3-Coder 30B Q4_K_M (--n-cpu-moe 16, ctx 96K)
- llama-qwen36.service: Qwen3.6-27B TQ3_4S (100% on GPU, ctx 40K)
- llama-gemma4.service: Gemma 4 26B Q4_K_M (--n-cpu-moe 8, ctx 96K)

All on the same port 1234 (the same as LM Studio) — existing clients don't need to change URLs. An lmswitch wrapper alternates between services with health polling. SSH aliases on the Mac (llcoder, llq36, llgemma4, lloff) make remote control trivial.
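For illustration, here is roughly what an lmswitch-style switch has to do. This is a sketch, not the real wrapper (which is a shell script): the polling endpoint, timeout, and SSH invocation are assumptions; the service names are the ones in the diagram below.

```python
# Hypothetical sketch of an lmswitch-style model switch (the real wrapper is bash).
# Assumes SSH access to the server and the systemd user units named in the diagram.
import subprocess
import time
import urllib.request

HOST = "100.64.0.10"
SERVICES = {
    "coder": "llama-coder.service",
    "qwen36": "llama-qwen36.service",
    "gemma4": "llama-gemma4.service",
}

def switch(name: str, timeout: float = 180.0) -> None:
    # Starting one unit stops whichever was running, thanks to Conflicts=
    subprocess.run(
        ["ssh", HOST, "systemctl", "--user", "start", SERVICES[name]],
        check=True,
    )
    # Health polling: wait until the OpenAI-compatible API answers again
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            urllib.request.urlopen(f"http://{HOST}:1234/v1/models", timeout=2)
            return
        except OSError:
            time.sleep(2)
    raise TimeoutError(f"{SERVICES[name]} did not answer on :1234 within {timeout}s")

if __name__ == "__main__":
    switch("coder")
```

The Conflicts= declaration is what makes the switch a single command: systemd tears down the previous model before the new one comes up, so the client only has to wait for the port to answer again.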
Glaurung (Mac) Ancalagon (Ubuntu)
100.64.0.10 / :1234
aliases: systemd --user:
llcoder ──ssh──> llama-coder.service ──┐
llq36 ──ssh──> llama-qwen36.service ─┤ Conflicts=
llgemma4 ──ssh──> llama-gemma4.service ─┘ (only one up)
lloff ──ssh──> │
▼
:1234 (OpenAI-compat API)
▲
curl http://100.64.0.10:1234 ────────────┘
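Any OpenAI-compatible client can hit that endpoint. A minimal sketch using only the standard library (the model name is a placeholder; llama-server answers with whichever model the active service loaded):

```python
# Minimal chat call against the OpenAI-compatible endpoint on Ancalagon.
# The "model" field is a placeholder: the server uses whatever is currently loaded.
import json
import urllib.request

payload = {
    "model": "local",
    "messages": [{"role": "user", "content": "Summarize what --n-cpu-moe does in one sentence."}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://100.64.0.10:1234/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)
print(reply["choices"][0]["message"]["content"])
```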
Claude Code is an excellent coding agent, but it only talks to the Anthropic API. I wanted the agent, not the vendor lock-in. local-claude is a bash wrapper that injects environment variables (ANTHROPIC_BASE_URL, CLAUDE_CONFIG_DIR=~/.claude-local to isolate config) and routes traffic to a local or remote OpenAI-compatible server. The original claude stays untouched.
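The actual wrapper is a few lines of bash; here is a Python sketch of the same idea (the environment variable names are the ones above, everything else is an assumption about how you'd invoke it):

```python
# What local-claude boils down to: launch the stock `claude` binary with a
# redirected API base URL and an isolated config directory. Sketch only; the
# real wrapper is bash and also handles backend selection and model startup.
import os
import subprocess
import sys

env = os.environ.copy()
env["ANTHROPIC_BASE_URL"] = "http://100.64.0.10:1234"             # local OpenAI-compatible server
env["CLAUDE_CONFIG_DIR"] = os.path.expanduser("~/.claude-local")  # keep the original config untouched

subprocess.run(["claude", *sys.argv[1:]], env=env)
```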
It supports multiple backends, among them:

- local: starts llama-server automatically
- remote: connects to an already-running server (e.g., the Ancalagon service)

When a smaller model from the same family exists in the directory, the script enables speculative decoding: a small “draft” model generates candidate tokens that the large model verifies in batch. Accepted tokens are free; rejected ones are regenerated normally.
| Platform (Qwen2.5-7B Q8_0 + 0.5B Q8_0 draft) | No draft | With draft | Speedup |
|---|---|---|---|
| Apple M4 Pro (24 GB) | 29 t/s | 57 t/s | ~2× |
| RTX 4070 Ti SUPER (16 GB) | 29 t/s | 177 t/s | ~6× |
Insight: the smallest draft wins. The 3B draft is slower than the 1.5B despite a higher acceptance rate — verification overhead dominates.
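To make the mechanism concrete, here is a toy sketch of the accept/verify logic (greedy variant, with stand-in "models"). Note the caveat in the comments: in llama.cpp the target scores all draft positions in one batched forward pass, which is where the speedup actually comes from.

```python
# Toy sketch of speculative decoding's accept/verify logic (greedy variant).
# In a real implementation the target model verifies all k draft positions in
# ONE batched forward pass; calling it token by token, as below, would not be faster.
from typing import Callable, List

NextToken = Callable[[List[int]], int]   # greedy "next token" for a toy model

def speculative_step(prefix: List[int], draft: NextToken, target: NextToken, k: int = 4) -> List[int]:
    # 1) the small draft model proposes k tokens cheaply
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    # 2) the large target model verifies them; keep the longest agreeing prefix
    accepted, ctx = [], list(prefix)
    for t in proposed:
        expected = target(ctx)
        if expected != t:
            accepted.append(expected)    # first disagreement: take the target's token, stop
            return accepted
        accepted.append(t)
        ctx.append(t)
    accepted.append(target(ctx))         # all k accepted: one bonus token for free
    return accepted

# toy check: a "target" that counts up by 1, a "draft" that agrees until token 3
target = lambda ctx: (ctx[-1] + 1) if ctx else 0
draft = lambda ctx: (ctx[-1] + 1) if (not ctx or ctx[-1] < 3) else 99
print(speculative_step([0, 1], draft, target, k=4))   # [2, 3, 4]: two accepted, then the target's own token
```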
There is also an Apple Intelligence backend, but its context window is 4096 tokens. Claude Code's system prompt and tools alone already total around 27K tokens — seven times more than fits. That backend therefore runs in chat-only mode (--bare --tools ""): you can have a conversation, but the agent cannot use tools (edit files, run commands). It's a physical reminder that context = memory.
On the Mac, the srl-coder alias connects directly to whichever service is up (specstory run claude -c "local-claude --backend remote --port 1234"). Zero process management on the client: the Ancalagon service is the infrastructure, local-claude is just the bridge.
The first two pieces are development tools. The third is a concrete application: a vision pipeline replacing the Google Gemini API in an electrical transformer auditing system. Field photos of equipment nameplates go in, structured JSON with 11 fields comes out (rated power, manufacturer, serial number, manufacturing date, fuse link, primary and secondary voltages, phases, self-protection, asset tag, registration ID).
Dedicated server: another PC with an RTX 5070 12 GB, Ubuntu 24.04, CUDA 12.8. The pipeline has two main stages and two auxiliary ones:
nameplate ─▶ YOLO Stage 0 ─▶ Stage 1 (vision) ─▶ Stage 2 (reasoning) ─▶ JSON
  photo        crop the       Qwen3.5-VLM          qwen2.5:14b            ▲
 (SMB or       nameplate      9B (llama.cpp)       (Ollama)               │
  base64          │            :8090                :11434                │
  upload)         │              │                     │                  │
                  ▼              ▼                     │                  │
            YOLO Stage 3 ◀─────── audit ◀──────────────┘                  │
            (revalidates detected classes)                                │
                  │                                                       │
                  OCR × records cross-check ──────────────────────────────┘
- Stage 0 (YOLO) detects the placa_transformador bounding box, crops with 10% padding, and passes only the nameplate region to the VLM at high resolution. Fallback: resize to 1600 px if YOLO fails (sketched below).
- When the detection comes back as placa_identificacao (the pole's ID plate) rather than a transformer nameplate, mark _needs_review=True for human review.

I tested seven vision models (glm-ocr, qwen2.5vl, Qianfan-OCR, Gemma 4 E4B, among others). Qwen3.5-VLM reads well, but it's not reliable for emitting structured JSON directly — it hallucinated fields, swapped values, dropped numbers. Splitting pure OCR from structured reasoning gave 11/11 fields correct on the initial validation set (Romagnole 112.5 kVA nameplates).
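Roughly what Stage 0's crop step amounts to, as a sketch (the real detection comes from the YOLO model; the box format here is an assumption):

```python
# Sketch of Stage 0's crop logic: pad the detected nameplate box by 10% and crop,
# or fall back to a 1600 px resize of the whole photo when YOLO finds nothing.
# The (x1, y1, x2, y2) box format is an assumption about the YOLO wrapper's output.
from PIL import Image

def crop_nameplate(img: Image.Image, box=None, pad_frac=0.10, fallback_max=1600) -> Image.Image:
    if box is None:
        img = img.copy()
        img.thumbnail((fallback_max, fallback_max))   # cap the longest side at 1600 px
        return img
    x1, y1, x2, y2 = box
    pad_x = (x2 - x1) * pad_frac
    pad_y = (y2 - y1) * pad_frac
    w, h = img.size
    return img.crop((
        int(max(0, x1 - pad_x)), int(max(0, y1 - pad_y)),
        int(min(w, x2 + pad_x)), int(min(h, y2 + pad_y)),
    ))
```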
The pipeline went from zero (v1.0) to v3.1 with 11/11 fields in a single session (April 6). Over 18 days, it evolved through v3.2 → v3.7 → v4.0 → v4.1-dev, ending with 176 automated tests and ≥85.3% field accuracy on an electrical grid feeder (34 poles compared against records).
A few implementation details carry most of the robustness:

- _classify_text and _filter_ocr_for_stage2 separate nameplate text from meter text, pole engravings, and telecom labels before Stage 2 ever sees them.
- The YOLO class distinction (cls4=placa_identificacao vs cls6=placa_transformador) propagates a pole_id_context that steers downstream filters. Detecting is easy; using detection to change pipeline behavior is what generates the gain.
- A threading.Semaphore(1) in FastAPI serializes the GPU-bound stages, so the vision and reasoning models never compete for the 12 GB card.
- Transformer banks: the number of placa_transformador entries for the pole in the database is checked. If ≥ 2 → bank confirmed, and the comparison becomes OCR_pwr × N vs records (±5%).

The REST API runs on FastAPI on port 8091 with a Gemini-style envelope (contents[].parts[].inline_data), so it's a drop-in replacement for the C# code that previously called Google. The application triggers it via two new schedule types in the existing scheduler.

There's also an OCR-only endpoint, POST /api/v41/ocr/poste/{id}/texto, for cases where only the raw text matters — used for pole data extraction (separated from transformer data, with its own filters).
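A sketch of what the FastAPI side looks like: a Gemini-style envelope coming in, and the Semaphore(1) keeping one GPU stage running at a time. The field names follow the envelope quoted above; the endpoint path and the two stage helpers are placeholders, not the project's actual code.

```python
# Sketch of the service on :8091: parse a Gemini-style envelope and run the two
# stages under a Semaphore(1) so vision and reasoning never fight over the GPU.
# Endpoint path, helper names, and return shape are placeholders.
import base64
import threading

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
gpu_lock = threading.Semaphore(1)

class InlineData(BaseModel):
    mime_type: str
    data: str                        # base64-encoded image

class Part(BaseModel):
    inline_data: InlineData | None = None

class Content(BaseModel):
    parts: list[Part]

class Envelope(BaseModel):
    contents: list[Content]

def stage1_ocr(images: list[bytes]) -> str:
    """Would send the crops to the vision model on :8090; stubbed here."""
    return ""

def stage2_reasoning(raw_text: str) -> dict:
    """Would ask the reasoning model on :11434 to fill the 11 fields; stubbed here."""
    return {}

@app.post("/api/v41/ocr/placa")      # placeholder path
def run_pipeline(req: Envelope) -> dict:
    images = [
        base64.b64decode(part.inline_data.data)
        for content in req.contents
        for part in content.parts
        if part.inline_data
    ]
    with gpu_lock:                    # only one GPU-bound stage at a time, across requests
        raw_text = stage1_ocr(images)
    with gpu_lock:
        fields = stage2_reasoning(raw_text)
    return fields
```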
In the cloud, context appears free (Claude has 1 M, Gemini has 2 M). Locally, context is VRAM. The equation is simple:
VRAM = model weights + KV cache
The KV cache grows linearly with context and batch size. Each combination of model × ctx × quantization × KV quant either fits on the card or it doesn't. In practice, this is how it played out across the three projects:
| Server | VRAM | Model | Quant | Ctx | KV | Note |
|---|---|---|---|---|---|---|
| Ancalagon | 16 GB | Qwen3-Coder 30B (MoE) | Q4_K_M | 96K | q4_0 | experts on CPU (-ncmoe 16) |
| Ancalagon | 16 GB | Qwen3.6-27B | TQ3_4S | 40K | q8_0 | 100% GPU, 96% util |
| Ancalagon | 16 GB | Gemma 4 26B (MoE) | Q4_K_M | 96K | q4_0 | -ncmoe 8 |
| OCR server | 12 GB | Qwen3.5-VLM + qwen2.5 | Q4 / Q5 | n/a | — | serialized via semaphore |
| Mac (apfel) | unified | Apple Intelligence | — | 4096 | — | chat-only, no tools |
For Claude Code (system prompt ~27K + tools), 4096 tokens doesn't work. But 96K with a 30B MoE model on 16 GB of consumer-grade VRAM works — and that was unthinkable two years ago.
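A back-of-the-envelope version of that equation (the architecture numbers below are illustrative, not the exact Qwen3-Coder config; q4_0 works out to roughly 4.5 bits per element once block scales are counted):

```python
# Rough KV-cache size: 2 (K and V) x layers x KV heads x head_dim x context x bytes/element.
# Architecture numbers are illustrative, NOT the real Qwen3-Coder config.
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int, ctx: int, bytes_per_elt: float) -> float:
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elt / 2**30

params = dict(n_layers=48, n_kv_heads=4, head_dim=128, ctx=96_000)
print(f"f16 cache:  {kv_cache_gib(**params, bytes_per_elt=2.0):.1f} GiB")     # ~8.8 GiB
print(f"q4_0 cache: {kv_cache_gib(**params, bytes_per_elt=0.5625):.1f} GiB")  # ~2.5 GiB
```

Same model, same context; the KV quantization alone can decide whether the budget survives on a 16 GB card.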
A few lessons that cost real debugging time:

- KV cache quantization in LM Studio: unless the setting is exactly {checked: true, value: "q4_0"}, the field is silently ignored and the cache stays in f16. Discovered by measuring VRAM before and after — nothing in the UI signals it.
- Mismatched KV quantization: K=q8 / V=q4 → catastrophic CUDA fallback on Qwen3-Coder (3 tok/s). Match them or don't use it.
- GPU stability on the OCR server: pcie_aspm=off, the semaphore in FastAPI, a GpuHangError propagated to the client, and a drop-in to stop concurrent processes before the main service comes up.
- gemini-1.5-pro went 404 overnight. No warning. Emergency migration to gemini-2.5-flash in production. A physical reminder of why having a local fallback is worth it.
- File.Exists() is silent in C#: for an unreachable SMB path, it returns false without an exception. Result: zero images sent to Gemini and no error message. Always check network paths explicitly.
- [JsonPropertyName] (System.Text.Json) is ignored by Newtonsoft.Json. Result: every field comes back as “NI” with no error. Always confirm which serializer actually performs the deserialization.

The main gain isn't cost savings (although those are real — zero marginal cost per token). It's iterating without fear.
When every call costs, you prune experiments before they start. Locally, you can run 4 A/B configurations on the same dataset (25 poles × 4 = 100 inferences) in an afternoon without thinking twice — which is exactly what I did to validate that the baseline Qwen3.5-VLM + qwen2.5:14b beats the alternatives (Qwen3-VL-8B, qwen3:14b with /no_think, qwen3:8b).
And when the model is yours, it doesn't disappear. Today's Qwen3-Coder Q4_K_M will be running exactly the same in 2030, if I want it to. That changes the time scale of projects: model dependencies stop being operational risk.
The trade-off is real: context is finite. For tasks that need over 100K tokens (a large refactor, a gigantic codebase), Claude in the cloud still wins. But for 90% of what I do day to day — focused features, iterative debugging, nameplate OCR, structured validation — the local stack handles it. And it does so without the next plan renewal and its tighter limits coming back as a source of friction.