2026-06-16

Local Models for Coding: What the Hacker News Thread Reveals

From a Claude/Claude Code perspective, this HN thread is interesting because it is not a vague “local models are cool” debate. It is a practical report from people actually trying to replace frontier models in daily coding workflows. The discussion covers harnesses, prompt caching, offline setups, and the unglamorous reality of keeping agentic coding stable.

Key Points

The original question asks whether anyone has fully replaced Claude/GPT with a local model for daily coding, not just side experiments.
One user says they have, using a containerized Pi coding harness, fully offline, with Qwen 3.6 35B on a Mac Studio or MacBook, and Qwen 3.5 122B for harder tasks.
That user says local models require much more precise prompting than Claude: they do less “thinking for you,” can fall into loops, and may choose the easiest implementation path rather than the best architecture.
Their comparison is blunt: Qwen 3.6 35B feels like a junior who needs guidance, while Claude Opus feels like a senior who thinks through architecture with you.
Another commenter describes a similar offline-ish setup on a Strix Halo 128 GiB laptop, with Pi in a container talking to llama.cpp in another container.
That commenter says they mostly use Qwen 3.6 35B-A3B for agentic coding, with Gemma 4 variants for chat, translation, and audio, and a few larger or faster models kept around for testing.
A major sub-thread focuses on caching and “re-processing context” in local setups.
Several commenters say the issue is often not the model itself but the harness, append-only message handling, or llama.cpp bugs.
Qwen 3.6’s preserve_thinking support comes up as an important fix for agentic workflows, because older models dropped reasoning traces between turns.
One commenter notes that preserving thinking can improve cache behavior and reduce repeated recomputation in long multi-turn sessions.
There’s also a small technical detour into attention mechanics (KV cache behavior, local attention) and how tokenization differences can force recomputation.
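
To make the caching points above concrete, here is a toy sketch of why append-only message handling matters. It assumes a simplified model in which a server can reuse cached work only for the longest token prefix that the new prompt shares with the previous one. The whitespace "tokenizer", the message format, and the function names are all hypothetical illustrations, not llama.cpp's or Pi's actual implementation.

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (role, content); a hypothetical minimal format


def render(messages: List[Message]) -> List[str]:
    """Toy stand-in for a chat template plus tokenizer: one role marker
    per message, then whitespace-separated words as 'tokens'."""
    tokens: List[str] = []
    for role, content in messages:
        tokens.append(f"<{role}>")
        tokens.extend(content.split())
    return tokens


def shared_prefix_len(cached: List[str], new: List[str]) -> int:
    """Number of leading tokens that are identical in both sequences."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


def tokens_to_recompute(cached: List[str], new: List[str]) -> int:
    """Everything after the first mismatch has to be processed again."""
    return len(new) - shared_prefix_len(cached, new)


history: List[Message] = [
    ("system", "You are a coding agent"),
    ("user", "Fix the failing test"),
    ("assistant", "<think> check the fixture first </think> Running pytest"),
]
cached = render(history)  # what the server processed on the previous turn

tool_result: Message = ("tool", "1 failed 3 passed")

# Case 1: the harness only appends, so the old prompt is an exact prefix.
append_only = render(history + [tool_result])

# Case 2: the harness drops the earlier reasoning trace before resending,
# which changes tokens in the middle of the already-processed prompt.
stripped = render(
    history[:2] + [("assistant", "Running pytest"), tool_result]
)

print(tokens_to_recompute(cached, append_only))  # only the new tool message
print(tokens_to_recompute(cached, stripped))     # everything after the edit
```

In this toy model the append-only prompt needs only the five new tool-message tokens processed, while the stripped prompt diverges inside the assistant turn and everything after that point is redone. The gap is small here because the history is tiny; in a long agentic session, an edit near the start of the conversation would invalidate most of the prompt. The same reasoning applies to the tokenization point: if re-tokenizing earlier text yields different tokens than those originally generated, the prefix no longer matches.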

My Take

What strikes me is how quickly the conversation moves from “can local models code?” to “can your harness survive real agentic usage?” That feels like the real story. For Claude Code users, this is a useful reminder that model quality is only half the system; the orchestration layer, message formatting, and cache behavior matter a lot once you’re doing tool use in a loop.

I think the most honest takeaway here is that local models are now good enough to be genuinely useful, but they’re still not as forgiving as Claude. The people in this thread who are happy with local setups are doing serious systems work: containers, sandboxing, model selection, quantization tradeoffs, and prompt discipline. That’s exciting if you care about privacy, offline use, or control. It’s less exciting if you just want a coding assistant that “mostly works” without babysitting.

What also stands out is how often the thread turns into debugging the stack rather than discussing the model. That’s not a knock on local LLMs; it’s just reality. If you want the freedom of running your own models, you inherit the whole reliability surface area. Claude Code feels appealing precisely because a lot of that complexity is hidden behind a polished product.

If I were using Claude Code today, I’d still keep local models around for privacy-sensitive or offline tasks, and for testing workflows where I don’t want to burn frontier-model calls. But I’d be cautious about treating local as a full replacement unless I was ready to tune the harness and accept a more hands-on experience.

The big takeaway: local coding models are becoming credible, but the thread makes clear that “fully replacing Claude” is still a very different proposition from “having a workable offline assistant.”