
Smaller open LLMs now work for open agents.

(No AI was involved in this attempt at ordering my thoughts and making my work legible to others, hopefully including you.)

I am replacing most, if not all, of my Claude Code workflows with pi.dev, an open source coding agent, and local LLMs running on my laptop. If you don’t have the hardware, smaller models are also cheap(er) to run on hosted services like Open Router. As prices of frontier models continue to rise, and subscription plans are watered down, the capability and speed of open weight and open source models continue to increase. The last month saw a shift, with a couple of releases from last week (Qwen 3.6, and several inference servers implementing performance improvements: more speed, less memory) marking a step change in user experience.

I wrote “adjusting” in the description, but reeling is maybe a more accurate description of what is happening. I wrote this more general note first, as I plan to write some how-to’s to make decoupling (coding) agents from inference hosting approachable for more people. We’re all still figuring this out, but once you are up and running, you can work with your local agent to improve your tools and way of working in small steps. I started using open weight models and open source coding tools in 2024, and have kept doing that on and off alongside Claude Code. The second half of 2025 saw some strong open agents (most notably pi.dev and OpenCode), and paired with strong, affordable models I can now do serious development work with an agent from the comfort of my own laptop.

Large Language Models are making several kinds of knowledge and ways of working more accessible, and shortening some feedback loops. I am not keen on depending on frontier labs who have no customer service whatsoever, and are operating on business models that are not financially sustainable. A friend of mine set up an account to use Claude Code; it got shut down without explanation. This seems to be very common at the moment. From the sound of it, the customer service consists of a form, and no response.

Claude Code was?

Before I fell into a Claude Code subscription, I used Open Router with a local coding agent (Aider), and spent maybe $25 in half a year. Admittedly, it was occasional use and smaller bits of work. Open Router serves both open weight and frontier lab models, and pricing is transparent, so you can experiment. Enterprise AI use is likely to also become more price sensitive: it is easy to burn through a year’s worth of AI expense in a quarter, as some are now finding out.

I have been using Claude Code for about a year. I noticed last week I started talking about it in the past tense. I hesitated writing a clickbait title: “Claude Code was?”. Late November marked a step change in how well Claude Code worked. The release of the Opus 4.5 model, combined with Anthropic’s long-running agents paper, meant that I could brainstorm an idea for a fairly complicated web app iteratively, and then fairly easily and reliably build it, giving me time to do exploratory testing and focus on user experience.

That also gave me some anxiety. What if my account got pulled for whatever reason? It was clear from the start of the monthly subscriptions (to me and many others, though apparently not everyone) that this was not sustainable, and that at some point they would have to raise prices. Anthropic is now putting more and more of their offering behind per-token pricing. And their models are expensive.

If you have to prompt precisely, you might as well use a smaller model

At the same time, Opus 4.7 requires more explicit prompting, probably to compete with OpenAI’s models. “Creativity” is nice when you are doing an architectural spike, but less so when you try to do some precise work in a larger codebase. Small open weight models also required (at least until last week) more precise prompting. Why would you pay for an expensive frontier model plus a closed source harness when you have to be precise in your prompts anyway? You might as well do the same with a local model.

I ran a small experiment last month with models a lot smaller than what I usually use (4B and 9B parameters) and ran an auto-improvement loop on the prompt. If you know what the outcome should be, a larger model can iteratively generate a more detailed prompt that works for smaller models. Since everything runs on my own laptops, and I have cheap green energy at night, I can use this to become more independent of frontier labs, and help others. Note that I only did that once, because the larger models now need much less RAM than they did a month ago. So I can keep one running in the background, and whenever I have a question or something to build I will just prompt it from my writing environment (editor) or my coding agent.
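
To make the loop concrete: here is a minimal sketch of what such an auto-improvement loop can look like, assuming both models sit behind OpenAI-compatible endpoints (llama.cpp, LM Studio and Ollama all provide one). The ports, model names, task and scoring check are placeholders, not my actual setup.

```python
from openai import OpenAI

# Two OpenAI-compatible endpoints: a larger "teacher" model and the small
# model you actually want to run day to day. Ports and names are placeholders.
big = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
small = OpenAI(base_url="http://localhost:8081/v1", api_key="none")

def ask(client: OpenAI, model: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

TASK_INPUT = open("CHANGELOG.md").read()   # whatever input you are testing on
EXPECTED = "breaking change"               # a marker a correct answer must contain
prompt = "Summarise the changelog below in three bullet points."

for _ in range(5):
    output = ask(small, "small-model", prompt + "\n\n" + TASK_INPUT)
    if EXPECTED in output:   # crude check: you already know the outcome
        break                # the small model now handles the task; keep this prompt
    # Ask the larger model to make the prompt more detailed and explicit,
    # based on where the small model went wrong.
    prompt = ask(
        big, "big-model",
        f"A small model was given this prompt:\n{prompt}\n\n"
        f"It produced:\n{output}\n\n"
        f"A correct answer should mention: {EXPECTED}\n"
        "Rewrite the prompt to be more detailed and explicit, so that a "
        "small model will succeed. Return only the new prompt.",
    )

print(prompt)  # the improved prompt, reusable from now on
```

The scoring check is the load-bearing part: the loop only works because you already know what a good outcome looks like.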

Rough Timeline

Open weight models have gradually been replacing frontier models in my work. In late 2024, some of my Stack Overflow-style questions could be answered by local models. Last year I bought a refurbished MacBook (I try to keep the embedded carbon down). Then in the second half of last year, more and more code in mainstream languages started to work, and small things in niche programming languages. In January I could iterate on event storms locally, and get structured output out. In March I could do code analysis on a 500 kloc legacy C# codebase without much explicit prompting. Last week I managed to develop extensions for a local coding agent. This week I went all in and ported my favourite “skills” over to it. Some of that porting involved Claude Code, but quite a lot of it was done using the coding agent itself and local models.

Why now?

What changed last week was that some open weight models now run faster on my laptop than Claude Code does, while the quality gap keeps shrinking. Speed has a quality all of its own. The “Time To Next Response” doesn’t really matter when you run a job overnight. It does matter when you try to create the job, or iterate on it. I like fast feedback, being able to steer my work quickly. I think a lot of the “I run ten agent sessions in parallel” is just a side effect of inference being throttled; massive models make that more or less unavoidable because of their hardware footprint.

I hoped last year that open weight and open source models would improve. The dynamics are there: small models are cheaper to train and cheaper to run (as are larger models composed of many small models; Mixture of Experts models are now common and getting better). Cheaper, and often also faster. You can see this if you go to Open Router and look at the TPS (tokens per second) for hosted Mixture of Experts models versus ‘dense’ models (compare a Qwen 27B with a 35B Mixture of Experts model). Cheaper and faster means more organizations and individuals can participate. Some open weight model releases were not that great. But that doesn’t matter as long as a few of them are. The community figures out what models work well for what use cases, and which inference servers work well on what hardware.
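
If you want that comparison on your own hardware rather than Open Router’s listings, a rough TPS measurement against any OpenAI-compatible endpoint looks something like this sketch (the port and model name are placeholders):

```python
import time
from openai import OpenAI

# Point this at whichever server you are comparing.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

start = time.monotonic()
response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Write a 300-word summary of TCP."}],
)
elapsed = time.monotonic() - start

# Most OpenAI-compatible servers report token usage in the response.
tokens = response.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.1f}s = {tokens / elapsed:.1f} TPS")
```

Run the same prompt against a dense model and a Mixture of Experts model of similar size, and the difference shows up quickly.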

Where to start?

You can try models in the browser on Open Router and Hugging Face, or install “Google Edge” or another app on your phone: that will run a small model right on your phone, so you can see what they are like. Then you can hook up an API (say Open Router, but it can also be a frontier model from Anthropic, Google or OpenAI) to a local coding agent. OpenCode is easy to set up out of the box, and comes with a free cloud model, so you are good to go. I didn’t like that, because it defaults to that free cloud model if you make a typo in the name of the local, private model you want. I use pi.dev at the moment, which is more minimal and “fail fast”. I’ll write some more “how-to” posts, because people ask me what my setup looks like.
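
The decoupling itself is mostly a matter of base URLs: hosted services and local inference servers speak the same OpenAI-compatible API, so the same client code works against both. A minimal sketch (the model slug and localhost port are assumptions, substitute your own):

```python
from openai import OpenAI

# Hosted: Open Router, pay per token. The model slug is an example;
# check https://openrouter.ai/models for current names and prices.
hosted = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-...")

# Local: e.g. a llama.cpp-style server on your own laptop, no key needed.
local = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

for client, model in [(hosted, "qwen/qwen3-32b"), (local, "local-model")]:
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Explain tail recursion briefly."}],
    )
    print(reply.choices[0].message.content)
```

This is also what lets coding agents switch between hosted and local models: the agent is just another client of such an endpoint.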

I had to get the “where I’m at” post out first, apparently. Where are you at?