2026 · Novus Stream Solutions (hub) · About 13 min read
How we standardized on Claude Code to build our apps
After trying GitHub Copilot, Cursor, and the major chat assistants, we ended up using Claude Code as our one coding tool and Claude as our one model. An honest account of how we got there — and why we still rate every one of them.
Overview
We build a portfolio of free web apps as a very small operation, and AI tooling is a big part of how we move fast enough to do that. Over the last while we tried most of what is out there — GitHub Copilot, Cursor, and the major chat assistants — and we have landed somewhere specific: Claude Code as our single coding tool, and Claude as our single model. This is an honest account of how we got there.
Up front, the honest framing: we have only good things to say about modern LLMs. They are all genuinely impressive, and the differences between them are smaller than the online discourse suggests. This is not a story about a bad tool versus a good one. It is a story about which one fit how we work — and for a business trying to move things forward, that turned out to matter a lot.
The path: Copilot, then Cursor, then Claude Code
We started, like a lot of people, with GitHub Copilot — autocomplete-style suggestions in the editor. It was a real productivity boost and a great on-ramp. From there we moved to Cursor, which wrapped the model in a more capable editor and let us work in larger chunks. Both were good tools that we got real value from, and both are easy to recommend depending on how you like to work.
What changed our setup was Claude Code. The shift was from "AI that completes the line I am typing" to "AI that can take a described task and carry it through" — reading the codebase, making coordinated edits across files, running the build, and reporting back. That agentic flow matched how we actually think about work — in tasks and outcomes, not lines — and over time it became the tool we reached for by default. We consolidated onto it rather than splitting attention across several.
The model decision: it came down to ChatGPT vs Claude
Choosing a coding tool is partly choosing a model, so we did our own research across the major platforms instead of going by reputation. For our work it came down to two: ChatGPT and Claude. Both are excellent, and for plenty of tasks either would have been a fine answer — this was not a landslide on raw capability.
Claude won for us, and to be honest about it, the deciding factor was Claude Code itself. The agentic coding experience built around the model was the thing we wanted, and it tied the model choice to the tool choice. We have not tested Codex, so we will not pretend to a verdict there. For how we build, though, we think Claude is the best fit, and that is the bet we made.
They are similar — but not identical
One thing our own testing made concrete: these models give similar, but slightly different, answers. Ask the same question of two strong models and you will usually get information that overlaps heavily but diverges in the details, the framing, and occasionally the accuracy. The differences are small most of the time and meaningful some of the time.
A simple example from our experience: we put the same questions to Gemini Pro and to Claude Sonnet. The information they returned was broadly similar; both are very capable. But we found that the more powerful Claude models were more willing and able to catch errors carried over from outdated information, and their answers felt a bit more current. That kind of "notices when the old answer is wrong" behavior is exactly what you want when you are relying on a model to build real software, and it nudged us toward Claude over time.
Why an agentic flow fits a business
Here is the part that decided it for us as a business rather than a hobby. The big chat assistants — ChatGPT and Gemini among them — are phenomenal generalists. They are brilliant for thinking through a problem, drafting, explaining, exploring. We use that kind of capability all the time, and we would not be without it. They are all great, and they are great for different things.
But moving a business forward is less about generating an answer and more about getting work done across a real codebase: implement the feature, fix the regression, ship it. Claude Code's agentic flow is built around that — it does not just tell you what to do, it does it, with you reviewing. For a small team trying to keep a portfolio of apps moving, a tool that completes tasks end-to-end is a different kind of leverage than a tool that produces great text. That orientation toward doing the work is why it became our standard.
Why a tiny operation needs this kind of leverage
The context for all of this is that we are a very small operation building and maintaining a portfolio of free web apps, which means the constraint we are always working against is how much one or a few people can actually move forward. A small team cannot brute-force its way through a large amount of engineering work the way a staffed company can; the only path is leverage — getting more done per person — and AI tooling is one of the largest sources of that leverage available right now. This is why the choice of tools is not a casual preference for us but a decision that materially affects how much we can ship.
Seen that way, the standardization is really about maximizing leverage rather than about tool tribalism. A tool that completes tasks end to end, across a real codebase, gives a small team the kind of throughput that would otherwise require more hands, which is precisely the multiplier a tiny operation needs to maintain a portfolio of apps. The stakes of the tooling choice are higher for us than they would be for a hobbyist precisely because the operation depends on the leverage to exist at all. We are not optimizing a workflow at the margins; we are choosing the foundation that determines whether a small team can keep a real software operation moving, which is why we put genuine research into it.
Test on your own work, not on benchmarks
The single most useful piece of advice we can offer from this process is to evaluate these tools on your own actual work rather than on benchmarks, leaderboards, or online hot takes. The differences between the strong models are subtle and often depend heavily on the specific kind of work you do, which means a benchmark that measures one thing may tell you little about how a tool performs on your codebase, in your domain, with your patterns. We learned far more from putting the same real tasks to different tools and comparing the results than we ever could have from published comparisons.
This matters because the discourse around these tools is loud, opinionated, and frequently disconnected from the experience of using them on real projects. Benchmarks capture narrow, standardized tasks; hot takes capture whoever is most vocal; neither captures whether a tool fits how you actually build. By testing on our own work, we got an answer grounded in our reality rather than someone else's, and that answer — Claude and Claude Code for our particular needs — might be different for a team with different work. The general lesson is to trust your own evaluation over the noise: run the tools on the tasks you actually face, and standardize on the one that genuinely fits, because that fit is what determines the leverage you get, not the leaderboard position.
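One way to keep that kind of comparison grounded is to write the outcomes down. The following is a minimal sketch in Python, not a description of any specific tooling: `Trial`, `acceptance_rates`, and `tasks_where_tools_disagree` are hypothetical names, the tool labels and tasks are placeholders, and the `accepted` flag assumes a human reviewer judged whether each result was shippable.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One real task from your own backlog, attempted with one tool."""
    tool: str       # placeholder label, e.g. "tool_a"
    task: str       # a task you actually face, in your own words
    accepted: bool  # assumption: a human reviewer judged the result shippable


def acceptance_rates(trials: list[Trial]) -> dict[str, float]:
    """Return, per tool, the fraction of attempted tasks a reviewer accepted."""
    attempted: dict[str, int] = defaultdict(int)
    accepted: dict[str, int] = defaultdict(int)
    for t in trials:
        attempted[t.tool] += 1
        if t.accepted:
            accepted[t.tool] += 1
    return {tool: accepted[tool] / attempted[tool] for tool in attempted}


def tasks_where_tools_disagree(trials: list[Trial]) -> list[str]:
    """List tasks where at least one tool was accepted and at least one was not."""
    outcomes: dict[str, set[bool]] = defaultdict(set)
    for t in trials:
        outcomes[t.task].add(t.accepted)
    return sorted(task for task, seen in outcomes.items() if len(seen) == 2)


if __name__ == "__main__":
    # Placeholder entries for illustration only; these are not real results.
    sample = [
        Trial("tool_a", "fix login regression", True),
        Trial("tool_b", "fix login regression", False),
        Trial("tool_a", "add export button", True),
        Trial("tool_b", "add export button", True),
    ]
    print(acceptance_rates(sample))
    print(tasks_where_tools_disagree(sample))
```

The tasks where the tools disagree are usually more informative than the overall rate, since the differences between strong models are subtle and tend to show up on specific kinds of work.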
The review discipline behind shipping AI-written code
Adopting an agentic tool that writes code raises an obvious question: how do you responsibly ship code you did not write line by line? The answer is a review discipline that is as important as the tool itself. The model writes, but a human reviews, and that review is not a formality; it is where correctness, fit, and quality are actually assured. The agentic flow is powerful precisely because it does the work and reports back, but the reporting-back is the point at which a person checks that the work is right before it ships. The tool is a capable worker, not an unsupervised one.
This approver model is what makes relying on AI-written code safe for real software rather than reckless. The leverage comes from the AI doing the heavy lifting of implementation, and the safety comes from a human holding the judgment about whether what was implemented is correct and appropriate. Neither half works alone: AI without review risks shipping plausible-looking mistakes, and review without the AI doing the work gives up the leverage. The combination — AI writes, human reviews — is the discipline that lets a small team move fast without sacrificing the correctness that real users depend on. Getting that balance right is a large part of why the agentic flow works for us as a business rather than just as a novelty, because it pairs the speed with the accountability that shipping real software requires.
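The "AI writes, human reviews" rule can be stated as a simple gate. The sketch below is illustrative only and assumes a hypothetical `Change` record and `can_ship` check; it is not tied to any real tool's API, and a real workflow would more likely enforce this through code review settings than through a script.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Change:
    """A unit of AI-written work waiting to ship (hypothetical structure)."""
    summary: str
    build_passed: bool = False      # the agent ran the build and it succeeded
    reviewer: Optional[str] = None  # set only once a human has approved


def can_ship(change: Change) -> bool:
    """A change ships only when the build passed AND a human approved it."""
    return change.build_passed and change.reviewer is not None
```

A passing build on its own never satisfies the gate, which is the point: the agent's report is an input to the human decision, not a substitute for it.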
Why we still rate the tools we left
It would be easy to read a standardization story as a verdict against the tools not chosen, but that is genuinely not our position, and the honesty matters. The tools we moved through — autocomplete-style assistance, the editor-wrapped approach, the major chat assistants — are all genuinely good, and each gave us real value at the time we used it. Our path through them was not a series of disappointments but a progression, each step useful, leading to a setup that happened to fit us best. The fact that we consolidated onto one does not diminish the others; it reflects what matched our specific way of working.
We continue to respect and occasionally reach for the tools we did not standardize on, because they remain excellent at what they do. The big chat assistants are phenomenal generalists for thinking through problems, drafting, and exploring, and we use that capability regularly even while building with an agentic flow. The point is not that one tool is good and the others bad but that different tools fit different needs, and ours led us to a particular combination. Anyone reading this should take it as an account of a fit, not a ranking — the right choice for a different team, with different work and different preferences, could easily be one of the tools we did not standardize on, and that would be a perfectly good answer.
Standardizing is a focus decision, not a final verdict
It is worth being clear about what standardizing means and does not mean, because the word can sound more final than it is. Choosing one tool and one model is a decision to focus — to stop splitting attention and stop perpetually re-evaluating — rather than a permanent verdict that nothing will ever change. The landscape of these tools moves quickly, and we hold our choice with the awareness that it could be revisited if something genuinely shifted the calculus. Standardizing buys focus now; it does not foreclose change later.
This framing keeps the decision from becoming either dogma or a source of anxiety. We are not claiming our choice is the eternal best or that we have stopped paying any attention to the field; we are saying that, for now and the foreseeable future, the focus gained from committing to one capable setup outweighs the option value of keeping everything open. If the situation changes meaningfully, we can re-evaluate on our own work, as before. Until then, the value is in having settled the question and directed our energy at building rather than at choosing. Standardization is a deliberate trade of optionality for focus, made consciously and understood as a current decision rather than a closed door, which is exactly how a small operation should treat a fast-moving but important tooling choice.
How this fits the lab operating model
This tooling choice is not separate from how the whole operation runs; it is an expression of the same operating model that governs everything we build. We run as an app-testing lab — ship fast, measure honestly, iterate or cut — and that requires the ability to move quickly across a portfolio of apps with very few people, which is exactly the capability the agentic flow provides. The tool that lets a small team implement, fix, and ship across real codebases is the engine that makes the build-fast half of the operating model possible at the scale we run.
There is also a philosophical alignment worth noting: we favor approaches that give a small operation leverage without heavy ongoing cost, which is the same instinct that leads us to build apps that run almost for free on the user's device. An AI tool that multiplies what a few people can ship fits that instinct, because it is leverage rather than headcount. The way we build our apps and the way we choose our tools come from the same place — maximizing what a tiny, lean operation can accomplish. Seeing the tooling decision as part of the operating model rather than a separate technical choice explains why we took it seriously: it is one of the foundations that makes the entire lean, fast, portfolio-based way of operating actually work.
The setup we landed on
So today it is simple: one coding tool, Claude Code, and one model, Claude. Standardizing removed the overhead of context-switching between tools and the temptation to chase whichever model was trending that week. We still respect — and occasionally reach for — the others; ChatGPT and Gemini and the tools we used before are all genuinely good, and the right choice for someone else may be different.
For us, though, the combination of a capable model and an agentic flow tuned to actually moving work forward is what fit a small business building real apps. If you are evaluating these tools, our honest advice is to test them on your own work rather than on benchmarks or hot takes — the differences are subtle, and the one that fits how you work is the one worth standardizing on.