
2026 · Operating model · About 13 min read · Novus Stream Solutions

What an "app testing lab" actually does: our build → ship → measure → keep-or-kill loop

Novus Stream Solutions calls itself an app testing lab, and the label is operational, not marketing. The full loop — build narrow, ship into real usage, measure activation, and decide to keep, kill, or double down — explained.

Figure: the build, ship, measure, and keep-or-kill loop of an app testing lab.

Overview

Novus Stream Solutions describes itself as an app testing lab, and that phrase does real work — it is how product decisions actually get made, not a slogan on an about page. An app testing lab builds small, useful digital products, ships them into real usage conditions, measures what happens, and decides what to grow and what to move on from. The whole operating model is a loop, and naming each part of it precisely is the difference between "we make apps" and a system that consistently produces a small number of products worth keeping. This post walks the full loop — build, ship, measure, keep-or-kill — and explains why the part everyone finds uncomfortable, killing things, is the part that makes the rest work.

The reason a small operation needs this loop more than a large one does is constraint. A big company can afford to carry products that are not working for a long time; a one-person operation cannot, because every product that stays alive consumes attention and infrastructure that a better product could have used. The lab model is how scarce attention gets allocated by evidence rather than by attachment — you build a thing, you find out whether it is real, and you act on the answer. That cycle, run honestly, is what keeps the portfolio small enough to maintain well.

Build: the narrowest version of the useful thing

The loop starts with building, but the discipline is in how narrow the build is. Every product begins as the smallest version that delivers genuine value on its own — not a feature-complete vision, but the core useful thing, scoped down to what can be shipped and tested quickly. For the visualizer that meant upload, customize, export, and nothing else until that loop worked; for the background remover it meant removing a background cleanly and exporting it correctly before anything else. The point of building narrow is not to do less work; it is to reach the measuring stage faster, because until a product is in front of real users, every belief about whether it is good is a guess. Narrow builds shorten the distance to evidence.

Building narrow also protects against the most expensive mistake a small operation can make: pouring months into a product before discovering nobody wants it. A narrow build that ships in weeks risks weeks; a comprehensive build that ships in months risks months, on a bet that has not been tested. The lab model treats the first version as a question to be answered, not a product to be perfected, which is why the scoping is ruthless. You build exactly enough to ask the question "does anyone actually use this" and no more, because everything beyond that is investment in an answer you do not have yet.

Ship: into real usage, not a focus group

Shipping means putting the product into real usage conditions, which is categorically different from testing it in a controlled setting. The most reliable signal about whether a product works is what people actually do with it when it is live and they have a real reason to use it, rather than what they say in a survey, how it performs in a demo, or what internal opinion holds. Real usage surfaces the things no one predicted: the step that seems obvious to the builder and confuses every new user, the feature meant to be secondary that turns out to be what people reach for first, the failure mode that only appears under the messiness of real inputs. None of that is visible before shipping, which is why shipping is the step that converts beliefs into evidence.

The willingness to ship before everything is perfect is what makes the lab a lab rather than a workshop. There is always a temptation to keep polishing in private until the product feels ready, and that temptation is almost always a way to avoid the discomfort of finding out. The lab model accepts that the first version will have rough edges in places no one anticipated, because discovering those edges in production with real users is more valuable than a longer build that still could not have predicted them all. Shipping is not the end of building; it is the start of learning, and the learning is the point.

Measure: activation, not vanity

Measuring is where the loop either produces real decisions or degenerates into self-congratulation, and the difference is what you measure. The metric that matters is activation — did the user complete the core workflow the product exists to deliver — not the vanity metrics that look good on a dashboard. A signup is not activation; a visit is not activation; time-on-page in isolation is not activation. The question is whether the person who arrived actually did the thing: removed the background, exported the video, completed the loop. A product with a high activation rate among the people who try it is healthier than a product with ten times the traffic and a low activation rate, even though the second one looks bigger, because activation measures whether the product works and traffic only measures whether people arrived.

Holding to activation as the primary measure is a discipline because vanity metrics are seductive precisely when you most want reassurance. When a product is struggling, the signup count or the pageview number is the comforting number to look at, and it is the wrong one — it tells you people found the door, not that the room was worth entering. The lab measures whether users complete the core action because that is the signal that actually predicts whether a product deserves continued investment. Everything else is context at best and self-deception at worst, and the measuring step is only worth doing if its output is a number that can actually change a decision.

Figure: a high-activation tool versus a high-traffic, low-activation tool. A tool with 90% activation is healthier than one with 10× the signups and 15% activation, because activation measures whether the product works.
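As a rough illustration of the distinction, activation rate can be computed as completions of the core workflow divided by the number of users who tried it. The sketch below is a minimal Python example under stated assumptions: the `UsageCounts` type, its field names, and the sample numbers are hypothetical, chosen only to mirror the 90% versus 15% comparison, and do not describe the lab's actual instrumentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageCounts:
    """Hypothetical per-product counts for one measurement window."""
    name: str
    tried: int      # users who started the core workflow
    completed: int  # users who finished it (e.g. exported the result)


def activation_rate(counts: UsageCounts) -> float:
    """Share of users who tried the product and completed the core workflow."""
    if counts.tried == 0:
        return 0.0  # no usage yet, so there is nothing to measure
    return counts.completed / counts.tried


# Illustrative numbers only: a small tool and one with ten times the arrivals.
small_tool = UsageCounts(name="small", tried=200, completed=180)
big_tool = UsageCounts(name="big", tried=2000, completed=300)

for tool in (small_tool, big_tool):
    print(f"{tool.name}: {activation_rate(tool):.0%} activation")
```

With these made-up counts the larger tool still records more completions in absolute terms (300 versus 180). The rate is the number that says whether the product works for the people who reach it, which is the question the measuring step is meant to answer.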

Why the lab framing is operational, not branding

The phrase "app testing lab" is worth taking literally, because the word lab implies something specific that distinguishes this model from how most product work is described. A lab runs experiments — structured attempts to answer a question, where the outcome is genuinely uncertain and the point is to learn the answer rather than to confirm a foregone conclusion. Framing product work as experiments rather than launches changes the relationship to the outcome: an experiment that returns a negative result has succeeded at its job of producing an answer, whereas a launch that fails is simply a failure. The lab framing builds the possibility of a negative result into the model from the start, which is what makes acting on negative results feel like the process working rather than the process breaking.

This is not a comforting reframe applied after the fact; it is the operational stance that makes the whole loop function. If each product is a launch you are committed to making succeed, then evidence that it is not working is threatening and gets rationalized away, which is how operations end up carrying products that should have been cut. If each product is an experiment whose job is to produce an answer, then evidence that it is not working is just the answer the experiment was run to get, and acting on it is completing the experiment. The lab metaphor is operational because it sets the expectation — going in — that some experiments return negative, and that returning a clear negative is a success of the method. That expectation is what allows the honest measurement and the willingness to cut that the model depends on.

Why scarcity makes the loop mandatory

The build-ship-measure-decide loop is good practice for any operation, but for a small one it is not optional — it is forced by scarcity in a way that a large operation does not feel. A big company has enough resources to carry underperforming products for a long time, to run many bets simultaneously without each one competing hard for attention, and to absorb the cost of indecision. A one-person operation has none of that slack: every product that stays alive consumes a meaningful share of the total attention and infrastructure available, so carrying a product that is not working is not a minor inefficiency but a direct theft of resources from the products that are.

This scarcity is why the loop has to be run rigorously rather than loosely. The decisions it produces — what to keep, what to cut — are decisions about how to allocate the single most constrained resource the operation has, which is the founder's attention. A loop that does not actually produce decisions, or that produces them and then does not act on them, lets attention pool in the wrong places, and at small scale there is no surplus attention to compensate. The discipline of the loop is therefore a direct response to the constraint: because attention is scarce, it must be allocated by evidence, and the loop is the mechanism that turns evidence into allocation. The bigger you are, the more you can get away with running this loosely; the smaller you are, the more rigorously it has to run, which is why a solo operation needs it most.

Why real usage beats every other signal

The insistence on shipping into real usage rather than relying on controlled testing rests on a hard truth about product signals: what people actually do when they have a real reason to use a live product is categorically more reliable than what they say, predict, or do in an artificial setting. Surveys capture stated preferences, which diverge from behavior; demos capture performance under ideal conditions, which real inputs do not respect; internal opinion captures the builder's blind spots, which are exactly the things real users expose. Real usage is the only signal generated by people pursuing their own genuine goals with the actual product, which is why it surfaces the truths the other methods systematically miss.

The things real usage reveals are specifically the unpredictable ones. The step that the builder finds obvious and every new user finds baffling does not show up in a demo run by the builder; it shows up when strangers hit it cold. The feature meant to be secondary that turns out to be the main draw does not show up in a plan; it shows up in what people actually reach for. The failure mode that only appears under the messy diversity of real inputs does not show up in clean testing; it shows up in production. Because the most valuable lessons are precisely the ones no one predicted, and unpredicted lessons by definition cannot be captured by methods that test predictions, real usage is irreplaceable. Shipping is not just one step toward learning; for the lessons that matter most, it is the only way to learn them at all.

The seduction of vanity metrics

The reason activation has to be defended as the primary metric is that vanity metrics are seductive precisely when judgment is most compromised, which is when a product is struggling. A signup count, a pageview number, a follower total — these are easy to grow, pleasant to look at, and comforting to point to when the harder question of whether the product actually works is returning an uncomfortable answer. The danger is not that vanity metrics are meaningless but that they are meaningful enough to feel like real signal while measuring the wrong thing: they measure whether people arrived, not whether the product delivered, and those are very different questions with very different implications.

Holding to activation requires resisting that seduction deliberately, especially in the moments it is strongest. When a product is doing well, you can afford to look at any metric because they all point the same way; when it is doing poorly, the temptation is to find the metric that still looks good and attend to that one, which is exactly when discipline matters. The metric that predicts whether a product deserves continued investment is whether the people who try it complete its core purpose, and that metric is sometimes uncomfortable, which is the point — a measurement that can only ever reassure you is not a measurement, it is a decoration. The lab measures activation because it is the number that can actually change a decision, and a metric that cannot change a decision is not worth the dashboard space, however good it looks.

The loop never actually stops

A subtle but important property of the build-ship-measure-decide loop is that it does not end when a product is kept. The loop keeps running on every product in the portfolio, continuously, because a product that earns its place today has to keep earning it. Keeping a product is not a permanent verdict but a current one, and the same measurement that justified keeping it continues, so that a product which was healthy and later declines is caught by the ongoing loop rather than coasting indefinitely on a past decision. The loop is a standing process applied to the whole portfolio over time, not a one-time gate that a product clears once and never faces again.

This continuous quality matters because products and their contexts change. A tool that was worth keeping can lose relevance, develop problems, or stop being used as alternatives emerge, and an operation that only evaluated products once would carry these declining products on the strength of a stale judgment. Running the loop continuously means every product is always being asked whether it still earns its place, which keeps the portfolio honest over time rather than just at launch. It also means the retirements that happen are not failures of the original keep decision but the loop doing its ongoing job — a product that was rightly kept and later rightly retired has been correctly handled at both moments by the same continuous process. The loop never stops because the question it answers — is this still worth maintaining — never stops being relevant, and a portfolio kept sharp is one where that question is always being asked.
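One way to picture the standing check is a function that is re-run on each product's recent activation history, so that a product kept earlier is flagged once its numbers decline. This is only a sketch: the function name, the four-week window, and the 0.5 floor are hypothetical placeholders, since the post does not state any thresholds.

```python
from statistics import mean


def still_earning_its_place(
    weekly_rates: list[float],
    floor: float = 0.5,  # hypothetical minimum acceptable activation rate
    window: int = 4,     # hypothetical number of recent weeks to average
) -> bool:
    """True if mean activation over the most recent `window` weeks is at or above `floor`."""
    if not weekly_rates:
        return False  # no measurements means no evidence the product earns its place
    recent = weekly_rates[-window:]
    return mean(recent) >= floor


# Illustrative history: healthy when first kept, declining later.
history = [0.82, 0.80, 0.78, 0.55, 0.45, 0.40, 0.35]

print(still_earning_its_place(history[:3]))  # evaluated at the original keep decision
print(still_earning_its_place(history))      # evaluated again after the decline
```

The same function gives different answers at the two moments, which mirrors the idea that a keep decision is a current verdict rather than a permanent one.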

Keep, kill, or double down — and why killing is the system working

The loop closes with a decision, and there are only three: keep the product running as it is, double down and invest more because the evidence is strong, or kill it because it has not earned its place. The criteria are deliberately simple: the product does what it claims, users can complete its core workflow without significant hand-holding, and there is evidence of real usage rather than just initial curiosity. A product that meets all three earns continued investment. A product that consistently fails any one of them gets a defined window to improve and then a real decision, not an indefinite life on the strength of the effort already spent on it. The decision is made on the evidence the measuring step produced, which is the entire reason the measuring step has to produce real signal.
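The three criteria and three outcomes above can be sketched as a small decision function. This is a hedged illustration, not the lab's actual tooling: the `Evidence` fields, the `strong_signal` and `window_expired` flags, and the interim `IMPROVE` state (standing in for the "defined window to improve") are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    KEEP = "keep"
    DOUBLE_DOWN = "double down"
    KILL = "kill"
    IMPROVE = "improvement window"  # assumed interim state, not a final verdict


@dataclass(frozen=True)
class Evidence:
    """Hypothetical summary of what the measuring step produced."""
    does_what_it_claims: bool
    workflow_completable_unaided: bool
    real_usage: bool              # sustained use, not just initial curiosity
    strong_signal: bool = False   # assumed flag: evidence strong enough to invest more
    window_expired: bool = False  # assumed flag: the improvement window has run out


def decide(e: Evidence) -> Decision:
    """Map the three criteria to keep, double down, kill, or an improvement window."""
    meets_all = (
        e.does_what_it_claims
        and e.workflow_completable_unaided
        and e.real_usage
    )
    if meets_all:
        return Decision.DOUBLE_DOWN if e.strong_signal else Decision.KEEP
    # Failing any criterion earns a bounded window first, then a real decision.
    return Decision.KILL if e.window_expired else Decision.IMPROVE


print(decide(Evidence(True, True, True)).value)
print(decide(Evidence(True, False, True, window_expired=True)).value)
```

Nothing in the function looks at how much effort was already spent on the product, which is the design choice the criteria imply: the decision reads only the evidence.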

Killing a product is the part that feels like failure and is actually the system working correctly. A product that does not earn its place should not continue consuming the attention and infrastructure that the rest of the portfolio needs, and the honest acknowledgment that something did not work — followed by redirecting that energy — is what keeps the surviving products sharp. The sunk cost of having built something is not a reason to keep it; it is already spent, and keeping a weak product alive to honor it just spends more. Every product currently in the Novus portfolio has passed its own version of this evaluation, which is precisely why the portfolio is small enough to maintain each one well. The lab succeeds not by keeping everything it builds, but by being willing to cut what does not work so that what does can get the attention it deserves. The companion posts cover the lean cost structure that lets the lab run many experiments and the free-first model the products are monetized under.