2026 · NSS Background Remover · About 13 min read · Novus Stream Solutions
When patches compound, rebuild: the queue and worker overhaul
An engineering retrospective on the moment we stopped patching a class of silent failures in the Background Remover and rebuilt the execution model from a clean foundation — per-job worker isolation and one canonical queue state machine.
Overview
Every codebase eventually faces the same decision in miniature: the current approach is producing a recurring class of bug, and you can either patch this specific instance again or admit the approach itself is wrong and rebuild it. Patching is always cheaper this week and frequently more expensive over the next month, because the patches compound — each one adds a special case, the special cases interact, and eventually you are maintaining a pile of fixes that is harder to reason about than the thing they were fixing. This is the story of one of those moments in the Background Remover, what the recurring failure actually was, and why rebuilding the execution model was the right call even though patching would have been faster.
The honest framing matters here, so one note up front: this is an engineering retrospective sourced from the product's real version history, not a war story with invented drama. The symptoms, the root cause, and the fix are all real and documented in the changelog. What makes it worth writing is the lesson, which generalizes well beyond background removal: the signal that it is time to rebuild rather than patch is when your fixes start interacting with each other.
The symptom: tools that failed silently
The failure presented in the most frustrating possible way: silently. A user would run a tool — background removal on the Best Quality model, a video job, the live camera — and instead of an error, the operation would simply not work as expected. No crash, no clear message, just a result that did not happen or a tool that appeared to do nothing. Silent failures are the worst category to diagnose because there is no stack trace pointing at the problem; the system is not throwing an error, it is quietly producing the wrong outcome. From the user's side it just looks like the tool is broken and unreliable, which is corrosive to trust in a way that a clear error message is not.
Underneath, the cause was a memory-lifecycle issue specific to running models in WebAssembly. A model inference session was being reused across jobs, and when a stale session was not properly disposed, it corrupted the WebAssembly heap. The next operation that touched that corrupted memory would misbehave — sometimes failing, sometimes producing garbage — with no clean error because heap corruption does not announce itself politely. The companion case study traces the diagnosis of this bug in detail; what matters for the rebuild story is that it was not a one-location bug. The same session-reuse pattern existed anywhere a model was run.
The patch trap
The tempting fix, and the first one attempted, is local: find the place where the session was not disposed and add the disposal call. That works — for that location. But the pattern of reusing a model session and needing to manage its lifecycle existed across multiple tools, each with slightly different code. Patching them one at a time meant the same conceptual bug could reappear in any tool that had not yet been touched, and worse, each patch was a slightly different special case bolted onto a slightly different queue implementation. The tool had grown several queue stores over time — one for the main queue, one for video, one for batch, one for GIFs, and more — and they had drifted apart, each handling cancellation, errors, and worker lifecycle in its own way.
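For concreteness, the local patch has roughly the following shape. This is a minimal sketch, not the tool's actual code: the `InferenceSession` interface and the `runWithDisposal` helper are hypothetical names introduced here, and the real session API is not shown in this retrospective.

```typescript
// Hypothetical session shape for illustration; the real inference API
// used by the tool is not shown in this article.
interface InferenceSession {
  run(input: Uint8Array): Uint8Array;
  dispose(): void;
}

// The local patch: guarantee disposal on every exit path of ONE call site.
function runWithDisposal(
  createSession: () => InferenceSession,
  input: Uint8Array,
): Uint8Array {
  const session = createSession();
  try {
    return session.run(input);
  } finally {
    // Runs on success and on throw, so this call site never leaks a session.
    session.dispose();
  }
}
```

The sketch is correct for the call site it wraps, which is exactly the limitation: every other place that runs a model needs its own copy of the same discipline.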
This is the compounding-patch trap in its natural habitat. Each individual fix was reasonable. Collectively, they were producing a system where the same class of bug could hide in any of half a dozen subtly different queue implementations, and where fixing it everywhere meant making the same edit six different ways and hoping you found all the places. The maintenance cost was not in any single patch; it was in the divergence the patches were papering over. When you find yourself making the same fix repeatedly in slightly different forms, the fix is not the problem — the architecture that requires the fix in six places is.
The rebuild: isolation and one canonical queue
The decision was to stop patching the symptom and rebuild the execution model so the bug class could not exist. Two changes did the work. First, per-job worker isolation: instead of sharing a long-lived model session across jobs, every job spawns a fresh Worker that is hard-terminated on completion or failure. If there is no shared session that survives between jobs, there is no stale session to leave undisposed, and a corrupted worker is thrown away rather than reused. The entire category of "stale session corrupts the next job" simply stops being possible, because nothing carries over between jobs to be stale.
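The isolation rule can be sketched in a few lines. The `JobWorker` interface and `runIsolated` function below are assumptions made for illustration, not the tool's real types; the point is only the shape: spawn per job, terminate on every path.

```typescript
// Minimal worker surface this sketch relies on. The interface is an
// assumption for illustration; a browser Worker could be adapted to it.
interface JobWorker<In, Out> {
  run(input: In): Promise<Out>;
  terminate(): void;
}

// Runs one job in a fresh worker and hard-terminates it afterwards.
async function runIsolated<In, Out>(
  spawn: () => JobWorker<In, Out>,
  input: In,
): Promise<Out> {
  const worker = spawn(); // fresh per job: nothing carries over
  try {
    return await worker.run(input);
  } finally {
    // Completion or failure, the worker is discarded, never reused.
    worker.terminate();
  }
}
```

In a browser, `spawn` could wrap `new Worker(...)` and map `terminate` onto `Worker.terminate()`, which stops the worker immediately. Because the caller never holds a worker between jobs, there is no reference left through which a stale session could be reached.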
Second, the divergent queue stores were brought to one canonical state machine. Every queue (main, video, batch, GIF, and the rest) was reconciled to the same explicit set of states and the same cancellation and error behavior. A cancel now means the same thing everywhere: set the item to cancelled, stop processing, and do it without leaving the worker in a half-killed state. Instead of six implementations, each with its own subtle handling, there is one model of how a job moves through the system, applied consistently. That consistency is what makes the system tractable to maintain: a fix or an improvement to the queue behavior now applies once, correctly, everywhere, instead of needing to be reimplemented per tool.
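A canonical state machine of this kind might look like the sketch below. Only the `cancelled` state is named in this retrospective; the other state names, the transition table, and the `stopWorker` callback are assumptions chosen to illustrate the idea.

```typescript
// State names other than "cancelled" are illustrative assumptions.
type JobState = "queued" | "processing" | "done" | "failed" | "cancelled";

// Allowed transitions; terminal states have no outgoing edges.
const TRANSITIONS: Record<JobState, readonly JobState[]> = {
  queued: ["processing", "cancelled"],
  processing: ["done", "failed", "cancelled"],
  done: [],
  failed: [],
  cancelled: [],
};

interface Job {
  id: string;
  state: JobState;
}

function canTransition(from: JobState, to: JobState): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}

// Returns a copy of the job in the target state, or throws on an illegal move.
function transition(job: Job, to: JobState): Job {
  if (!canTransition(job.state, to)) {
    throw new Error(`illegal transition: ${job.state} -> ${to}`);
  }
  return { ...job, state: to };
}

// One definition of cancel for every queue: stop the worker if one is
// running, then mark the item cancelled. Terminal jobs are left untouched.
function cancel(job: Job, stopWorker: () => void): Job {
  if (!canTransition(job.state, "cancelled")) return job;
  if (job.state === "processing") stopWorker();
  return transition(job, "cancelled");
}
```

Keeping the transitions in a single table means an illegal move fails loudly at the point it is attempted, instead of surfacing later as a job stuck in an unexpected state.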
How the queue stores had drifted apart
To understand why the rebuild was necessary, it helps to picture the state the codebase had drifted into, because it is a state most growing projects reach. The tool had accumulated several separate queue implementations over time — one for the main flow, one for video, one for batch, one for GIFs, and more — each added when a new tool needed queuing, each written in the moment to handle that tool's needs. Individually, each was reasonable. Collectively, they had diverged: they handled cancellation differently, reported errors differently, managed worker lifecycle differently, so that the same conceptual operation behaved subtly differently depending on which queue you happened to be in.
This divergence is the natural result of solving the same problem repeatedly under local pressure rather than once with a shared design. Nobody decided to have several inconsistent queues; they arose from each tool getting its own queue at the time it was built, with no force pulling them toward consistency. The cost was invisible until a bug — the session-corruption issue — needed to be fixed across all of them, at which point the divergence meant fixing it correctly required understanding and editing each queue's idiosyncratic handling separately. The drifted-apart state is what turned a single conceptual fix into a multi-front effort, and recognizing that the divergence itself was the problem, more than any individual queue, is what pointed toward consolidation as the real solution.
What "canonical" actually bought
Bringing the divergent queues to one canonical state machine was the core of the rebuild, and it is worth being concrete about what that consolidation actually bought beyond tidiness. With one model of how a job moves through the system — the same states, the same transitions, the same cancellation and error behavior everywhere — a fix or an improvement to queue behavior now lands once and applies correctly across every tool, rather than needing to be reimplemented per queue with the attendant risk of missing one or doing it slightly differently. The canonical model converts a class of work from "do this edit in six places and hope" into "do this edit once."
This is the durable value that justifies the cost of the rebuild. The divergent queues were not just inconsistent for users; they were a multiplier on the cost of every future change to queue behavior, because each change had to be made several times. Consolidating to one canonical state machine removed that multiplier permanently, so the rebuild paid for itself not only in fixing the immediate bug across all tools but in making every subsequent queue-related change cheaper and safer. The consistency users experience — every tool's queue behaving the same way — is the visible face of an internal change whose deeper payoff is that the queue is now reasoned about and maintained in one place rather than six. Canonical means single source of truth, and a single source of truth is what makes the system tractable.
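The "edit once" property comes from every tool holding an instance of the same queue implementation, not its own copy. A minimal sketch, assuming a hypothetical `JobQueue` class and tool names borrowed from the queues mentioned above:

```typescript
type State = "queued" | "processing" | "done" | "failed" | "cancelled";

// One queue implementation; each tool gets an instance, not its own copy.
class JobQueue {
  readonly tool: string;
  private states = new Map<string, State>();

  constructor(tool: string) {
    this.tool = tool;
  }

  enqueue(id: string): void {
    this.states.set(id, "queued");
  }

  start(id: string): void {
    if (this.states.get(id) === "queued") this.states.set(id, "processing");
  }

  // Defined once: a change here changes cancel behavior for every tool.
  cancel(id: string): void {
    const s = this.states.get(id);
    if (s === "queued" || s === "processing") this.states.set(id, "cancelled");
  }

  stateOf(id: string): State | undefined {
    return this.states.get(id);
  }
}

// Hypothetical tool names, taken from the queues the article lists.
const queues = ["main", "video", "batch", "gif"].map((t) => new JobQueue(t));
```

If cancellation ever needs to change, the change goes into `JobQueue.cancel` and all four instances pick it up, which is the multiplier removal described above.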
The real cost of choosing to rebuild
It would be dishonest to present the rebuild as free, because it was not, and being clear about its cost is what makes the decision instructive rather than glib. Rebuilding the execution model — introducing per-job worker isolation and reconciling several divergent queues to one canonical model — was substantially more work than the local fix of adding a disposal call where the bug surfaced. It meant touching multiple tools, changing how jobs are executed across the suite, and accepting the risk that comes with any significant structural change. In the short term, the patch was cheaper, faster, and less risky, which is exactly why the temptation to patch is always strong.
The rebuild was the right call despite costing more precisely because the cost was one-time while the patch's cost was recurring and compounding. Another patch would have fixed this instance and left the architecture that breeds the bug intact, guaranteeing future instances and future patches, each adding a special case to an already-divergent system. The rebuild ended the recurrence by making the bug class structurally impossible and the queues uniform, which is a permanent return on a one-time cost. The honest accounting is that rebuilding cost more that week and less over every month after, which is the calculation that should drive the patch-versus-rebuild decision — not which is cheaper now, but which is cheaper over the life of the system. When the patch's costs compound, the more expensive rebuild is the economical choice.
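The accounting can be made concrete with a toy model. Everything here is an assumption for illustration: costs are in arbitrary units, each patch is assumed to cost a fixed multiple (`growth`) of the previous one, and no figures from the actual project are implied.

```typescript
// Cumulative cost of n patches when each costs `growth` times the previous.
function cumulativePatchCost(first: number, growth: number, n: number): number {
  let total = 0;
  let cost = first;
  for (let i = 0; i < n; i++) {
    total += cost;
    cost *= growth;
  }
  return total;
}

// Smallest number of patches whose total cost exceeds a one-time rebuild.
function breakEvenPatches(rebuild: number, first: number, growth: number): number {
  if (first <= 0 || growth < 1) {
    // With shrinking or free patches the total may never pass the rebuild.
    throw new RangeError("expected first > 0 and growth >= 1");
  }
  let n = 0;
  while (cumulativePatchCost(first, growth, n) <= rebuild) n++;
  return n;
}
```

With a rebuild costing 10 units and patches that start at 1 unit and double each time, the fourth patch tips the balance (1 + 2 + 4 + 8 = 15). With flat patch costs (`growth` of 1) it takes eleven. The compounding is what shortens the payback.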
The tradeoff of a worker per job
Per-job worker isolation is the structural fix that made the bug class impossible, but it is not free, and being honest about its tradeoff is part of understanding why it was still the right choice. Spawning a fresh worker for every job and tearing it down afterward costs something — there is overhead in creating a worker, loading what it needs, and disposing of it, compared to reusing one long-lived worker across many jobs. A design optimized purely for raw throughput might keep a worker alive and reuse it, accepting the lifecycle-management burden in exchange for avoiding the spawn cost. The isolation model deliberately pays the spawn cost to buy something more valuable.
What it buys is the elimination of an entire class of bug, and that trade is worth it because correctness and reliability matter more here than shaving the spawn overhead. The session-corruption bug came precisely from sharing state across jobs; isolating each job in a disposable worker removes the shared state and therefore the bug, at the price of a per-job overhead that is small next to the inference work itself. The tradeoff is favorable because the cost is modest and the benefit is structural: a guarantee that no job can corrupt another through a leftover session. Choosing isolation over reuse is choosing reliability over a marginal performance gain, which is the right priority for a tool people depend on.
When patching is genuinely the right call
A rebuild-over-patch story risks implying that rebuilding is always right, which would be the wrong lesson, so it is worth being clear about when patching is genuinely the correct choice. A patch is right when the problem is truly local — a one-off bug that does not reflect a pattern, a mistake confined to a single place with no structural cause — because then the local fix addresses the actual root and a rebuild would be over-engineering. Most bugs are like this, and reaching for a rebuild every time would be its own pathology, churning architecture in response to problems that a targeted fix resolves completely. The default should be the proportionate fix, which is usually the patch.
The signal that tips the balance toward rebuilding is specific: when you find yourself making the same fix repeatedly in slightly different forms, because the architecture forces the bug to exist in many places. That recurrence is the evidence that the problem is structural rather than local, and that patching is treating symptoms of a deeper cause. The session-corruption bug crossed that line because the pattern of manual session lifecycle existed throughout the codebase, so any single patch left the others exposed. The discipline is to patch by default and rebuild only when the pattern of repeated, similar fixes reveals a structural cause — not to rebuild reflexively, but to recognize the specific signal that local fixes have stopped being sufficient. Knowing which situation you are in is the actual skill; the rebuild was right here because the signal was clearly present, not because rebuilding is generally superior.
The all-tools doctrine as standing policy
A second lesson from the rebuild hardened into a standing policy worth stating plainly: when you fix a class of bug, audit every tool that shares the pattern, not just the one that reported the failure. The session bug surfaced in one place but lived in many, and a fix that touched only the reported location would have guaranteed that the same bug resurfaced elsewhere later. Bringing every queue to one canonical model was the all-tools version of the fix: it addressed the pattern across the whole suite rather than playing whack-a-mole as each tool surfaced the same issue in turn. That doctrine now applies whenever a fix reveals a pattern.
The value of making this an explicit policy rather than a one-time response is that it changes how every fix is approached. When a bug is found, the question is no longer just "where did this happen" but "where else does this pattern live," which converts reactive, repeated firefighting into proactive elimination of the whole class. This is more work upfront for any single fix, but it prevents the slow drip of the same bug reappearing in untouched corners, which over time is far more expensive. The all-tools doctrine, the rebuild-when-patches-compound principle, and the per-job isolation that came from this one bug together form a small body of hard-won policy that shapes how the suite is maintained — which is the real payoff of taking an engineering retrospective seriously rather than just closing the ticket and moving on.
The lesson: rebuild the approach, then apply it everywhere
The rebuild produced two durable lessons that now shape how the whole tool suite is maintained. The first is the one in the title: when patches compound, rebuild the approach. The trigger is not a single hard bug — it is the recognition that you are making the same fix repeatedly because the architecture forces the bug to exist in many places. At that point the cheapest long-term move is to change the architecture so the bug class is impossible, even though it costs more this week than another patch would. Per-job worker isolation was more work than adding one more disposal call, and it was the right call precisely because it ended the recurrence instead of postponing it.
The second lesson is about scope, and it is the all-tools doctrine described above: when you fix a class of bug, audit every tool that shares the pattern, not just the one that reported the failure. Bringing all the queue stores to one canonical model was that doctrine applied to the session-disposal bug. Those two principles, rebuild-when-patches-compound and fix-across-all-tools, are worth more than the specific bug they came from.