2026 · NSS Background Remover · About 13 min read · Novus Stream Solutions
Diagnosing a silent failure: the ONNX worker-session bug that broke tool execution
A debugging case study: tools that failed silently with no error, traced to a model session that was not being disposed and corrupting the WebAssembly heap — and the per-job worker isolation that fixed it for good.
Overview
The worst bugs are the ones that do not announce themselves. A crash gives you a stack trace and a line number. A thrown error gives you a message to search for. A silent failure gives you nothing — the system simply produces the wrong outcome, no error, no log, no obvious place to start. This is a case study of exactly that kind of bug in the NSS Background Remover: tools that stopped working with no error at all, what it took to trace it, what the root cause turned out to be, and the fix. If you build anything that runs models in WebAssembly, this is a wall you may hit, and the shape of it is worth knowing in advance.
Everything here is drawn from the product's real version history rather than reconstructed for narrative effect; the symptom, the root cause, and the fix are documented in the changelog. The reason it is worth writing up is that the failure mode is genuinely instructive — it sits at the intersection of machine learning, browser memory management, and worker lifecycles, which is a corner most web developers never have to think about until something exactly like this forces them to.
The symptom: nothing happened, and nothing said why
The report was vague in the way silent failures always are: tools were "not working." Not crashing, not erroring, just not producing results. The Best Quality model would be selected, a job would run, and the output would not appear, or a tool that had worked a moment ago would quietly stop responding when run. From the user's perspective the tool was simply unreliable, which is corrosive in a way a clear error is not: a clear error says "something specific went wrong," while a silent failure says "this tool does not work," and users generalize the second one to the whole product.
The first diagnostic problem was reproduction. Silent failures are often state-dependent — they happen after some sequence of prior operations rather than on a fresh load — which makes them maddening to reproduce on demand. The behavior pointed at something accumulating across operations rather than a clean, first-try bug: the tool worked initially and degraded, which is the fingerprint of a resource or memory issue building up over a session rather than a logic error in any single code path.
The trace: following the degradation to memory
The "works then degrades" pattern is the key clue, because it rules out whole categories of bug and points squarely at state that persists between operations. In a tool that runs ML models, the most expensive persistent state is the model inference session itself. The investigation centered on the lifecycle of that session: when it was created, when it was reused, and, as it turned out, when it was supposed to be released and was not. The model sessions run against a WebAssembly backend, and WebAssembly manages memory in a linear heap that the application is responsible for using correctly. A session that is no longer needed but has not been disposed does not just waste memory; under the right conditions it leaves the WebAssembly heap in a corrupted state.
Once memory corruption is on the table, the silent nature of the failure makes sense. A corrupted heap does not raise a clean exception at the point of corruption — it produces wrong behavior later, when some unrelated operation reads or writes the affected memory. That temporal and spatial gap between cause (an undisposed session) and effect (a different tool failing to produce output) is exactly why the failure had no useful stack trace: by the time anything went visibly wrong, the actual mistake was several operations in the past. The fix had to address the cause, because chasing the symptom would have been chasing a different downstream effect every time.
The root cause: a session that was never disposed
The root cause was a stale model session that was not being disposed. When a new job needed a session and a previous one was still lingering undisposed, the leftover session corrupted the WebAssembly heap, and the next operation to touch that memory misbehaved. The immediate, local fix had two parts. First, the stale-session detection path of the model wrappers now explicitly disposes the stale session before proceeding. Second, the model's input key is resolved deterministically by reading the session's input names immediately after load rather than assuming them. That stopped the specific corruption.
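The shape of that local fix can be sketched in TypeScript. This is a minimal sketch under stated assumptions, not the product's actual code: `SessionLike`, `ModelWrapper`, and `acquire` are hypothetical names, "stale" is assumed to mean "loaded for a different model than the one the new job needs," and taking the first entry of `inputNames` as the input key is an assumption that only suits single-input models. The `inputNames` property and the asynchronous `release()` mirror onnxruntime-web's `InferenceSession`; other runtimes may name their disposal method differently.

```typescript
// Structural stand-in for an inference session (hypothetical interface).
// onnxruntime-web's InferenceSession exposes `inputNames` and `release()`.
interface SessionLike {
  readonly inputNames: readonly string[];
  release(): Promise<void>;
}

interface LoadedModel<S extends SessionLike> {
  modelId: string;
  session: S;
  inputKey: string; // read from the session after load, never hard-coded
}

class ModelWrapper<S extends SessionLike> {
  private current: LoadedModel<S> | null = null;
  private readonly load: (modelId: string) => Promise<S>;

  constructor(load: (modelId: string) => Promise<S>) {
    this.load = load;
  }

  async acquire(modelId: string): Promise<LoadedModel<S>> {
    const current = this.current;
    if (current && current.modelId === modelId) {
      return current; // still the right session, safe to reuse
    }
    if (current) {
      // Stale-session path: release the old session BEFORE loading a new one.
      this.current = null;
      await current.session.release();
    }
    const session = await this.load(modelId);
    // Assumption: a single-input model, so the first name is the input key.
    const inputKey = session.inputNames[0];
    if (inputKey === undefined) {
      await session.release();
      throw new Error(`Model "${modelId}" reports no inputs`);
    }
    const loaded: LoadedModel<S> = { modelId, session, inputKey };
    this.current = loaded;
    return loaded;
  }
}
```

The ordering is the point of the sketch: the reference to the old session is cleared and the session released before the replacement is created, so no code path can reach a half-disposed session.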
But the local fix exposed the deeper problem, which is the part worth internalizing. The pattern of "reuse a model session and manage its lifecycle by hand" existed wherever the tool ran a model — not just in the one place that surfaced the failure. A disposal call added in one wrapper fixed that wrapper; it did nothing for every other tool with the same pattern. The bug was not a typo in one function. It was a consequence of an architecture that required correct manual session lifecycle management in many independent places, which is a standing invitation for the same bug to reappear anywhere someone forgets the disposal.
Why heap corruption is the worst kind of bug
It is worth dwelling on why this particular failure was so hard to pin down, because the category — memory corruption in a linear heap — has properties that make it uniquely nasty to debug. When a stale session corrupted the WebAssembly heap, the corruption did not cause an immediate, localized failure at the point of the mistake; it left the heap in a bad state that some later, unrelated operation would stumble into when it happened to read or write the affected memory. The cause and the visible effect were separated in both time and code location, so the stack trace at the moment of failure pointed at innocent code that was merely the victim of corruption committed earlier elsewhere.
This temporal and spatial separation is what defeats the normal debugging instinct of looking at where the error appeared. In a heap-corruption bug, where the error appears is almost never where the bug is, so following the stack trace leads you in circles. The behavior is also state-dependent — it only manifests after a particular sequence of prior operations leaves the heap corrupted — which makes it intermittent and hard to reproduce on demand. These properties together explain why the failure presented as a vague, intermittent "tools sometimes stop working" rather than a clean, reproducible crash, and why diagnosing it required reasoning about the system's memory lifecycle rather than reading any single error. Heap corruption is the bug that hides the furthest from its own symptom.
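A toy model makes the separation between cause and symptom concrete. The sketch below is purely illustrative: `ToyHeap` is a hypothetical 64-byte "linear heap" with a recycling allocator, not ONNX Runtime's allocator, and the changelog does not document the exact corruption mechanism. It shows only the general property described above: a write through a stale handle violates no bounds, raises nothing, and damages data owned by unrelated code.

```typescript
// Toy linear heap: one flat byte array plus an allocator that recycles
// freed 16-byte blocks. Illustrative only, not a real runtime allocator.
class ToyHeap {
  readonly bytes = new Uint8Array(64);
  private next = 0;
  private freeList: number[] = [];

  alloc(): number {
    const reused = this.freeList.pop();
    if (reused !== undefined) return reused; // recycle a freed block
    const offset = this.next;
    this.next += 16;
    return offset;
  }

  free(offset: number): void {
    this.freeList.push(offset);
  }
}

const heap = new ToyHeap();

// Job A gets a block. The block is later recycled, but A's handle
// (a plain number) is still held somewhere: the stale state.
const staleHandle = heap.alloc();
heap.free(staleHandle);

// Job B, unrelated code, is handed the same offset and stores its output.
const jobB = heap.alloc();
heap.bytes.fill(0xff, jobB, jobB + 16);

// Leftover code from A writes through its stale handle. The write is in
// bounds, so nothing throws at the point of the mistake.
heap.bytes.fill(0x00, staleHandle, staleHandle + 16);

// B reads back its own block and finds it silently overwritten.
const jobBOutputIntact = heap.bytes
  .slice(jobB, jobB + 16)
  .every((b) => b === 0xff);
console.log(jobBOutputIntact); // false
```

Any error that eventually surfaces would do so in job B's code, which did nothing wrong; that is why the stack trace at the moment of failure is a poor guide in this class of bug.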
Reproducing a state-dependent failure
The first real obstacle in fixing the bug was reproduction, because a failure that depends on accumulated state cannot be triggered reliably by a single fresh action. The tool worked on a clean load and degraded only after some sequence of operations, which is the signature of a resource or memory problem building up over a session rather than a logic error in any one code path. That meant the path to a reproduction was not "do this one thing" but "do this series of things in this order until the state goes bad," which is far harder to discover and far slower to iterate on than a deterministic bug.
This is a general truth about state-dependent bugs worth internalizing: the difficulty is often less in understanding the fix than in reliably triggering the failure so you can confirm you have fixed it. The "works then degrades" pattern is itself the most valuable clue, because it rules out whole categories of bug and points specifically at state that persists between operations — which is what narrowed the investigation toward the model session's lifecycle. Recognizing that pattern as a fingerprint of accumulating state, rather than chasing the symptom wherever it surfaced, is what converted an intermittent mystery into a directed search for what was persisting and corrupting between jobs. The clue was in the shape of the failure, not in any individual instance of it.
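One way to turn "do this series of things until the state goes bad" into something repeatable is a small harness that replays a sequence of jobs and reports the first one that returns nothing. This is a hedged sketch, not the procedure actually used: `findFirstSilentFailure`, `Job`, and `ReproResult` are hypothetical names, and "no output" is assumed to surface as `null` or `undefined`.

```typescript
// A job is any async operation that should yield output (hypothetical shape).
interface Job<T> {
  name: string;
  run: () => Promise<T | null | undefined>;
}

interface ReproResult {
  failed: boolean;
  iteration: number;   // 1-based pass through the sequence
  step: string | null; // name of the first job that produced no output
  history: string[];   // jobs that succeeded before the failure, in order
}

// Replays the same sequence repeatedly so that accumulated state has a
// chance to build up, and stops at the first silent failure.
async function findFirstSilentFailure<T>(
  sequence: Job<T>[],
  maxIterations: number,
): Promise<ReproResult> {
  const history: string[] = [];
  for (let iteration = 1; iteration <= maxIterations; iteration++) {
    for (const job of sequence) {
      const output = await job.run();
      if (output == null) {
        // No exception was thrown, yet there is no output: a silent failure.
        return { failed: true, iteration, step: job.name, history };
      }
      history.push(job.name);
    }
  }
  return { failed: false, iteration: maxIterations, step: null, history };
}
```

The recorded `history` is the useful part: it is the ordered list of prior operations that led to the bad state, which is exactly what a state-dependent bug needs for a reproduction and for confirming a fix afterward.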
What a managed runtime would have done for free
A useful way to understand this bug is to notice that it is exactly the class of problem a managed server runtime would have handled invisibly, which is part of the tax you pay for moving computation to the client. On a managed platform with automatic memory management and process isolation, a leaked or undisposed resource is often cleaned up by the runtime or contained within a process that gets recycled, so the same mistake either does not corrupt anything lasting or is bounded to a single request. The developer rarely has to think about the manual lifecycle of native resources because the platform thinks about it for them.
Running models in WebAssembly in the browser removes that safety net for the specific resources the model uses. The WebAssembly heap is a linear block of memory the application is responsible for using correctly, and a native model session within it must be disposed deliberately; nothing automatically reclaims it or contains the damage if it is mishandled. So a category of bug that a managed runtime would have absorbed becomes the application's problem to prevent. This is a real and often-underestimated cost of client-side ML: you inherit responsibility for resource lifecycles that server frameworks handle for you, and getting it wrong produces exactly the kind of corruption-driven, hard-to-trace failure this case study describes. Knowing that the environment offers less protection is what motivates building the structural safeguards that put the protection back.
Silent failure is worse than a crash
A theme worth drawing out from this case is that a silent failure is, from a user-trust perspective, worse than an outright crash, even though a crash feels more severe. A crash is unambiguous: something broke, the user knows it, and a clear error tells them what and suggests what to do. A silent failure produces the wrong outcome with no signal — the tool appears to run but does not deliver, or quietly does nothing — and the user is left concluding that the tool simply does not work, without any specific fault to point at. That diffuse "this is unreliable" impression is more corrosive than a specific, explained error, because it generalizes to the whole product rather than to one identifiable problem.
This is why the reliability work that came out of this bug emphasized honest failure so heavily: a tool that fails clearly when it must fail keeps the user's trust in a way a tool that fails silently never can. The goal is not to never fail — that is impossible across the diversity of real hardware and inputs — but to fail loudly and specifically rather than quietly and ambiguously. The original bug was the worst case of silent failure: wrong behavior, no error, no obvious cause. Converting that into either correct behavior or a clear error was as much about preserving trust as about correctness, because the user's sense that a tool is dependable is built precisely on its never producing a confident-looking nothing.
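In code, "fail loudly and specifically" can be as simple as refusing to pass an empty result along. The following is a minimal sketch under assumptions: `ToolExecutionError`, `ImageResult`, and `assertUsableResult` are hypothetical names, and the result is assumed to be RGBA pixel data, which may not match the product's real output type.

```typescript
// A specific, named error so callers and users can tell what went wrong.
class ToolExecutionError extends Error {
  readonly tool: string;
  readonly reason: string;

  constructor(tool: string, reason: string) {
    super(`${tool} failed: ${reason}`);
    // Keeps `instanceof` working even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "ToolExecutionError";
    this.tool = tool;
    this.reason = reason;
  }
}

// Assumed output shape: RGBA pixels, 4 bytes per pixel.
interface ImageResult {
  width: number;
  height: number;
  data: Uint8ClampedArray;
}

// Turns "the job ran and nothing appeared" into an explicit error.
function assertUsableResult(
  tool: string,
  result: ImageResult | null | undefined,
): ImageResult {
  if (result == null) {
    throw new ToolExecutionError(tool, "the model returned no output");
  }
  if (result.width <= 0 || result.height <= 0) {
    throw new ToolExecutionError(tool, "the output image has no pixels");
  }
  const expected = result.width * result.height * 4;
  if (result.data.length !== expected) {
    throw new ToolExecutionError(
      tool,
      `expected ${expected} bytes of pixel data, got ${result.data.length}`,
    );
  }
  return result;
}
```

A check like this does not fix the underlying corruption; it only guarantees that when a job produces nothing usable, the user sees a specific message rather than an empty canvas.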
Disposal calls are necessary but not sufficient
The first, local fix for the bug was to add the missing disposal call where the stale session was detected, and it is worth being clear about why that was correct but insufficient, because the distinction is the heart of the lesson. Adding the disposal made the specific code path correct: the session that was leaking got cleaned up, and that instance of the corruption stopped. But the pattern of manually managing a model session's lifecycle existed wherever the tool ran a model, so a disposal call in one wrapper did nothing for every other place the same pattern lived. The local fix removed one instance of a bug whose cause was an architecture that invited the bug in many places.
This is the difference between fixing an occurrence and fixing a class. A correct disposal call makes existing code right but leaves the system one forgotten disposal away from the same corruption reappearing somewhere else — it depends on every developer remembering to do the right thing in every place, forever. That dependence on perfect manual discipline is itself the deeper bug. Recognizing that the local fix, while correct, did not address the architecture that required the manual discipline is what motivated the structural change, because making the existing code correct is weaker than making the incorrect state impossible. The disposal call was necessary to stop the immediate bleeding and insufficient to prevent the next instance, which is exactly the realization that turns a patch into a rebuild.
The generalizable debugging lesson
Beyond the specific bug, this case carries a debugging lesson that generalizes well past background removal: when a failure is intermittent, state-dependent, and lacks a useful stack trace, stop chasing the symptom and start reasoning about what persists between operations. The instinct to debug at the point where the error surfaces is exactly wrong for this class of bug, because the surface point is the victim, not the culprit. The productive move is to ask what state survives across operations and could be corrupted or mismanaged, which redirects the investigation from the misleading symptom toward the actual cause.
The "works then degrades over a session" pattern is the key diagnostic signature, and learning to recognize it is worth more than the specific fix. It points away from logic errors in any single path and toward accumulating resource or memory problems, which narrows a vast search space dramatically. Any developer working with native resources, manual memory, or long-lived sessions in a constrained runtime will eventually meet a bug with this shape, and knowing in advance that the symptom location is a red herring and the cause is persistent state can save days of chasing the wrong thing. The transferable value of the case study is not the disposal call but the diagnostic instinct: degradation over time means look at what persists, not at where it broke.
The fix that made the bug class impossible
The durable fix was structural: stop sharing model sessions across jobs at all. Under per-job worker isolation, every job spawns a fresh Worker that is hard-terminated on completion or failure. There is no long-lived session to leave undisposed, because the worker that held the session is destroyed when the job ends. A stale session cannot corrupt the next job's memory because there is no shared memory between jobs — each runs in its own worker that is thrown away afterward. The entire category of "undisposed session corrupts a later operation" stops being possible, not because every disposal call is now correct, but because there is nothing persistent left to dispose incorrectly.
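Per-job isolation of this kind can be sketched as a single helper that owns the worker's whole lifetime. This is a sketch under assumptions rather than the product's implementation: `runIsolatedJob` and `WorkerLike` are hypothetical names, the worker is assumed to reply with exactly one message containing the result, and the worker script itself is not shown.

```typescript
// Minimal surface of the Worker API this sketch relies on. Events are typed
// loosely so that a browser Worker should satisfy the interface structurally.
interface WorkerLike {
  onmessage: ((event: any) => void) | null;
  onerror: ((event: any) => void) | null;
  postMessage(message: unknown): void;
  terminate(): void;
}

// Spawns a fresh worker for one job and always terminates it afterward.
async function runIsolatedJob<In, Out>(
  spawn: () => WorkerLike,
  input: In,
): Promise<Out> {
  const worker = spawn(); // new worker, new memory, no state from prior jobs
  try {
    return await new Promise<Out>((resolve, reject) => {
      // Assumption: the worker posts a single message holding the result.
      worker.onmessage = (event) => resolve(event.data as Out);
      worker.onerror = (event) =>
        reject(new Error(event.message ?? "worker error"));
      worker.postMessage(input);
    });
  } finally {
    // Hard-terminate on completion and on failure alike. Any session the
    // worker created is destroyed along with the worker.
    worker.terminate();
  }
}
```

In a browser, `spawn` might be something like `() => new Worker(new URL("./job.worker.js", import.meta.url), { type: "module" })`, where the script path is hypothetical. The design choice worth noting is that termination lives in a `finally` block in one place, so no individual tool has to remember it.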
There is a clean lesson in the contrast between the two fixes. The local fix — add the disposal call — made the existing code correct. The structural fix — isolate every job in a disposable worker — made the incorrectness unrepresentable. Whenever you can choose the second kind of fix, it is worth the extra work, because it converts "we must remember to do this everywhere, forever" into "this can no longer go wrong." The bug also produced a standing operating principle: when a fix like this lands, apply it across every tool that shares the pattern rather than only the one that reported the failure, because a silent bug that surfaced in one place almost certainly lives in several. That all-tools discipline, and the rebuild that came with it, are covered in the companion retrospective.