
2026 · Stack & engineering · About 12 min read · Novus Stream Solutions

Shipping a model registry you can trust: auditing against Hugging Face

It is easy to list an impressive model ID in your code and hope it works. The NSS Background Remover audited its entire model registry against the Hugging Face API and Transformers.js, replacing unverified IDs with verified models or honest classical algorithms. Here is why that audit matters more than any feature.

[Figure: A model registry being checked against the Hugging Face API, with unverified entries flagged and replaced]

Overview

There is a particular kind of dishonesty that is easy to ship and hard to detect: list a famous-sounding model ID in your code, wire up something that mostly works, and let users assume the impressive model is doing the work. Nobody checks. The output looks plausible. The marketing writes itself. The NSS Background Remover's v1.1.0 release was, at its core, a refusal to do this — an audit of the entire model registry against the Hugging Face API and Transformers.js v3 to make sure every claimed model was real, available, and actually being used. This post is about why that unglamorous audit matters more than most features.

The premise is simple: if you are going to tell users a tool uses AI, the AI should be real, and you should know its real name, size, and license. Anything less is asking for trust you have not earned.

What the audit found and fixed

The audit checked the registry against the live Hugging Face API and the Transformers.js runtime, and it found gaps — five unverified model IDs that could not be confirmed to exist and run as claimed. Rather than leave them in and hope, the release replaced each one with either a verified model or an honest classical algorithm. The verified models came with their real specifications attached: vit-gpt2 image captioning at around 120 MB, TrOCR for printed text at around 500 MB, a Donut document model at around 280 MB, Whisper-base at around 140 MB, depth-anything-small at around 50 MB. Real IDs, real sizes, real licenses.

The point of replacing rather than removing is that the capability stayed available to users; only the dishonesty was removed. A feature backed by a model that might not load was swapped for one backed by a model that does, so the user experience improved at the same time the claims became true.

Classical baselines, labelled as classical

The most honest move in the release was admitting where there was no neural model and saying so. For a range of capabilities — denoise, deblur, colorize, face restore, certain generation and editing tasks — the tool uses classical, algorithmic baselines rather than heavy AI models, and it labels them that way instead of dressing them up as deep learning. A good classical algorithm is a perfectly legitimate way to solve a problem; pretending it is a neural network is not.

This matters because it sets accurate expectations. A user who knows a feature is a classical baseline understands what it will and will not do, and is not disappointed that it is not magic. The alternative — calling everything "AI" regardless of what is under the hood — erodes trust the first time a user notices the gap between the claim and the result. Honest labels are cheaper than recovered trust.

[Figure: Registry entries split into verified neural models with real sizes and honestly labelled classical baselines]
Verified neural models carry real IDs, sizes, and licenses; classical baselines are labelled as such, not hyped.
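One way to make that split hard to blur is to encode it in the registry's types. The following is a minimal sketch, not the product's actual registry: the field names, the placeholder model ID, the license value, and the "median filter" algorithm are all assumptions made for illustration. Only the approximate 50 MB figure for the small depth model comes from the release notes above.

```typescript
// Sketch of a registry entry that cannot mix up neural and classical backends.
// Field names and sample values are hypothetical, not the product's schema.
type NeuralEntry = {
  kind: "neural";
  capability: string;
  modelId: string; // Hub identifier (placeholder value below)
  approxSizeMB: number;
  license: string;
};

type ClassicalEntry = {
  kind: "classical";
  capability: string;
  algorithm: string; // plain-language name of the classical method
};

type RegistryEntry = NeuralEntry | ClassicalEntry;

// The user-facing label is derived from `kind`, so a classical entry
// has no model ID or size it could accidentally advertise.
function userFacingLabel(entry: RegistryEntry): string {
  switch (entry.kind) {
    case "neural":
      return `${entry.capability}: neural model ${entry.modelId} (~${entry.approxSizeMB} MB, ${entry.license})`;
    case "classical":
      return `${entry.capability}: classical baseline (${entry.algorithm})`;
  }
}

const sampleRegistry: RegistryEntry[] = [
  // Placeholder ID and license; only the ~50 MB size is quoted in the post.
  { kind: "neural", capability: "depth", modelId: "example-org/depth-small", approxSizeMB: 50, license: "apache-2.0" },
  // The specific algorithm is an assumption for illustration.
  { kind: "classical", capability: "denoise", algorithm: "median filter" },
];
```

Because the label comes from the discriminant rather than from free text, calling a classical baseline "AI" would require changing the entry's type, which is a visible, reviewable change.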

Why honesty is an engineering property, not a marketing one

It would be tempting to file this under marketing or ethics, but it is fundamentally an engineering decision with engineering payoffs. A registry of verified, available models is a registry that does not surprise you in production with a model ID that 404s or a runtime that cannot load what the code claims. The audit hardened the system against a whole class of "it works on my machine, why is it failing for users" bugs by ensuring the declared models and the loadable reality matched.

It also made the downstream honesty possible. You cannot ship an honest tier system — Lite, Standard, Pro with real sizes — on top of a registry full of unverified IDs, because the sizes would be fiction. The audit is the foundation the tier post stands on. Honesty compounds: verify the models, then you can size them truthfully, then you can recommend tiers a device can run, then you can claim the tool is enterprise-grade without lying.

How an unverified model ID hides in plain sight

To appreciate why the audit mattered, it helps to see how easy the failure it caught is to commit. In a codebase that wires up AI features, a model is often just a string — an identifier pointing at a hosted model — and nothing about writing that string forces it to be correct. You can type a plausible-looking identifier, wire up surrounding code that mostly works, and ship, and in development the gap may never surface because the path that actually loads the model is not exercised on every run. The dishonesty is not malicious; it is the natural result of a model reference being a string that no compiler or test necessarily checks against reality.

That is precisely what makes it dangerous: it is invisible until the moment it is not. A model ID that does not resolve, or resolves to something other than what the feature claims, will sit quietly in the code looking exactly like a correct one, and the first sign of trouble may be a user hitting a failed load in production. The audit existed to surface that hidden class of problem deliberately rather than waiting for users to find it — to check, systematically, that every model the code claimed to use was a model that genuinely existed and loaded, before anyone relied on it.
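The "it is just a string" problem can be narrowed with a branded type, so that loading code accepts only identifiers that passed a check. This is a sketch of the general technique under stated assumptions: `VerifiedModelId`, `auditedIds`, `toVerified`, and `describeLoad` are hypothetical names, and the allow-list stands in for whatever an audit run actually produces.

```typescript
// A plain string accepts any text. The brand below exists only at the type
// level and marks IDs that have passed through `toVerified`.
type VerifiedModelId = string & { readonly __verified: true };

// Hypothetical allow-list, standing in for the output of an audit run.
const auditedIds: ReadonlySet<string> = new Set(["example-org/depth-small"]);

// The only place a VerifiedModelId is created. Unknown IDs yield null.
function toVerified(id: string): VerifiedModelId | null {
  return auditedIds.has(id) ? (id as VerifiedModelId) : null;
}

// Loader entry points take the branded type, so handing them a raw string
// is a compile-time error instead of a failed load in production.
function describeLoad(id: VerifiedModelId): string {
  return `loading ${id}`;
}
```

The brand does not prove anything by itself; it only forces every model reference through one function where the real check can live.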

Checking against the source of truth

The audit was not a matter of opinion or spot-checking; it was a systematic check against authoritative sources. Every model in the registry was verified against the Hugging Face API — the canonical record of what models exist and what they are — and against Transformers.js, the runtime that actually has to load and run them in the browser. Those two together answer the only questions that matter: does this model exist as claimed, and can the runtime we ship actually load and execute it? A model that passes both is real and usable; one that fails either is a liability dressed as a feature.

Checking against the runtime, not just the registry, is the part that makes the audit thorough rather than cosmetic. A model could in principle exist on a hub but not load cleanly in the specific browser runtime the tool uses, and verifying only existence would miss that. By confirming each model against both the catalog and the execution path, the audit established that the declared models and the loadable reality matched in the environment that counts — the user's browser — which is a stronger guarantee than a paper check of identifiers against a list. The standard was not "this name looks right" but "this model loads and runs where we ship it."
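The two-stage check described above can be sketched roughly as follows. This is an illustration rather than the audit script itself: `checkModel`, `HttpGet`, and `RuntimeLoad` are hypothetical names, and the HTTP client and runtime loader are injected so the logic can be exercised without a network. The sketch assumes the Hub's model-info endpoint at `https://huggingface.co/api/models/<id>` and treats any non-OK response as "not confirmed".

```typescript
// Minimal shapes so the sketch does not depend on DOM or library typings.
type HttpGet = (url: string) => Promise<{ ok: boolean; status: number }>;
type RuntimeLoad = (task: string, modelId: string) => Promise<unknown>;

type CheckResult = {
  modelId: string;
  inCatalog: boolean;
  loadsInRuntime: boolean;
  detail: string;
};

async function checkModel(
  task: string,
  modelId: string,
  httpGet: HttpGet,
  runtimeLoad: RuntimeLoad,
): Promise<CheckResult> {
  // Stage 1: catalog. Any non-OK status counts as "not confirmed".
  const res = await httpGet(`https://huggingface.co/api/models/${modelId}`);
  if (!res.ok) {
    return { modelId, inCatalog: false, loadsInRuntime: false, detail: `catalog returned ${res.status}` };
  }
  // Stage 2: runtime. Existence alone is not enough; the model must load.
  try {
    await runtimeLoad(task, modelId);
    return { modelId, inCatalog: true, loadsInRuntime: true, detail: "ok" };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { modelId, inCatalog: true, loadsInRuntime: false, detail: `runtime load failed: ${reason}` };
  }
}
```

In a browser build, `httpGet` could plausibly be `fetch`, and `runtimeLoad` could wrap the `pipeline(task, modelId)` function from `@huggingface/transformers`. Running stage 2 in the same kind of environment the tool ships to is what makes the result meaningful.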

Replace, do not remove: keeping the capability

When the audit found the five unverified identifiers, the response was deliberate: replace each with either a verified model or an honest classical algorithm, rather than simply deleting the feature. That choice reflects a respect for the user, who wanted the capability and should not lose it just because the original implementation's honesty did not hold up. The feature stayed available; only the unverified claim behind it was swapped for something real. In several cases the user experience actually improved at the same moment the honesty did, because a feature backed by a model that reliably loads is better than one backed by an identifier that might not.

This replace-not-remove discipline is what keeps an honesty audit from becoming a feature graveyard. It would have been easier to strip out anything that did not verify and ship a smaller tool, but that punishes users for an internal problem. Finding a verified model that does the job, or an honest classical method that delivers the capability without pretending to be a neural network, preserves the value while fixing the truth. The result is a tool that both does more and claims more accurately, which is the rare combination an audit done this way produces — integrity gained without capability lost.
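The replace-not-remove rule amounts to a small selection policy: prefer a verified neural candidate, otherwise fall back to a classical method that is labelled as such. The sketch below is one hypothetical way to express that policy; `Candidate`, `chooseBackend`, and the `isVerified` callback are names invented for illustration, not the release's code.

```typescript
type Candidate =
  | { kind: "neural"; modelId: string }
  | { kind: "classical"; algorithm: string };

// Walk the candidates in preference order and return the first one that is
// safe to ship: a neural model that verified, or any classical method.
// Returns null only when the list offers neither.
function chooseBackend(
  candidates: Candidate[],
  isVerified: (modelId: string) => boolean,
): Candidate | null {
  for (const candidate of candidates) {
    if (candidate.kind === "classical" || isVerified(candidate.modelId)) {
      return candidate;
    }
  }
  return null;
}
```

A capability is dropped only when `chooseBackend` returns null, which keeps removal as the last resort rather than the default response to a failed check.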

Real sizes and licenses, attached to real models

A verified model comes with verified facts, and the audit attached those facts rather than leaving them vague. The replacements carried their genuine specifications: a vit-gpt2 captioning model at around 120 MB, a printed-text OCR model at around 500 MB, a document model at around 280 MB, a base speech model at around 140 MB, and a small depth model at around 50 MB, each with its real size and license. Those numbers are not decoration; they are what make the downstream honesty possible, because you cannot present an honest tier system or an accurate download expectation on top of models whose sizes are guesses.

The licenses matter as much as the sizes, especially for anyone using the tools commercially. Knowing a model's license is what lets a professional decide whether they can use its output in their work, and attaching the real license to the real model treats the user as someone making an informed decision rather than trusting an opaque box. This is the same respect-for-the-user thread that runs through the whole product — honest sizes so downloads hold no surprises, honest licenses so commercial use is on solid ground — and it is only possible because the audit established which models were genuinely in use in the first place.
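Sizes and licenses can be read from catalog metadata instead of being typed in by hand. The sketch below assumes a model-info response that carries a `tags` array with a `license:<id>` entry and a `siblings` file list with per-file `size` values; that matches the Hub API as far as I know (sizes appear when file metadata is requested), but treat the field names as assumptions to confirm against the current API reference. `extractFacts` and its types are hypothetical.

```typescript
// Subset of a model-info response used by this sketch (assumed field names).
type HubFile = { rfilename: string; size?: number };
type HubModelInfo = { tags?: string[]; siblings?: HubFile[] };

type Facts = { license: string | null; downloadMB: number | null };

function extractFacts(info: HubModelInfo, neededFiles: string[]): Facts {
  // Licenses are assumed to appear as tags of the form "license:<id>".
  const tag = (info.tags ?? []).find((t) => t.startsWith("license:"));
  const license = tag ? tag.slice("license:".length) : null;

  // Sum only the files the runtime will actually fetch. If any size is
  // missing, report null instead of guessing a total.
  let bytes = 0;
  for (const fileName of neededFiles) {
    const file = (info.siblings ?? []).find((f) => f.rfilename === fileName);
    if (!file || file.size === undefined) {
      return { license, downloadMB: null };
    }
    bytes += file.size;
  }
  // MB here means 1,000,000 bytes, rounded to the nearest whole number.
  return { license, downloadMB: Math.round(bytes / 1e6) };
}
```

Returning `null` when a size is unknown is deliberate: a missing number is more honest than an estimate that looks precise.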

How the audit underwrites the tier system

The model-registry audit and the honest tier system are not separate initiatives; the first is the foundation of the second. A tier that advertises a certain number of verified models, at stated sizes, can only be honest if those models genuinely exist and load at those sizes, and that is exactly what the audit established. Without it, the tiers would be built on sand, naming sizes for models that might not resolve and presenting a precision that was really fiction. The audit is what lets the tiers make a truthful promise about what a user is downloading and what their device will run.

This dependency runs in one direction and is worth making explicit: you verify the models, which lets you state their real sizes, which lets you group them into tiers a user can reason about, which lets a device probe recommend a tier that will actually run. Pull the verification out from under that chain and every link above it becomes marketing. The audit is therefore not a one-time cleanup but the load-bearing base of the suite's honesty about its AI: the tier labels, the download progress, and the device recommendations can all be trusted because the models underneath them were checked against reality first.
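The chain from verified sizes to a tier recommendation can be made concrete with a little arithmetic. In this sketch the tier names and the approximate sizes come from the post, but which model sits in which tier is a hypothetical grouping, and reducing a device probe to a single megabyte budget is a simplification made for illustration.

```typescript
type TierName = "Lite" | "Standard" | "Pro";
type SizedModel = { label: string; sizeMB: number; verified: boolean };

// Hypothetical grouping. Sizes are the approximate figures quoted in the post.
const tiers: Record<TierName, SizedModel[]> = {
  Lite: [{ label: "depth", sizeMB: 50, verified: true }],
  Standard: [
    { label: "depth", sizeMB: 50, verified: true },
    { label: "captioning", sizeMB: 120, verified: true },
    { label: "speech", sizeMB: 140, verified: true },
  ],
  Pro: [
    { label: "depth", sizeMB: 50, verified: true },
    { label: "captioning", sizeMB: 120, verified: true },
    { label: "speech", sizeMB: 140, verified: true },
    { label: "document", sizeMB: 280, verified: true },
    { label: "printed-text OCR", sizeMB: 500, verified: true },
  ],
};

// A tier total is only meaningful if every model in it was verified.
function tierTotalMB(models: SizedModel[]): number {
  if (models.some((m) => !m.verified)) {
    throw new Error("tier contains an unverified model; its size would be a guess");
  }
  return models.reduce((sum, m) => sum + m.sizeMB, 0);
}

// Recommend the largest tier whose total fits the device's budget.
function recommendTier(budgetMB: number): TierName | null {
  const largestFirst: TierName[] = ["Pro", "Standard", "Lite"];
  for (const tierName of largestFirst) {
    if (tierTotalMB(tiers[tierName]) <= budgetMB) return tierName;
  }
  return null;
}
```

With this grouping the totals are 50 MB, 310 MB, and 1090 MB, so a device with a 400 MB budget would be steered to Standard. The `verified` guard is the point of the sketch: the sum refuses to exist for a model whose size was never checked.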

Why this is reliability work, not virtue

It is tempting to file an honesty audit under ethics, but its most concrete payoff is reliability, and that is the framing that should sell it to any engineer skeptical of "honesty" as a goal. A registry of verified, loadable models is a registry that does not surprise you in production with an identifier that fails to resolve or a runtime that cannot load what the code declared. The audit hardened the system against an entire class of "works in development, fails for users" bugs by ensuring the declared models and the loadable reality agreed in the environment that ships. The honesty and the reliability are the same property viewed from two angles.

This reframing matters because it makes the audit obviously worth doing rather than a nice-to-have. Even setting aside the trust argument entirely, verifying your model registry prevents real production failures, and preventing production failures is uncontroversially valuable. The fact that it also makes the tool's claims truthful is a bonus that happens to fall out of doing the reliability work correctly. Honest and dependable turn out to be the same engineering act here — check that what you claim to use is what actually loads — which is why the audit belongs in the same family as the device-lifecycle detection and integrity checks that the reliability-hardening work added.

The cost of getting this wrong in production

It is worth being concrete about what the audit prevents, because the failure it guards against is genuinely bad when it reaches users. An unverified model identifier that does not resolve produces a failed load — a feature that simply does not work, often with no clear explanation, at the moment a user tries to use it. An identifier that resolves to something other than what the feature claims is worse, because it produces output that does not match the promise without obviously failing, which erodes trust quietly. Both are the kind of problem that surfaces only in production, on users' machines, after the code looked fine in development.

The economics of catching this early are lopsided in the audit's favor. A systematic verification pass is a bounded cost paid before shipping; a model-loading failure discovered by users is an unbounded cost paid in support, lost trust, and the slow erosion of a tool's reputation for working. For a small operation especially, where reputation is fragile and every user's first impression matters, the asymmetry is stark: a few hours of verification against the months of doubt a single visible failure can seed. The audit is cheap insurance against an expensive, reputation-damaging class of bug, which is why it was worth doing as a dedicated release rather than left to chance.

Why the registry needs ongoing attention

An audit is a snapshot, and a model registry is a living thing, so the honest framing is that verification is a discipline to maintain rather than a box ticked once. Models on a hub can move, be deprecated, or change; runtimes evolve; new capabilities get added with new model references that themselves need checking. The value of the audit is not only the five identifiers it fixed but the standard it established — that a model in the registry is one that has been verified to exist and load — and keeping that standard true requires re-checking as the registry grows and the ecosystem around it shifts.

This is why the audit is best understood as installing a practice, not performing a cleanup. The practice is simple to state: no model reference ships without verification against the catalog and the runtime, and the registry is re-audited as it changes. Treating it as ongoing is what prevents the slow re-accumulation of unverified references that would otherwise creep back in over time, the same way the all-tools discipline prevents a fixed bug from reappearing in an untouched corner. The first audit proved the registry could be made honest; keeping it honest is the standing commitment that audit implies.
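The standing practice can be backed by a small gate that runs whenever the registry changes. The sketch below assumes each entry records when it was last verified; `AuditRecord`, `findStale`, and the 30-day window in the usage note are hypothetical choices, not something the release specifies.

```typescript
type AuditRecord = { modelId: string; lastVerifiedISO: string | null };

// Return the IDs that were never verified, carry an unparseable timestamp,
// or were last verified longer ago than the allowed window.
function findStale(records: AuditRecord[], now: Date, maxAgeDays: number): string[] {
  const maxAgeMs = maxAgeDays * 24 * 60 * 60 * 1000;
  return records
    .filter((record) => {
      if (record.lastVerifiedISO === null) return true;
      const verifiedAt = Date.parse(record.lastVerifiedISO);
      // An invalid date is treated as "never verified" rather than as fresh.
      if (Number.isNaN(verifiedAt)) return true;
      return now.getTime() - verifiedAt > maxAgeMs;
    })
    .map((record) => record.modelId);
}
```

A build step could call `findStale(records, new Date(), 30)` and fail when the result is non-empty, which turns "re-audit as the registry changes" from an intention into a check.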

The transferable lesson

For anyone building AI features, the lesson generalizes past this one product. The gap between "we list an impressive model" and "we use a verified model and know its real properties" is invisible to users right up until it is catastrophic — a failed load, an output that does not match the claim, a license problem discovered too late. Auditing the registry closes that gap before it bites, and the cost is one focused release rather than a slow accumulation of trust debt.

The broader pattern across this product is that honesty is treated as a feature you build deliberately, not a posture you adopt. The reliability-hardening post shows the same instinct applied to runtime failures, and the tier post shows what an honest registry makes possible for users. The docs list the current verified model lineup if you want the reference detail.