The self-managing taxonomy is a myth
The pitch sounds right: your knowledge base learns its own structure. Documents come in, the AI discovers the categories, the taxonomy (the category structure that organises your knowledge base) manages itself. No manual maintenance. Just knowledge, sorting itself.
None of the production knowledge management systems we have studied actually work that way.
Two things that look the same and are not
There are two distinct jobs that both get called “taxonomy management,” and the confusion between them is where the self-managing claim falls apart.
The first job is document classification: when a new document arrives, decide which existing category it belongs to. This is largely automatable. Given a stable, approved set of categories, a trained model can assign incoming documents accurately and at scale. This part of the promise is real.
The second job is taxonomy curation: deciding whether a new category should exist, whether two existing categories should be merged, whether a name still fits, whether a split is needed. This is not automatable at acceptable accuracy. Enterprise Knowledge’s March 2025 industry analysis is direct: “One of the most desired features clients ask for is the use of Agentic AI to handle long-term updates to and expansion of a taxonomy. Yet decision making for taxonomy management still requires human judgment to determine whether management decisions are appropriate, aligned to organizational understanding and business objectives, and support taxonomy scaling.” Their conclusion: “to date there has been no ML or AI application or framework that can replace human decision making in this sphere.”
The self-managing claim conflates these two jobs. Once you separate them, the right architecture becomes obvious.
What topic modelling actually delivers
Topic modelling tools like BERTopic (software that analyses a collection of documents and groups them by semantic similarity to propose categories) are genuinely useful for one thing: generating an initial taxonomy proposal from unlabelled content. You have five hundred support tickets and no category structure. BERTopic reads them all, finds the clusters, and suggests names. That starting point would have taken a human analyst days. The tool does it in minutes.
The failure modes matter, though.
Between 20 and 40 percent of documents in a real-world collection end up unclassified by default (BERTopic's default clustering routes anything that does not fit a cluster into an outlier bucket, topic -1). Not because they are bad documents: they sit between clusters, or cover topics the tool did not detect as significant. These are not noise. They are legitimate knowledge the model failed to place. Someone still has to handle them.
The category structure shifts over time. BERTopic uses a decay mechanism when learning incrementally: newer documents carry more weight, older ones less. The taxonomy you activate today is not the same taxonomy at month six, even if your actual subject matter has not changed. Category IDs drift. Downstream tagging breaks. For a system where stable category IDs matter (and they do, for retrieval and gap detection), incremental automatic updates without a review step are a production problem.
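One way to defend against renumbering is a thin persistence layer between the model's volatile cluster numbers and the category IDs the rest of the system depends on. The sketch below is an assumption about how such a layer could look, not a real Klai or BERTopic API: after a refit, each new cluster inherits the stable ID of the old category it overlaps most, and anything without a strong match becomes a candidate for human review rather than a silently activated category.

```python
# Illustrative persistent-ID layer; names and thresholds are assumptions.
def remap(old_members: dict[str, set[str]],
          new_clusters: dict[int, set[str]],
          min_overlap: float = 0.5) -> dict[int, str]:
    """Map volatile cluster numbers to stable category IDs by Jaccard overlap."""
    mapping, next_id = {}, len(old_members)
    for num, docs in new_clusters.items():
        best_id, best_score = None, 0.0
        for cat_id, members in old_members.items():
            score = len(docs & members) / max(len(docs | members), 1)
            if score > best_score:
                best_id, best_score = cat_id, score
        if best_id is not None and best_score >= min_overlap:
            mapping[num] = best_id           # same category, new number
        else:
            mapping[num] = f"cat-{next_id}"  # new cluster: queue for review
            next_id += 1
    return mapping

old = {"cat-0": {"d1", "d2", "d3"}, "cat-1": {"d4", "d5"}}
new = {7: {"d1", "d2", "d3", "d6"}, 2: {"d9", "d10"}}
print(remap(old, new))  # {7: 'cat-0', 2: 'cat-2'}
```

Cluster 7 keeps the identity of `cat-0` even though the model renumbered it; cluster 2 gets a fresh ID because nothing it contains matches an existing category.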
Running the same tool twice on the same documents can produce a different category structure. The process involves randomness by default: the dimensionality-reduction step is stochastic, so two runs on identical content are not guaranteed to produce identical output. Two people setting up the taxonomy independently get different results. This is fixable (pin the random seed), but it requires knowing it is a problem.
None of this makes the tooling useless. It makes it a proposal generator, not a decision maker.
The right boundary
Automate what can be automated:
- Tagging incoming documents to the existing approved taxonomy
- Detecting when too many documents are going unclassified: a signal that the taxonomy needs attention
- Generating candidate names and descriptions for newly detected clusters
- Flagging when two categories look semantically similar enough to consider merging
Require a human decision for:
- Activating a new category
- Merging two existing categories
- Renaming a category
- Splitting one category into two
The reason the second list needs a human is not that the AI cannot generate a suggestion. It is that these decisions carry organisational context the system does not have. Whether two categories should merge depends on whether the distinction matters to the people who use the knowledge base. Whether a name fits depends on whether it matches how your team actually thinks. A false positive on a new category (activating a cluster that looked meaningful and turned out not to be) creates noise that persists until someone notices and cleans it up.
The lightweight process
Keeping a human in the loop does not mean lengthy review sessions. The minimum viable version is a review queue: the system generates a suggestion, one person approves or rejects it, the approved change is applied. Research on active learning annotation practice puts the time cost at two to four hours per quarter for a single reviewer managing a collection of several thousand documents, if the queue surfaces only the uncertain cases rather than everything.
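The review queue described above can be modelled in a few dozen lines. This is a sketch of the pattern, not Klai's API: the proposal kinds mirror the human-decision list (new, merge, rename, split), and the only path by which the taxonomy changes runs through an explicit human decision.

```python
from dataclasses import dataclass, field

@dataclass
class Proposal:
    kind: str          # "new", "merge", "rename", or "split"
    detail: dict
    status: str = "pending"

@dataclass
class ReviewQueue:
    taxonomy: set[str]
    pending: list[Proposal] = field(default_factory=list)

    def suggest(self, proposal: Proposal) -> None:
        self.pending.append(proposal)            # the system proposes...

    def decide(self, proposal: Proposal, approve: bool) -> None:
        proposal.status = "approved" if approve else "rejected"
        if approve and proposal.kind == "new":   # ...a human decides
            self.taxonomy.add(proposal.detail["name"])

queue = ReviewQueue(taxonomy={"billing", "auth"})
queue.suggest(Proposal("new", {"name": "onboarding"}))
queue.decide(queue.pending[0], approve=True)
print(sorted(queue.taxonomy))  # ['auth', 'billing', 'onboarding']
```

Rejected proposals are kept rather than discarded, so the same spurious cluster does not resurface every quarter as a fresh suggestion.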
That is the overhead of running the process correctly. It is not large. The alternative is a taxonomy that drifts, produces unreliable category assignments, and eventually requires a larger cleanup than the incremental reviews would ever have cost.
Where Klai is
Klai does not have this solved yet. What exists today is document classification against an existing taxonomy: new content gets tagged to the categories you have approved. The part we are building is the review layer: the interface where taxonomy proposals surface, a reviewer approves or adjusts them, and the change propagates correctly without breaking downstream tagging. The same editorial pattern as the self-improving loop: the system proposes, a human decides, the change propagates.
Next up in this series: why quality in a knowledge base is determined at the moment content is stored, not at the moment it is retrieved, and why fixing retrieval after the fact is always the more expensive path. Read quality happens at storage, not at search.