Not every question needs the knowledge base
The instinct is that more retrieval is better. If a user types something, search the knowledge base. If they type something else, search it again. Every query is an opportunity to surface relevant content.
This instinct is wrong, or at least incomplete. Some queries are not questions. Some queries have no meaningful answer in the knowledge base. Retrieving content for them does not help the user. It adds latency, adds cost, and in the worst case, surfaces content that looks relevant but is not, causing the AI to weave unrelated knowledge into its response.
The better instinct is: retrieve only when retrieval is likely to help.
What happens when you retrieve for everything
A user types “thanks” into a chat interface backed by a knowledge base. The retrieval system embeds the query, searches the vector store, finds the five closest chunks. Those chunks are about something, because the vector store always returns results. There is no “nothing found” in a similarity search, only “the best match I have, however poor.”
Those chunks get injected into the AI’s context. The AI now has to decide what to do with them. If it is well-instructed, it ignores them. If it is not, it tries to incorporate them, producing a response that references some knowledge base article about a thank-you email template, or a customer service protocol.
The retrieval was not just unnecessary. It was actively harmful, for three separate reasons.
It produces worse answers. The AI now has chunks in its context that have nothing to do with the user’s intent, and as the thank-you example shows, not every model resists the temptation to use them. The more irrelevant context you inject, the higher the chance the AI says something strange. This is the flip side of quality at storage time: even well-stored content harms the response when it is retrieved for the wrong query.
It is slower. Embedding the query, searching the vector store, fusing results from multiple indexes, running a reranker. That is the full retrieval pipeline. For a query that was never a question, all of that latency sits between the user pressing Enter and seeing a response. Skipping retrieval for a greeting means the AI can respond immediately.
It wastes energy. Every retrieval call runs an embedding model, hits a vector database, and in most systems triggers a reranker. These are GPU-backed operations. Running them for “ok” or “bedankt” burns compute for zero value. At scale, the fraction of queries that are conversational rather than informational is significant. Not retrieving for those queries is not just a performance optimisation. It is the responsible thing to do with your infrastructure.
The same pattern applies to greetings, confirmations (“got it”, “begrepen”), short follow-ups that only make sense in the context of the previous message, and any input that is clearly conversational rather than informational.
Two layers of filtering
Filtering trivial queries is not a job for a single mechanism. Two layers, each catching different things, work better than one filter that tries to catch everything.
The first layer is simple pattern matching. Messages shorter than a few words, known greetings and confirmations in every language the system supports, and one-word replies can be matched with a regular expression. This is fast, costs nothing, and catches the obvious cases. “Hi”, “ok”, “bedankt”, “yes” never need to reach the knowledge base.
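A minimal sketch of that first layer. The phrase list and the word-count cutoff here are illustrative choices, not a production set; a real deployment would cover every language the system supports.

```python
import re

# Known greetings and confirmations, matched exactly. The list below is a
# small illustrative sample, not an exhaustive one.
TRIVIAL_PATTERNS = re.compile(
    r"^\s*(hi|hello|hey|ok|okay|yes|no|thanks|thank you|bedankt|begrepen|got it)"
    r"\s*[.!?]*\s*$",
    re.IGNORECASE,
)

def is_trivial(message: str, min_words: int = 2) -> bool:
    """Return True when the message should never reach the knowledge base."""
    if TRIVIAL_PATTERNS.match(message):
        return True
    # One-word replies and very short messages are conversational by default.
    return len(message.split()) < min_words

print(is_trivial("Thanks!"))              # True
print(is_trivial("How do I reset MFA?"))  # False
```

Because this layer is pure string matching, it costs microseconds and can sit at the very front of the request path.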
The second layer is semantic. Some queries look like real questions but are still not knowledge-base questions. “How are you?” is four words and passes the length filter, but it has no answer in any knowledge base. A semantic gate handles this by comparing the query’s meaning against a reference set of non-retrieval queries. If the query is semantically close to the reference set, retrieval is skipped.
The distinction matters because the two layers fail differently. The pattern matcher is rigid: it catches exactly what it is told to catch, nothing more. The semantic gate is flexible: it generalises from examples, but it can make mistakes. Running both means you get the certainty of exact matching where possible, and the flexibility of semantic matching where needed.
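Composed, the two layers form a short-circuit chain: the cheap exact check runs first, the semantic gate second, and the retrieval pipeline only when neither fires. A sketch of that ordering, where all the callables are hypothetical stand-ins rather than any particular system’s API:

```python
from typing import Callable, Optional

def answer_context(
    query: str,
    pattern_filter: Callable[[str], bool],   # layer 1: exact matching, free
    semantic_gate: Callable[[str], bool],    # layer 2: one embedding call
    retrieve: Callable[[str], list[str]],    # full pipeline: search, fusion, reranking
) -> Optional[list[str]]:
    """Return retrieved context, or None when retrieval was bypassed."""
    if pattern_filter(query):
        return None
    if semantic_gate(query):
        return None
    return retrieve(query)

# Toy stand-ins to exercise the flow:
ctx = answer_context(
    "how do I reset my password",
    pattern_filter=lambda q: q.strip().lower() in {"hi", "ok", "thanks"},
    semantic_gate=lambda q: False,
    retrieve=lambda q: ["chunk about password resets"],
)
```

The ordering matters: the rigid, zero-cost filter handles the bulk of trivial traffic before the flexible, slightly costlier gate ever runs.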
How a semantic gate works
The mechanism is straightforward. At startup, the system embeds a set of reference queries: greetings, chitchat, conversational fragments that should never trigger retrieval. These embeddings are cached.
When a user query arrives and passes the pattern filter, the system computes cosine similarity between the query’s embedding and every reference embedding. It then looks at the margin between the highest similarity score and the second-highest.
A high margin means the query is very close to one specific reference category and far from others. That is a confident signal: this query looks like chitchat, not like a knowledge question. Retrieval is skipped.
A low margin means the query sits in an ambiguous space. It might be conversational, but it might also be a real question that happens to resemble one. The safe default is to retrieve. False negatives (retrieving when you did not need to) cost latency. False positives (skipping retrieval for a real question) cost a wrong answer.
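The margin computation itself is a few lines. In the sketch below, `embed` is a deterministic hash-based stub standing in for a real embedding model, and the reference queries and 0.15 threshold are illustrative values, not recommendations:

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 32) -> np.ndarray:
    """Toy stand-in for an embedding model: a deterministic random unit vector."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

# Reference queries that should never trigger retrieval, embedded once and cached.
REFERENCE_QUERIES = ["how are you", "thanks a lot", "good morning", "see you later"]
REFERENCE_EMBEDDINGS = np.stack([embed(q) for q in REFERENCE_QUERIES])

def should_skip_retrieval(query: str, margin_threshold: float = 0.15) -> bool:
    """Skip retrieval only when the query is confidently close to one
    reference and clearly farther from the rest."""
    sims = REFERENCE_EMBEDDINGS @ embed(query)   # cosine similarity (unit vectors)
    top, second = np.sort(sims)[-2:][::-1]
    # Low margin -> ambiguous -> retrieve, the safe default.
    return bool((top - second) >= margin_threshold)
```

With a real embedding model, “how’s it going?” would land near the “how are you” reference with a clear margin and be bypassed, while “how do I configure SSO?” would sit far from every reference and fall through to retrieval.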
The trade-off
The gate is a bet. Every query it bypasses is a query that gets faster responses and avoids irrelevant context injection. Every query it bypasses incorrectly is a query where the user gets a response without the knowledge they needed.
The threshold controls where that bet falls. A high threshold means the gate only fires when it is very confident the query is not a knowledge question. Fewer false positives, but more unnecessary retrievals. A low threshold means the gate fires more aggressively: fewer unnecessary retrievals, but a real risk of skipping when it should not.
There is no setting that eliminates both failure modes. You choose which failure you tolerate more. For most knowledge-base use cases, a missed retrieval is worse than an unnecessary one. The threshold should lean conservative.
Where Klai is
Klai runs both layers. The first is a regex pattern matcher in the LiteLLM hook that catches known greetings and confirmations in Dutch and English, plus any message under eight characters. These are dropped before retrieval is even considered.
The second is a cosine-margin gate in the retrieval pipeline. Reference queries are embedded at startup and cached. For each incoming query, the gate computes the margin between the top similarity scores and compares it against a configurable threshold. When the margin is large enough, the full retrieval pipeline (vector search, fusion, reranking) is skipped entirely.
The gate is conservative by default: it only bypasses when the signal is strong. The retrieval_bypassed flag is logged on every request, so the ratio of bypassed to total queries is always visible.
Next up in this series: what happens when retrieval does run. How three different search signals are combined into one ranked result, and why a single similarity score is not enough. Read three signals, one answer.