Three signals, one answer

Part of: Modelling knowledge

A colleague asks you a question at work. Not a complicated one: “Can we do that for part-time employees too?”

You know what “that” refers to because you were just discussing the new training budget policy. You know “part-time employees” narrows it down to a specific category. You do not need to think about any of this. The context is right there.

Now imagine a search engine receiving the same question. It has no idea what “that” means. It does not know what you were discussing ten seconds ago. It takes the words at face value and does its best, which in this case means searching for a vague pronoun and hoping something relevant comes back.

This is how most knowledge base retrieval works. The user asks a follow-up question, and the system treats it as if it arrived in a vacuum. What follows is what happens inside a retrieval pipeline when you do not accept that limitation.

Step 1: Figure out what the user actually means

The first thing that happens is not a search. It is a rewrite.

The system takes “Can we do that for part-time employees too?” and looks at the last few messages in the conversation. It sees the training budget discussion. It rewrites the query into something standalone: “Does the training budget policy apply to part-time employees?”

This is called coreference resolution. A small, fast language model reads the conversation history and replaces pronouns and references with their actual meaning. The rewritten query is what enters the search pipeline, not the raw input.

Without this step, the entire pipeline that follows would be searching for “that.” With it, the pipeline searches for what the user actually wants to know.
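As a minimal sketch, the rewrite step amounts to assembling a prompt from the recent turns and handing it to a small LLM. The function and turn count below are illustrative (the actual model call is omitted; any fast instruction-tuned model would fill that slot):

```python
def build_rewrite_prompt(history: list[str], raw_query: str) -> str:
    """Assemble a coreference-resolution prompt from recent turns.

    The LLM's job is to return a standalone question with pronouns
    and references replaced by what they refer to.
    """
    turns = "\n".join(f"- {t}" for t in history[-3:])  # last few turns only
    return (
        "Rewrite the final question so it stands alone, replacing "
        "pronouns and references with what they actually refer to.\n"
        f"Conversation so far:\n{turns}\n"
        f"Final question: {raw_query}\n"
        "Standalone question:"
    )

prompt = build_rewrite_prompt(
    ["We introduced a new training budget policy.",
     "It covers external courses up to a yearly cap."],
    "Can we do that for part-time employees too?",
)
# The prompt now contains both the pronoun and its likely referent,
# which is everything the rewriting model needs.
```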

Step 2: Turn the question into numbers

Search engines do not compare words. They compare vectors: lists of numbers that represent meaning in a mathematical space. Two sentences that mean similar things produce vectors that are close together, even if they use completely different words.

The rewritten query gets embedded twice, in parallel. One embedding captures the meaning as a dense vector (a long list of numbers where every position has a value). The other captures term importance as a sparse vector (a long list where most positions are zero, and only the important terms have values).

Dense vectors are good at meaning: “training budget” and “learning and development allocation” will be close together. Sparse vectors are good at precision: if the user specifically said “part-time,” that term gets a high weight.

Both representations of the same question are now ready to search.
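A toy illustration of the two shapes (the numbers and term weights here are invented; real systems produce both with learned embedding models):

```python
import math

# Dense: every position has a value; meaning lives in the geometry.
dense_query = [0.12, -0.40, 0.88, 0.05]

# Sparse: only salient terms carry weight; everything else is zero.
sparse_query = {"training": 1.4, "budget": 1.2,
                "part-time": 1.8, "employees": 0.9}

def cosine(a: list[float], b: list[float]) -> float:
    """Dense similarity: angle between vectors, 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def sparse_dot(q: dict[str, float], d: dict[str, float]) -> float:
    """Sparse similarity: only terms present in both sides contribute,
    which is what makes sparse scoring cheap and term-precise."""
    return sum(w * d.get(term, 0.0) for term, w in q.items())
```

Dense scoring rewards closeness in meaning-space; sparse scoring rewards exact term overlap weighted by importance. The pipeline needs both.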

Step 3: Search three indexes at once

Here is where it gets interesting. The vector database does not store content in one way. It stores three representations of every chunk of knowledge:

The original text, embedded. This is the chunk as it was written, converted to a dense vector. If the user’s vocabulary matches the content, this search finds it.

Hypothetical questions, embedded. At storage time, the system generated questions each chunk would answer. When a user asks something similar, it matches question-to-question instead of question-to-answer, bridging the vocabulary gap before search even happens. (This is HyPE, one of the enrichment steps that happen at storage time.)

Learned sparse vectors. A neural model assigned term weights to each chunk, producing a multilingual, context-aware sparse representation. Think keyword search, but one that understands which words actually matter.

Three searches run in parallel. Three ranked lists come back. Each list has a different opinion about what the best answer is.
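The fan-out can be sketched like this. The three search functions are placeholders returning canned results; in a real system each would query a different collection in the vector database:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-index search legs (stubbed with fixed ranked lists).
def search_text(query: str) -> list[str]:       # original text, dense
    return ["chunk-a", "chunk-b", "chunk-c"]

def search_questions(query: str) -> list[str]:  # hypothetical questions, dense
    return ["chunk-b", "chunk-a", "chunk-d"]

def search_sparse(query: str) -> list[str]:     # learned sparse vectors
    return ["chunk-b", "chunk-e", "chunk-a"]

def search_all(query: str) -> list[list[str]]:
    """Run the three legs concurrently; return three ranked lists."""
    legs = (search_text, search_questions, search_sparse)
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(leg, query) for leg in legs]
        return [f.result() for f in futures]
```

Notice that the three lists disagree: each representation has its own idea of what belongs at the top. Resolving that disagreement is the next step's job.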

Step 4: Merge the opinions

Three ranked lists, three different scoring systems. A dense similarity score of 0.82 and a sparse score of 0.65 do not mean the same thing. You cannot just add them up.

Reciprocal Rank Fusion (RRF) solves this by ignoring scores entirely and looking only at rank positions. A chunk that appears near the top in all three lists is probably relevant. A chunk that tops one list but is absent from the others might be a false positive. Agreement across different representations is a stronger signal than a high score from one.
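RRF fits in a few lines. Each chunk scores 1/(k + rank) per list it appears in, summed across lists; k = 60 is the constant from the original RRF paper, and it dampens the influence of any single top position:

```python
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists by rank position alone, ignoring raw scores."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fuse([
    ["a", "b", "c"],   # dense-text leg
    ["b", "a", "d"],   # question leg
    ["b", "e", "a"],   # sparse leg
])
# "b" sits at or near the top of all three lists, so it fuses highest,
# even though it only wins two of the three individual rankings.
```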

Step 5: A second opinion on the finalists

The fused list is good, but it is still based on vector comparisons: the query and each chunk were processed separately and compared by distance. A chunk might contain the right keywords but answer a slightly different question.

A cross-encoder reranker catches these. Instead of comparing vectors, it reads the query and each candidate together. “Does the training budget apply to part-time employees?” next to a chunk about “full-time training benefits” scores lower than it would on vector similarity alone, because the reranker understands the distinction.

This runs only on the top results from fusion. It is expensive per comparison, but it only needs to look at a short list.
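The shape of the rerank pass, sketched below. The `score_pair` function is a deliberately crude stand-in (term overlap) so the example runs end to end; a real system would replace it with a cross-encoder model such as bge-reranker-v2-m3, which reads query and chunk jointly:

```python
def score_pair(query: str, chunk: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query terms
    found in the chunk. A real model scores the pair jointly."""
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / max(len(q_terms), 1)

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Score only the short list from fusion; return it best-first."""
    short_list = candidates[:top_n]  # expensive per pair, so cap the list
    return sorted(short_list,
                  key=lambda chunk: score_pair(query, chunk),
                  reverse=True)

results = rerank(
    "training budget part-time employees",
    ["office parking rules and permits",
     "training budget applies to part-time employees"],
)
```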

What each step compensates for

This is the point of the pipeline. No single step is sufficient.

Coreference resolution fixes the input. Without it, the system would search for a pronoun. Dense search finds content that means the same thing, but misses vocabulary mismatches. HyPE catches those mismatches by matching questions to questions. Sparse search adds term-level precision that dense search lacks. Fusion combines the three opinions without needing to calibrate their scores. The reranker catches the nuances that vector similarity misses.

Remove any one step and a category of queries starts failing. The pipeline works because each component covers the blind spots of the others.

Where Klai is

Klai runs all five steps on every retrieval query. Coreference resolution first (a fast LLM call using the last three conversation turns), then dense and sparse embedding in parallel, then three-leg RRF fusion in Qdrant, then a cross-encoder reranker (bge-reranker-v2-m3) as a final precision pass.

If the pre-retrieval gate determines the query is not a knowledge question, the entire pipeline is skipped. When it does run, the full sequence takes under a second.


Next up in this series: why not every document should be processed the same way. A PDF, a meeting transcript, and a support article each need different chunking, different enrichment, and different questions generated at storage time. Read "Not every document is the same."