Not every document is the same

Part of: Modelling knowledge

You upload a PDF, a meeting transcript, and a knowledge base article to the same system. None of them fit into a search engine as-is, so the system breaks each one into smaller passages, short enough for a search algorithm to compare against a user’s question. From that point on, all three are treated identically: just text waiting to be found.

This is how most knowledge base systems work. It is also why searching works well for some content and poorly for others.

Why does the same system work well for articles but badly for transcripts?

A knowledge base article has structure. It opens with a summary or introduction, uses headings to signal topic changes, and presents information in a logical sequence. If you split it into chunks of roughly equal size, each chunk tends to be about one thing.

A meeting transcript has none of that. It jumps between topics. People interrupt each other. Important decisions show up in a single sentence between two tangents. If you split a transcript into chunks of the same size you would use for an article, you get fragments that start mid-sentence and end mid-thought. The vocabulary gap between how users search and how content is stored hits harder when the content was never structured to begin with.

The problem is not the chunking algorithm. The problem is that the same chunking parameters do not fit both documents.

What would you actually change per document type?

Three things, at minimum.

How much surrounding context the system sees. When the system enriches a chunk (adding a context prefix, generating questions), it needs to understand where that chunk sits in the document. But “where it sits” means different things for different content:

  • An article has its context up front. The title and introduction tell you what the whole document is about. The first few hundred tokens are enough.
  • A transcript has temporal context. What matters is what was said right before and right after the current segment. A rolling window around the chunk position is the right approach.
  • An email thread has its context at the end. The most recent reply is the most important one. Reading from the bottom up gives the system more useful context than reading from the top.
  • A PDF has explicit front matter. The title page, table of contents, and section headers tell you what the document covers. Extracting those first gives the system a map of the whole document before it looks at any single chunk.

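These four strategies can be sketched as a single dispatch function. The helper name `context_for_chunk`, the window width, and the slicing choices below are assumptions for illustration, not the actual implementation:

```python
def context_for_chunk(chunks: list[str], index: int, content_type: str) -> str:
    """Pick the surrounding text used to enrich chunks[index]."""
    if content_type == "article":
        # Context lives up front: title and introduction.
        return " ".join(chunks[:2])
    if content_type == "transcript":
        # Temporal context: a rolling window around the current segment.
        return " ".join(chunks[max(0, index - 2): index + 3])
    if content_type == "email":
        # Context sits at the end: read the most recent replies first.
        return " ".join(reversed(chunks[-2:]))
    if content_type == "pdf":
        # Front matter (title page, table of contents) maps the document.
        return chunks[0]
    # Unknown type: conservative default, no extra context.
    return ""
```

The point of the sketch is the shape, not the numbers: each branch reads a different region of the document, and only the content type decides which.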
How large the chunks should be. Dense, structured documents like PDFs tolerate larger chunks (400-800 tokens) because the information is packed tightly. Conversational content like one-on-one meetings needs much smaller chunks (100-300 tokens) because each exchange is brief and self-contained. A single chunk size for everything means either the articles are over-fragmented or the transcripts are under-fragmented.

What kind of questions to generate. HyPE generates hypothetical questions at storage time so that user queries can match question-to-question. But a question that makes sense for an article (“What is the return policy for international orders?”) looks nothing like a question that makes sense for a meeting transcript (“Who is responsible for the Q3 deadline?”). The system needs different instructions per content type.

What does a content profile actually look like?

A content profile is a set of rules that tells the ingestion pipeline (the sequence of steps a document passes through from upload to searchable) how to process a specific type of document. Each profile defines:

  • Where to look for surrounding context: the start of the document (articles), a rolling window around the current passage (transcripts), the most recent message (emails), or the title page and table of contents (PDFs)
  • How large each passage should be and how much surrounding text the system reads alongside it
  • Whether to generate hypothetical questions for this content type
  • What those questions should focus on (decisions and deadlines for meetings, how-to questions for PDFs, search reformulations for articles)

For example: a meeting transcript profile uses a rolling window for context, generates questions focused on decisions, action items, and deadlines, and uses small chunks. A PDF profile extracts front matter first, generates how-to and definition questions, and uses larger chunks.
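A profile of this shape can be expressed as a plain data structure. The field names, values, and registry keys below are illustrative, not the production schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ContentProfile:
    context_strategy: str           # e.g. "rolling_window", "front_matter"
    chunk_tokens: tuple[int, int]   # (min, max) passage size in tokens
    generate_questions: bool        # whether HyPE runs for this type
    question_focus: str             # what generated questions should target

PROFILES = {
    "meeting_transcript": ContentProfile(
        context_strategy="rolling_window",
        chunk_tokens=(100, 300),
        generate_questions=True,
        question_focus="decisions, action items, deadlines",
    ),
    "pdf": ContentProfile(
        context_strategy="front_matter",
        chunk_tokens=(400, 800),
        generate_questions=True,
        question_focus="how-to and definition questions",
    ),
}
```

Making the profile frozen is a natural choice here: the pipeline reads it, never mutates it, so a given document type behaves identically on every run.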

The profile is selected at the point of ingestion. A web crawler that detects a PDF assigns one profile; a meeting transcription service assigns another; a knowledge base article uploaded through the portal gets a third. The rest of the pipeline does not need to know which type it is processing. It just follows the profile.

What happens when the system does not know what type it is?

It falls back to a conservative default. No question generation. A generic context strategy. Medium chunk sizes. The document gets stored and becomes searchable, but without the enrichment that a known content type would receive.

This is a deliberate choice. Generating the wrong questions is worse than generating no questions. A PDF enriched with meeting-style questions (“Who made this decision?”) would pollute search results with irrelevant matches. The safe default is to do less, not more.
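As a config fragment, the conservative default might look like this; every value is an assumed placeholder, not the real configuration:

```python
# Conservative defaults for unrecognized content: do less, not more.
FALLBACK_PROFILE = {
    "context_strategy": "generic",   # no document-type-aware context
    "chunk_tokens": (200, 500),      # medium chunks, a middle-ground guess
    "generate_questions": False,     # wrong questions pollute search results
}
```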

The trade-off

Content profiles add complexity. Instead of one ingestion path, you maintain several. Each new document type means a new profile with its own parameters to tune. When a profile is wrong, the errors are subtle: slightly worse retrieval for that content type, slightly less relevant questions, slightly larger or smaller chunks than ideal.

The alternative is worse. A single set of parameters that works acceptably for articles and poorly for everything else. The inconsistency shows up as retrieval quality that varies by content type, with no obvious explanation for why some queries work and others do not.

Treating different documents differently is the cost of retrieval that works consistently across all of them.

Where Klai is

Klai currently runs six content profiles: knowledge base article, meeting transcript, one-on-one transcript, email thread, PDF document, and a conservative fallback for unrecognized content. More will follow as new source types are added. Each profile controls the context strategy, chunk size range, question focus, and whether HyPE is enabled.

The content type is set at the source. The web crawler detects PDFs by content-type header. The transcription service distinguishes meetings from one-on-ones. The portal tags articles at upload. Everything downstream follows the profile without branching logic.

All of this enrichment takes time, so documents are made searchable immediately with basic vectors and enriched in the background. A document saved right now is findable within a second; the full enrichment completes silently over the next minute.
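The two-phase pattern (searchable immediately, enriched later) can be sketched with a simple work queue. `ingest` and `enrich_next` are hypothetical names, and the in-memory dict stands in for a real vector store:

```python
import queue

# Documents waiting for background enrichment.
enrichment_queue: queue.Queue = queue.Queue()

def ingest(doc_id: str, text: str, index: dict) -> None:
    # Phase 1: store with basic vectors only -- findable right away.
    index[doc_id] = {"text": text, "enriched": False}
    enrichment_queue.put(doc_id)  # schedule phase 2

def enrich_next(index: dict) -> None:
    # Phase 2: one background-worker step pulls a document and enriches it.
    doc_id = enrichment_queue.get()
    # ...context prefixes, question generation, re-embedding would go here...
    index[doc_id]["enriched"] = True
    enrichment_queue.task_done()
```

The save path never waits on enrichment; a worker drains the queue on its own schedule, which is why the document is findable within a second while the full enrichment finishes later.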

If the pre-retrieval gate lets a query through and the retrieval pipeline runs, the quality of results depends directly on how well the content was processed at storage time. Content profiles are where that quality starts.


This is the final post in the “Retrieval that works” series. The earlier posts cover how to find knowledge base gaps, how to verify fixes, when to skip retrieval entirely, and how multiple search signals combine into one answer.