What is an LLM, and how does it work?
People use LLMs every day without really knowing what one is. ChatGPT, Claude, Mistral and Gemini are different brands of more or less the same thing. The technology is in everyone’s hands, but the basic mental picture stays fuzzy.
I get the question often at Klai, usually wrapped in a worry: “Does our data go to America?” or “Are our documents being used to train the model?” The honest answers depend on understanding what an LLM actually is, so let me try to explain it in plain language.
I keep technical things simple. So we start with the basics.
An LLM is a file
LLM stands for large language model. The “large” is literal: tens to hundreds of gigabytes. The “language” means it works with text. The “model” is the part most people get wrong.
A model is a file. The same way a Word document is a file or a song is a file. Inside that file are billions of numbers, called weights. Those numbers are what the model picked up during training. Nothing else.
The model is not a thinking entity sitting in a data centre somewhere. It is not a service. It is not a website. It is a file. A file that, when you point the right software at it, can produce text.
That is the most important sentence in this post. Once you see a model as a file, every other question about it becomes easier to ask.
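To make “a file full of numbers” concrete, here is a minimal sketch in Python. It writes a toy set of weights to disk and reads them back. The 1,000 weights, the file name and the 16-bit storage format are my assumptions for illustration; real model files use their own formats and hold billions of weights. The parameter counts in the usage note are illustrative too.

```python
import os
import random
import struct
import tempfile

# A toy "model": 1,000 random weights. Real models hold billions.
random.seed(0)
weights = [random.uniform(-1.0, 1.0) for _ in range(1000)]

# Assumption: each weight is stored as a 16-bit float, so 2 bytes per weight.
BYTES_PER_WEIGHT = 2

# Write the weights to a file on disk (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "toy_model.bin")
with open(path, "wb") as f:
    f.write(struct.pack(f"<{len(weights)}e", *weights))

file_size = os.path.getsize(path)  # number of weights times bytes per weight

# Reading the file back gives the same numbers again (slightly rounded).
with open(path, "rb") as f:
    loaded = struct.unpack(f"<{len(weights)}e", f.read())


def model_size_gb(num_weights, bytes_per_weight=BYTES_PER_WEIGHT):
    """Rough file size in gigabytes for a model with this many weights."""
    return num_weights * bytes_per_weight / 1e9
```

Under these assumptions, `model_size_gb(7_000_000_000)` comes to 14 GB and `model_size_gb(70_000_000_000)` to 140 GB, which is where “tens to hundreds of gigabytes” comes from.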
How does it actually answer a question?
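You type a question. The software passes it to the model. The model does math on the numbers in the file. The output is one prediction: given everything that came before, what is the most likely next word? (Strictly speaking the unit is a token, which is a word or a piece of a word, but “word” is close enough here.)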
Then the software adds that word to the conversation and asks the model again: what is the next word now? And again, and again, until the model produces a stop signal and the answer is complete.
That is the whole trick. An LLM is, at heart, a very sophisticated next-word predictor. It does this so well that the result looks like understanding. It is statistically convincing. It is not thinking.
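The loop described above can be sketched in a few lines of Python. This is a toy, not how any real model is built: the probability table is invented, it only looks at the last word, and it always picks the top candidate. A real model computes its probabilities from billions of weights, looks at the whole conversation, and the software often samples among the likely candidates instead of always taking the first.

```python
# Toy "weights": for each word, the probability of each possible next word.
# All numbers are invented for illustration.
NEXT_WORD_PROBS = {
    "<start>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "a": {"dog": 0.7, "cat": 0.3},
    "cat": {"sleeps": 0.7, "runs": 0.3},
    "dog": {"barks": 0.8, "sleeps": 0.2},
    "sleeps": {"<stop>": 1.0},
    "runs": {"<stop>": 1.0},
    "barks": {"<stop>": 1.0},
}


def predict_next(words):
    """One model call: return the most likely next word for the text so far."""
    probs = NEXT_WORD_PROBS[words[-1]]  # the toy only looks at the last word
    return max(probs, key=probs.get)


def generate(prompt, max_words=20):
    """Ask for the next word again and again until the stop signal appears."""
    words = ["<start>"] + prompt.split()
    for _ in range(max_words):
        next_word = predict_next(words)
        if next_word == "<stop>":
            break
        words.append(next_word)  # add the word, then ask again
    return " ".join(words[1:])
```

With this table, `generate("the")` returns `"the cat sleeps"` and `generate("the dog")` returns `"the dog barks"`. The answer is built one word at a time, and the table never changes while it happens.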
This matters because it means an LLM has no memory of its own. The file holds no state between calls, so each question reaches the model as if it were the first one. The software around it can remember by feeding the previous messages back in each time, but the file itself does not change after an answer. It is the same file before and after.
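Here is a small Python sketch of that division of labour. The `model` function is a hypothetical stand-in for the file: it can only react to what is in the prompt it receives. The `Chat` class is a stand-in for the software around it, and it is the only part that remembers anything.

```python
def model(prompt):
    """Stand-in for the model file: its answer depends only on the prompt."""
    knows_name = "My name is Anna" in prompt
    asked_name = "What is my name?" in prompt
    if asked_name and knows_name:
        return "Your name is Anna."
    if asked_name:
        return "I don't know your name."
    return "Nice to meet you."


class Chat:
    """Stand-in for the chat software: it keeps the history, the model does not."""

    def __init__(self):
        self.history = []

    def ask(self, message):
        self.history.append("User: " + message)
        # Every call sends the whole conversation so far, not just the new message.
        answer = model("\n".join(self.history))
        self.history.append("Assistant: " + answer)
        return answer


# Calling the model directly: no history, so it cannot know the name.
direct_answer = model("User: What is my name?")

# Going through the chat software: the earlier message is fed back in.
chat = Chat()
first_answer = chat.ask("My name is Anna")
second_answer = chat.ask("What is my name?")
```

In this sketch `direct_answer` is `"I don't know your name."` while `second_answer` is `"Your name is Anna."`. The model function is identical in both cases. The only difference is what the software chose to send along.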
Training and using are two different things
There are two phases in a model’s life, and people mix them up constantly.
Training happens once, before anyone uses the model. A company feeds it an enormous amount of text and lets it find patterns over weeks of very expensive compute. The result of all that work is the file. Training is where the privacy controversy usually lives: if the company trained on your data, your data shaped the file, and the file may produce things based on it later.
Inference is what happens every time you use the model. You give it a question, the file does math, an answer comes out. That is one call. Nothing is being added to the file. Your question does not change what the model will say to the next person, unless somebody specifically decides to feed your question into the next training run.
The privacy questions for training and for inference are completely different. Training is about whether your data influenced the model. Inference is about where your data travels at the moment you ask a question.
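The difference between the two phases can be shown with a toy in Python. Here “training” just counts which word follows which, and “inference” only reads those counts. That is a drastic simplification of how real models are trained, and the function names and example sentences are mine, but the split is the same: training writes the file, inference reads it.

```python
import copy
from collections import Counter, defaultdict


def train(corpus):
    """Training: read the text and produce the weights (here, word-pair counts)."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return {word: dict(followers) for word, followers in counts.items()}


def infer(weights, word):
    """Inference: read the weights to predict the next word. Nothing is written."""
    followers = weights.get(word, {})
    return max(followers, key=followers.get) if followers else None


corpus = ["the cat sleeps", "the cat runs", "the dog sleeps"]
weights = train(corpus)
before = copy.deepcopy(weights)

prediction = infer(weights, "the")  # uses the weights
unknown = infer(weights, "secret")  # a word the model never saw in training
unchanged = weights == before       # the "file" is the same after both calls

# Only a new training run that includes the text changes what the model knows.
retrained = train(corpus + ["secret plan"])
```

After both inference calls `unchanged` is `True`: asking about “secret” left no trace in the weights. The word only shows up in `retrained`, because somebody decided to put it into the next training run.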
Where does the math actually happen?
There are exactly two places it can happen.
Someone else’s server. You send your question over the internet to a company that runs the file. Their copy of the file does the math and the answer comes back. ChatGPT, Claude and the Mistral API all work this way. Your question leaves your network. What happens to it after that depends entirely on the provider’s terms.
Your own server. You download the file. You run it on hardware you control. The model does the math without ever talking to the company that made it. Nothing leaves your network.
Both are real options only for models whose weights are published, usually called open-source or, more precisely, open-weight models. Mistral’s open models, Meta’s Llama and OpenAI’s open-weight releases can be downloaded from a platform called Hugging Face and run anywhere you have a capable enough GPU. Closed models, like the ones behind the consumer version of ChatGPT, cannot. There is no file you can download. The only way to use them is to send your data to the provider.
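From the software’s point of view, the two options can look almost identical. The Python sketch below builds the same request twice and only checks where it would go. Nothing is sent. Both URLs and the request format are hypothetical placeholders, not a real provider’s API, and “this machine” stands in for any hardware you control.

```python
import json
from urllib.parse import urlparse

# Both URLs are hypothetical placeholders, not real endpoints.
HOSTED_URL = "https://api.provider.example/v1/chat"  # someone else's server
LOCAL_URL = "http://localhost:8000/v1/chat"          # a model you run yourself

LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def build_request(url, question):
    """Build (but never send) the request that would carry the question."""
    return {"url": url, "body": json.dumps({"question": question})}


def leaves_this_machine(url):
    """True if the request would go to a host other than this machine."""
    return urlparse(url).hostname not in LOOPBACK_HOSTS


question = "Summarise our contract with the supplier."
hosted = build_request(HOSTED_URL, question)
local = build_request(LOCAL_URL, question)

same_payload = hosted["body"] == local["body"]
hosted_leaves = leaves_this_machine(hosted["url"])
local_leaves = leaves_this_machine(local["url"])
```

The payload is the same in both cases. The only thing that differs is the address, and that address decides whose terms apply to your question.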
The questions to ask any AI vendor
Once you see the model as a file, the questions you should ask any AI vendor become very specific.
- Whose file is it? Open source, or closed?
- Where does the file live? On the vendor’s servers, on your servers, or somewhere in between?
- Whose machine does the math when you ask a question?
- What happens to your question after it gets there?
- Will your data influence the next version of the file?
The vague version of these questions is “is your AI safe and compliant.” The vague version cannot be answered honestly. The specific version can.
Where Klai sits today
I will end with what this means for us, since that is where this conversation usually starts.
Today, when you ask Klai a question, two things happen. We search your knowledge base on our own GPU in Europe using an embedding model we host ourselves. Then we send the question and the relevant context chunks to Mistral’s hosted API, which runs the language model on EU infrastructure and sends the answer back to us.
So part of the loop is inside our walls, and part of it is a call to Mistral.
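As a rough illustration of that two-step loop, here is a Python sketch. It is not our production code. The `embed` function is a crude word-count stand-in for a real embedding model, the knowledge base is three invented sentences, and the payload field names are hypothetical. What it does show is the shape of the flow: search locally first, then send only the question and the matching chunks onward.

```python
import math
import re
from collections import Counter


def embed(text):
    """Crude stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(question, chunks, top_k=1):
    """Step 1, local: rank the knowledge-base chunks by similarity to the question."""
    query = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(query, embed(c)), reverse=True)
    return ranked[:top_k]


def build_payload(question, context):
    """Step 2: what would be sent to the hosted language model (hypothetical fields)."""
    return {"question": question, "context": context}


# Invented knowledge base for illustration.
chunks = [
    "Invoices are sent on the first day of each month.",
    "The office in Groningen is closed on public holidays.",
    "Support tickets are answered within one business day.",
]

question = "When are invoices sent?"
context = search(question, chunks)
payload = build_payload(question, context)
```

In this sketch only the invoice sentence ends up in `payload`. The other two chunks never leave the search step.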
We chose Mistral because they are European, because their commercial API terms commit them to not training on customer queries, and because they sit outside the reach of the US Cloud Act. The US companies behind ChatGPT and Claude do not.
We are working toward the next step: an architecture that runs any open-source model on our own GPU, isolated per customer. Mistral today, Llama tomorrow, whatever is published next year. You pick which one fits your use case. The file lives on hardware we control, in the Netherlands, the same way we already host the embedding model.
When that lands, the loop closes inside our walls. Same conversation, no external API call.
The longer-term ambition goes further. We want everything Klai runs on to live in the Netherlands. Not just inference, not just embeddings, but every component. We are based in Groningen. We think a Dutch knowledge company should be able to choose Dutch infrastructure top to bottom, and we are building toward that on purpose.
We are not there yet, and we will be honest about where we are along the way.
If you want to walk through any of this on your own data, get in touch. No pitch, just a walkthrough.