What happens between you hitting send and the AI responding
Published: March 27, 2026 · 5 min read
AI responses feel instant until they don't. Once you start working with these tools regularly (long threads, big documents, complex prompts), the cracks start showing in ways that aren't obvious unless you know what's going on under the hood.
Step 1: Tokenization
Your message never reaches the model as raw text. Before anything else, it gets split into tokens using a tokenizer, a separate component with its own vocabulary, trained independently of the model itself. The vocabulary is built with an algorithm like byte-pair encoding, which learns to merge common character sequences into single tokens based on how often they appear in a large corpus. Common sequences get their own entry. Rarer ones get broken into pieces.
"France" is one token in most tokenizers. A made-up word like "florbinate" would get broken into two or three pieces, because the tokenizer has no entry for it and falls back to splitting it into subword chunks it does recognize. This is worth knowing because tokens are what you're billed on when using the API, and they're what counts against your context window, not characters or words.
[Interactive demo: type something and watch it become tokens. Spaces are shown as · and are usually attached to the next token, not treated separately; the splits are an approximation, and real tokenizers vary by model.]
Step 2: Context assembly
Every request sends everything: the system prompt, your entire conversation history, and your new message, all packed into one block of tokens. There's no persistent session on the model's side. Each request is stateless.
When AI apps offer a "memory" feature, that's an engineering layer someone built on top. They're storing your past conversations and injecting the relevant bits back into the context window before each call. The model itself retains nothing between requests.
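Here's a minimal sketch of what a chat app does on every turn. The `retrieve_memories` helper and the payload shape are hypothetical stand-ins, not any particular provider's API:

```python
def build_request(system_prompt, history, user_message, retrieve_memories):
    """Assemble the single block of input the model will see for one turn."""
    # "Memory" is retrieval plus injection: look up stored snippets that
    # seem relevant to the new message and splice them into the prompt.
    memories = retrieve_memories(user_message)  # hypothetical memory store
    system = system_prompt
    if memories:
        system += "\n\nRelevant facts about this user:\n" + "\n".join(memories)

    # The entire history is resent every time. Nothing persists on the
    # model's side between requests.
    return {
        "system": system,
        "messages": history + [{"role": "user", "content": user_message}],
    }
```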
[Interactive demo: what the model receives on every single request.]
One nuance worth knowing: modern inference systems use KV caching, where the attention keys and values computed for previous tokens are cached and reused rather than recomputed from scratch. This is why providers like Anthropic offer prompt caching as a feature: if your system prompt is long and consistent across requests, you can avoid paying to reprocess it every time. Even so, the fundamental model is stateless. Caching is an optimization on top of that, not a change to how the architecture works.
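As a concrete illustration, this is roughly how a cache breakpoint is marked with Anthropic's Python SDK. The model name is a placeholder, and the exact API surface may have shifted since writing, so treat this as a sketch rather than a reference:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."  # a long prompt that stays identical across requests

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Everything up to this breakpoint becomes cacheable: later
            # requests with an identical prefix reuse the stored KV state
            # instead of paying to reprocess those tokens.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "First question goes here."}],
)
print(response.content[0].text)
```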
Step 3: The forward pass
With everything assembled, the model runs a forward pass. The full token sequence goes through layers of attention and feed-forward networks, and at the end you get a probability distribution over the vocabulary, the model's best guess at what token should come next, given everything it just processed. One token gets sampled from that distribution, and that's the output of a single forward pass.
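You can sketch that last step in a few lines of numpy. The vocabulary and logits here are invented for illustration; a real model's vocabulary has tens of thousands of entries:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend the model just produced these raw scores (logits) for a tiny
# 5-entry vocabulary after processing the whole input sequence.
vocab = ["Paris", "London", "the", "croissant", "<end>"]
logits = np.array([4.1, 2.0, 1.2, 0.4, -1.0])

def sample_next_token(logits, temperature=1.0):
    # Softmax turns logits into probabilities that sum to 1.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Drawing one token from this distribution is the entire output
    # of a single forward pass.
    return rng.choice(len(probs), p=probs), probs

idx, probs = sample_next_token(logits)
print(vocab[idx], dict(zip(vocab, probs.round(3))))
```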
Step 4: Autoregressive generation
That sampled token gets appended to the sequence, and the whole thing runs again to predict the next one. This repeats until the model hits a stop condition. Each prediction is conditioned on everything before it, including what it just generated, and there's no lookahead or planning. It has no idea how a sentence ends when it starts writing it.
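The whole loop fits in a few lines, with a hypothetical `forward` standing in for the model and `sample` for a sampler like the one sketched in Step 3:

```python
def generate(prompt_tokens, forward, sample, stop_token, max_new_tokens=256):
    """Autoregressive decoding: one forward pass per generated token.

    `forward` is a hypothetical stand-in for the model: it takes the
    full sequence so far and returns next-token logits.
    """
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = forward(tokens)      # condition on everything, including own output
        next_token = sample(logits)   # commit to one token; no lookahead, no revision
        if next_token == stop_token:  # the stop condition
            break
        tokens.append(next_token)     # output becomes input for the next pass
    return tokens[len(prompt_tokens):]
```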
This is also why "think step by step" works as a prompt technique. It forces the model to generate reasoning tokens before the final answer, so by the time it's producing the answer it has more useful context to condition on. No backtracking, no revision. The model commits to each token before moving to the next, so front-loading reasoning genuinely helps.
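In prompt terms, the difference is just where the reasoning tokens land. These example messages are invented for illustration:

```python
# Same question, two shapes. In the second, the model emits its working
# before the answer, so the answer tokens are conditioned on it.
direct = {
    "role": "user",
    "content": "Is 17 * 24 bigger than 400? Answer yes or no.",
}
stepwise = {
    "role": "user",
    "content": "Is 17 * 24 bigger than 400? Think step by step, then answer yes or no.",
}
```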
[Interactive demo: watch the response generate one token at a time, starting from the user message "wait so you literally can't remember my coffee shop order?"]
This is called autoregressive generation. It's why long outputs can drift. The model is making local decisions at each step, not executing a plan. Getting a coherent structured document out of this process is genuinely harder than it looks from the outside.
Step 5: Streaming
Each token gets sent back as it's generated, which is why responses appear word by word rather than all at once. It's also why the wait before the first word appears (the time to first token) is mostly determined by input size, not output length. Processing a long prompt takes longer than processing a short one, regardless of how long the response ends up being.
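A toy generator makes the timing visible. The sleep values are invented stand-ins for prefill and decode costs, not real measurements:

```python
import time

def stream_response(prompt_tokens, output_tokens):
    # Prefill: the whole input is processed before any output exists,
    # so time to first token scales with input length.
    time.sleep(0.002 * len(prompt_tokens))
    for tok in output_tokens:
        time.sleep(0.03)  # per-token decode cost
        yield tok         # each token is sent the moment it exists

start = time.time()
for tok in stream_response(range(2000), "Sure, here you go!".split()):
    print(f"{time.time() - start:5.2f}s  {tok}")
```

Double the input and the first word arrives noticeably later, even if the response itself is one line.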
What changes when you know this
When you send a follow-up asking for a revision, the model generates a completely new response with your instruction added to the context. It's not editing. The previous output only factors in because it's sitting in the conversation history.
Context window limits aren't arbitrary caps; they're a property of the architecture. Every token of a long conversation has to fit into that single block of input. Either you hit the limit, or the app is quietly summarizing and trimming your history before each request to stay under it.
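A minimal sketch of that trimming, assuming a hypothetical `count_tokens` helper that wraps whatever tokenizer the model actually uses:

```python
def trim_history(messages, count_tokens, budget):
    """Drop the oldest turns until the conversation fits the token budget.

    `count_tokens` is a stand-in for a real tokenizer's counting function;
    production apps often summarize dropped turns instead of discarding
    them outright.
    """
    trimmed = list(messages)
    while trimmed and sum(count_tokens(m["content"]) for m in trimmed) > budget:
        trimmed.pop(0)  # oldest turn goes first
    return trimmed
```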
There's more underneath all of this, including how attention mechanisms work, what sampling temperature actually controls, and what makes inference different from training, but this is the core loop. Everything else builds on top of it.