How LLMs Choose the Next Token: A Practical Guide to Decoding and Sampling

Large Language Models (LLMs) do not “think” in sentences. They generate text one token at a time, repeatedly answering a single question:

“Given everything so far, what should come next?”

guide_llm_decoding_params

The infographic above visualizes this process.

This article explains it without math, using a restaurant kitchen analogy — exactly the way you learned top-k and top-p earlier.

Large Language Models (LLMs) do not generate whole sentences in a single leap; they construct text one token at a time. At each step, the model receives all previous tokens and implicitly answers the question, “Given everything so far, what should come next?” The attached infographic visualizes this process by showing how the model produces a distribution over possible next tokens and how decoding parameters reshape that distribution before a single token is sampled and appended. The goal of this guide is to provide an intuition‑first explanation of decoding and sampling, using everyday analogies rather than formal mathematics, so that the behavior of temperature, top‑k, and top‑p feels concrete and controllable.

A useful way to think about decoding is through the metaphor of a restaurant kitchen. Imagine a head chef deciding which dish to serve next. The kitchen has many possible dishes, each with a certain likelihood of being ordered by customers, and the chef must choose one dish at a time. In the same way, an LLM’s decoder sees a very large “menu” of candidate tokens and assigns each token a score representing how appropriate it is as the next item in the sequence. The model does not forbid any dish at this early stage; it simply ranks all options, waiting for later steps to decide which subset of tokens is actually eligible for sampling.

Internally, the model first assigns a raw, unnormalized score—called a logit—to every possible token in its vocabulary. These logits can be thought of as the chef’s private ranking of all dishes on the menu before any constraints are applied. At this point, no token has been selected and none has been removed from consideration. The next stages in the decoding process transform these raw scores into a probability distribution and then prune or reshape that distribution according to user‑controlled parameters. Only after these transformations does the model draw a single token from the remaining candidates and add it to the output sequence.

Temperature is the parameter that controls how conservative or adventurous this selection process is. When the temperature is low, the model sharply favors the tokens with the highest logits, making the distribution very peaky. In the kitchen analogy, the chef almost always chooses the most popular dish, yielding consistent, predictable behavior that corresponds to deterministic or near‑deterministic output. When the temperature is high, the distribution is flattened, giving more weight to lower‑ranked tokens. Here, the chef is willing to serve less popular specials, creating more diverse and surprising text at the cost of increased variability and potential incoherence. It is important to note that temperature does not decide which tokens are allowed; it only adjusts how the model trades off between the best‑scoring token and lower‑probability alternatives when making the final draw.

Top‑k sampling controls the size of the candidate set by enforcing a hard count on how many tokens are even eligible. In plain terms, top‑k means “keep only the K highest‑scoring tokens and discard all the rest.” Before probabilities are considered, the model simply truncates the ranked list at position K. Returning to the kitchen, imagine the dishes ranked by popularity: pizza, burger, pasta, salad, soup. If top‑k is set to 3, only pizza, burger, and pasta remain available; salad and soup are entirely removed from the menu for this decision, even if someone might occasionally order them. Top‑k therefore determines the maximum number of options the model may consider at any decoding step, making the behavior more predictable and reducing the chance of extremely unlikely tokens appearing.

Top‑p, or nucleus sampling, takes a different approach by focusing on cumulative probability rather than a fixed number of options. Instead of retaining a fixed K, the model sorts tokens by probability and keeps adding them to a candidate set until their combined probability mass reaches a threshold p, such as 0.75 or 0.9. In the restaurant analogy, suppose the order probabilities are 40% for pizza, 30% for burger, 15% for pasta, 10% for salad, and 5% for soup. With top‑p set to 0.75, the chef includes pizza and burger (now at 70% combined) and then adds pasta (total 85%), at which point the threshold is exceeded and the candidate pool closes. Salad and soup are excluded for this decision. If top‑p is increased to 0.9, salad is also included, because pizza + burger + pasta + salad reach 95%. In this way, top‑p adapts the candidate set size to the model’s confidence: when one token is overwhelmingly likely, the pool may contain only a few tokens; when probabilities are spread out, the pool expands to capture enough mass.

The key distinction between these parameters is what each one directly controls. Top‑k limits the number of allowed options, regardless of how their probabilities are distributed. Top‑p limits the total probability mass covered by the candidate set and therefore adjusts dynamically as the distribution changes from context to context. Temperature, by contrast, does not discard tokens; it reshapes the underlying distribution, effectively controlling how much risk the model takes in sampling from lower‑probability candidates. A useful memory aid is that top‑k counts options, top‑p counts confidence, and temperature controls risk.

Options (k), confidence (p), risk (temperature).

Operationally, the decoder follows a consistent sequence of steps on every token. The model scores all tokens with logits. Temperature scaling divides those logits by the chosen temperature, making the distribution sharper or flatter. A softmax function then converts the scaled scores into valid probabilities that sum to one. At this point, either top‑k, top‑p, or a combination of both is applied to filter the candidate pool. Finally, the model samples exactly one token from this filtered distribution and appends it to the output. This new token becomes part of the context, and the entire process repeats until a stopping condition—such as an end‑of‑sequence token or a maximum length—is reached.

Understanding these mechanics is not just an academic exercise; it directly informs how to tune LLMs in practice. For instance, setting top‑k to 1 effectively collapses the distribution to a single choice at each step, making temperature irrelevant, because there is no remaining randomness to modulate. This corresponds to greedy decoding, where the model always selects the highest‑probability token and produces highly deterministic outputs. More generally, top‑k and top‑p determine what is even possible at each step, while temperature determines how aggressively the model favors the best‑scoring option within that constrained set.

The contrast between top‑k and top‑p also has practical implications. With top‑k, the model always considers exactly K tokens, even when the K‑th token is extremely unlikely. This can inject rare, low‑probability tokens into the candidate pool in situations where the model is actually quite confident about the top one or two choices. Top‑p avoids this by expanding or contracting the candidate set based on cumulative probability. When the distribution is very skewed, top‑p may include only a small number of tokens; when it is flatter, the candidate set may contain many more tokens to reach the same threshold. This adaptive behavior often leads to more natural text because it respects the model’s varying degree of certainty across contexts.

From an engineering perspective, decoding parameters are levers for trading off creativity against reliability. Lower temperature and smaller candidate sets (low top‑k or low top‑p) yield safer, more repetitive, and more deterministic outputs, which can be desirable in domains that require high factual accuracy or tight control, such as legal summaries or code generation. Higher temperature and larger candidate sets promote diversity and novelty, which may be useful for brainstorming, creative writing, or generating multiple alternative phrasings. Selecting appropriate defaults and ranges for these parameters is therefore part of system design, not just model usage.

Ultimately, LLMs do not magically “know” the next word; they choose under constraints that we expose and configure. If you understand what is allowed through top‑k, how much confidence is required through top‑p, and how bold the model is permitted to be through temperature, you understand the core of the decoder’s behavior. Mastery of these simple concepts enables you to reason about generation failures, explain system behavior to non‑experts, and design safer, more predictable applications built on top of large language models.