While building my own LLM-based application, I found many prompt engineering guides, but few equivalent guides for determining the temperature setting.
Of course, temperature is a simple numerical value while prompts can get mind-blowingly complex, so it may feel trivial as a product decision. Still, choosing the right temperature can dramatically change the nature of your outputs, and anyone building a production-quality LLM application should choose temperature values with intention.
In this post, we'll explore what temperature is and the math behind it, potential product implications, and how to choose the right temperature for your LLM application and evaluate it. At the end, I hope that you'll have a clear course of action to find the right temperature for every LLM use case.
What is temperature?
Temperature is a number that controls the randomness of an LLM's outputs. Most APIs limit the value to a range like 0 to 1 (OpenAI, for example, allows up to 2) to keep the outputs within semantically coherent bounds.
From OpenAI's documentation:
"Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic."
Intuitively, it's like a dial that can adjust how "explorative" or "conservative" the model is when it spits out an answer.
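For example, here is a minimal sketch of where that setting lives when calling a model through the OpenAI Python SDK (the model name and prompt are just placeholders):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Suggest a gift for a pickleball fan."}],
    temperature=0.2,  # closer to 0 = more focused and deterministic
)
print(response.choices[0].message.content)
```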
What do these temperature values mean?
Personally, I find the math behind the temperature field very interesting, so I'll dive into it. But if you're already familiar with the innards of LLMs or you're not interested in them, feel free to skip this section.
You probably know that an LLM generates text by predicting the next token after a given sequence of tokens. In its prediction process, it assigns probabilities to all possible tokens that could come next. For example, if the sequence passed to the LLM is "The giraffe ran over to the...", it might assign high probabilities to words like "tree" or "fence" and lower probabilities to words like "apartment" or "book".
But let's back up a bit. How do these probabilities come to be?
These probabilities usually come from raw scores, known as logits, that are the results of many, many neural network calculations and other machine learning techniques. These logits are gold; they contain all the valuable information about what tokens could be selected next. But the problem with these logits is that they don't fit the definition of a probability: they can be any number, positive or negative, like 2, or -3.65, or 20. They're not necessarily between 0 and 1, and they don't necessarily all add up to 1 like a nice probability distribution.
So, to make these logits usable, we need to use a function to transform them into a clean probability distribution. The function typically used here is called the softmax, and it's essentially an elegant equation that does two important things:
- It turns all the logits into positive numbers.
- It scales the logits so they add up to 1.

The softmax function works by taking each logit, raising e (around 2.718) to the power of that logit, and then dividing by the sum of all these exponentials. So the highest logit will still get the highest numerator, which means it gets the highest probability. But other tokens, even with negative logit values, will still get a chance.
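To make that concrete, here is a toy sketch of the softmax step in Python (the logit values are made up for illustration):

```python
import math

# Made-up logits for four candidate next tokens after "The giraffe ran over to the..."
logits = {"tree": 2.0, "fence": 1.5, "apartment": -1.0, "book": -3.65}

# Softmax: exponentiate each logit, then divide by the sum of the exponentials
exps = {token: math.exp(z) for token, z in logits.items()}
total = sum(exps.values())
probs = {token: round(e / total, 3) for token, e in exps.items()}

print(probs)  # roughly {'tree': 0.602, 'fence': 0.365, 'apartment': 0.03, 'book': 0.002}
```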
Now here's where temperature comes in: temperature modifies the logits before applying softmax. The formula for softmax with temperature is:

P(token_i) = exp(z_i / T) / Σ_j exp(z_j / T)

where z_i is the logit for token i and T is the temperature.
When the temperature is low (T < 1), dividing the logits by T scales them up and spreads them further apart. Exponentiation then makes the highest value much larger than the others, making the probability distribution more uneven. The model has a higher chance of picking the most probable token, resulting in a more deterministic output.
When the temperature is high (T > 1), dividing the logits by T pulls the values closer together, spreading the probability distribution more evenly. This means the model is more likely to pick less probable tokens, increasing randomness.
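Continuing the toy sketch from above, you can see the effect by dividing the same made-up logits by different temperatures before applying softmax:

```python
import math

def softmax_with_temperature(logits, temperature):
    # Scale the logits by 1/T, then exponentiate and normalize
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [round(e / total, 3) for e in exps]

logits = [2.0, 1.5, -1.0, -3.65]  # same made-up logits: "tree", "fence", "apartment", "book"

for T in (0.2, 1.0, 2.0):
    print(T, softmax_with_temperature(logits, T))

# 0.2 [0.924, 0.076, 0.0, 0.0]      sharper: the top token is picked almost every time
# 1.0 [0.602, 0.365, 0.03, 0.002]   the plain softmax from before
# 2.0 [0.485, 0.378, 0.108, 0.029]  flatter: less probable tokens get a real chance
```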
How to choose temperature
Of course, the best way to choose a temperature is to play around with it. I believe any temperature, like any prompt, should be substantiated with example runs and evaluated against other possibilities. We'll discuss that in the next section.
But before we dive into that, I want to highlight that temperature is a crucial product decision, one that can significantly influence user behavior. It may seem rather straightforward to choose: lower for more accuracy-based applications, higher for more creative applications. But there are tradeoffs in both directions with downstream consequences for user trust and usage patterns. Here are some subtleties that come to mind:
- Low temperatures can make the product feel authoritative. More deterministic outputs can create the illusion of expertise and foster user trust. However, this can also lead to gullible users. If responses are always confident, users might stop critically evaluating the AI's outputs and just blindly trust them, even if they're wrong.
- Low temperatures can reduce decision fatigue. If you see one strong answer instead of many options, you're more likely to take action without overthinking. This might lead to easier onboarding or lower cognitive load while using the product. Conversely, high temperatures could create more decision fatigue and lead to churn.
- High temperatures can encourage user engagement. The unpredictability of high temperatures can keep users curious (like variable rewards), leading to longer sessions or increased interactions. Conversely, low temperatures might create stagnant user experiences that bore users.
- Temperature can affect the way users refine their prompts. When answers are unexpected with high temperatures, users might be driven to clarify their prompts. But with low temperatures, users may be forced to add more detail or expand on their prompts in order to get new answers.
These are broad generalizations, and of course there are many more nuances with every specific application. But in most applications, the temperature can be a powerful variable to adjust in A/B testing, something to consider alongside your prompts.
Evaluating different temperatures
As developers, we're used to unit testing: defining a set of inputs, running those inputs through a function, and getting a set of expected outputs. We sleep soundly at night when we ensure that our code is doing what we expect it to do and that our logic is satisfying some clear-cut constraints.
The promptfoo package lets you perform the LLM-prompt equivalent of unit testing, but there's some additional nuance. Because LLM outputs are non-deterministic and often designed to do more creative tasks than strictly logical ones, it can be hard to define what an "expected output" looks like.
Defining your "expected output"
The simplest evaluation tactic is to have a human rate how good they think some output is, according to some rubric. For outputs where you're looking for a certain "vibe" that you can't express in words, this will probably be the most effective method.
Another simple evaluation tactic is to use deterministic metrics: things like "does the output contain a certain string?", "is the output valid JSON?", or "does the output satisfy this JavaScript expression?". If your expected output can be expressed in these ways, promptfoo has your back.
A more interesting, AI-age evaluation tactic is to use LLM-graded checks. These essentially use LLMs to evaluate your LLM-generated outputs, and can be quite effective if used properly. Promptfoo offers these model-graded metrics in multiple forms. The whole list is here, and it contains assertions ranging from "is the output relevant to the original query?" to "compare the different test cases and tell me which one is best!" to "where does this output rank on this rubric I defined?".
Example
Let's say I'm creating a consumer-facing application that comes up with creative gift ideas, and I want to empirically determine what temperature I should use with my main prompt.
I might want to evaluate metrics like relevance, originality, and feasibility within a certain budget, and make sure that I'm picking the right temperature to optimize those factors. If I'm comparing GPT-4o-mini's performance with temperatures of 0 vs. 1, my test file might start like this:
providers:
  - id: openai:gpt-4o-mini
    label: openai-gpt-4o-mini-lowtemp
    config:
      temperature: 0
  - id: openai:gpt-4o-mini
    label: openai-gpt-4o-mini-hightemp
    config:
      temperature: 1
prompts:
  - "Come up with a one-sentence creative gift idea for a person who is {{persona}}. It should cost under {{budget}}."
tests:
  - description: "Mary - attainable, under budget, original"
    vars:
      persona: "a 40 year old woman who loves natural wine and plays pickleball"
      budget: "$100"
    assert:
      - type: g-eval
        value:
          - "Check if the gift is easily attainable and reasonable"
          - "Check if the gift is likely under $100"
          - "Check if the gift would be considered original by the average American adult"
  - description: "Sean - answer relevance"
    vars:
      persona: "a 25 year old man who rock climbs, goes to raves, and lives in Hayes Valley"
      budget: "$50"
    assert:
      - type: answer-relevance
        threshold: 0.7
I'll probably want to run the test cases repeatedly to test the effects of temperature changes across multiple same-input runs. In that case, I would use the repeat param like:
promptfoo eval --repeat 3

Conclusion
Temperature is a simple numerical parameter, but don't be deceived by its simplicity: it can have far-reaching implications for any LLM application.
Tuning it just right is key to getting the behavior you want: too low, and your model plays it too safe; too high, and it starts spouting unpredictable responses. With tools like promptfoo, you can systematically test different settings and find your Goldilocks zone: not too cold, not too hot, but just right.