What Is AI Model Temperature and How to Set It
Temperature is the most misunderstood setting in the entire LLM API stack. Most developers leave it at the default, and most tutorials get the math wrong.
AI model temperature is a sampling parameter that scales the model's output logits before the softmax function, controlling how sharply or flatly probability is distributed across possible next tokens — lower values make output more deterministic, higher values make it more random.
TL;DR
- Temperature scales logits before softmax — it reshapes the probability distribution the model samples from
- OpenAI supports 0 to 2, Anthropic supports 0 to 1, Google Gemini supports 0 to 2, all with a default of 1.0 in 2026
- Use 0 to 0.3 for code, classification, and factual Q&A; 0.7 to 1.0 for conversation; 0.8 to 1.2 for creative writing
- Temperature 0 is NOT fully deterministic in practice due to floating-point math and batching on GPUs
- Never set temperature and top_p aggressively together — pick one as your primary sampling control
What Temperature Actually Does Mathematically
Every time a language model generates a token, it produces a vector of logits — raw, unnormalized scores, one per token in the vocabulary. Those logits get converted to probabilities using the softmax function. The model then samples from that probability distribution.
Temperature is a scalar that divides the logits before softmax. Lower temperature makes the biggest logit dominate; higher temperature flattens the differences.
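In symbols, the scaling looks like this. The notation is introduced here for illustration: $z_i$ is the logit for token $i$, $T$ is the temperature, and $p_i$ is the resulting probability.

```latex
p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}, \qquad
\frac{p_i}{p_j} = \exp\!\left(\frac{z_i - z_j}{T}\right)
```

The ratio on the right makes the effect concrete: a logit gap of 2 gives odds of about $e^{2} \approx 7.4$ at $T = 1$, about $e^{4} \approx 54.6$ at $T = 0.5$, and about $e^{1} \approx 2.7$ at $T = 2$.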
At temperature 0, dividing by the temperature is undefined, so implementations treat it as the limiting case: softmax collapses into an argmax and the model always picks the single most probable token. This is called greedy decoding.
At temperature 1, softmax runs on the raw logits unchanged. This is the "native" distribution the model learned to produce during training.
At temperature 2, every logit is halved, which compresses the gaps between them, so rare tokens get sampled more often. Output quality usually degrades above 1.2 because low-probability tokens introduce noise faster than creativity.
This is why temperature is not "randomness" in the casual sense. It's a reshape of an existing distribution. If the model is 99% confident in one token, raising temperature to 1.5 still leaves that token as the most likely choice — it just gives alternatives more chance.
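A minimal sketch of that reshaping, using only the standard library. The three logits are made up for illustration and chosen so the top token holds roughly 99% of the probability at temperature 1; the function name is likewise hypothetical, not any provider's API.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then apply softmax."""
    if temperature <= 0:
        # Temperature 0 is handled as greedy argmax, not by this formula.
        raise ValueError("temperature must be > 0")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 3-token vocabulary.
logits = [6.0, 1.0, 0.5]
for t in (0.5, 1.0, 1.5, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At every temperature in the loop the first token stays the most likely one; only the share left for the alternatives changes.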
Temperature Ranges and Defaults Across the Major APIs
These values were pulled directly from official documentation in April 2026.
| Model / API | Min | Max | Default |
|---|---|---|---|
| OpenAI GPT-4o, GPT-4.1 | 0.0 | 2.0 | 1.0 |
| OpenAI Realtime API | 0.6 | 1.2 | 0.8 |
| Anthropic Claude | 0.0 | 1.0 | 1.0 |
| Google Gemini 2.x | 0.0 | 2.0 | 1.0 |
| Cohere Command | 0.0 | 5.0 | 0.3 |
| Mistral API | 0.0 | 1.5 | 0.7 |
Two gotchas people run into.
Claude caps at 1.0, not 2.0. If you're porting a prompt from GPT-4 where you used temperature 1.3, you can't replicate it on Claude. Anthropic also changed their Console default from 0 to 1 in 2025 to match the API default.
The Realtime API has a tighter range. OpenAI recommends sticking near 0.8 for Realtime because the lower bound is 0.6, and the voice pipeline behaves unpredictably outside the narrow window.
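When one codebase talks to several providers, a small guard can stop an out-of-range value from reaching the API. The sketch below copies its ranges from the table above; the provider keys are illustrative labels rather than official identifiers, and the ranges should be re-checked against current documentation.

```python
# Ranges copied from the table above. Keys are illustrative labels only.
TEMPERATURE_RANGES = {
    "openai": (0.0, 2.0),
    "openai_realtime": (0.6, 1.2),
    "anthropic": (0.0, 1.0),
    "gemini": (0.0, 2.0),
    "cohere": (0.0, 5.0),
    "mistral": (0.0, 1.5),
}

def clamp_temperature(provider, requested):
    """Return (usable_temperature, was_clamped) for the given provider."""
    low, high = TEMPERATURE_RANGES[provider]
    clamped = min(max(requested, low), high)
    return clamped, clamped != requested

print(clamp_temperature("anthropic", 1.3))
print(clamp_temperature("openai", 1.3))
```

Clamping only keeps the request inside the supported range. It does not reproduce the distribution the original value produced on another model, so a clamped value is a signal to retune.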
Recommended Temperature Settings by Use Case
After working across dozens of production LLM applications, these ranges hold up consistently.
| Use case | Recommended temperature | Why |
|---|---|---|
| Classification, labeling, extraction | 0.0 to 0.2 | Determinism matters more than style |
| Code generation | 0.0 to 0.2 | Structural correctness is binary |
| Factual Q&A, RAG | 0.1 to 0.3 | Minimizes paraphrase drift from sources |
| Summarization | 0.3 to 0.6 | Balances fidelity with readability |
| Professional writing, email | 0.4 to 0.7 | Natural tone without weird word choices |
| General chat | 0.7 to 1.0 | Industry-standard conversational balance |
| Creative writing, fiction | 0.8 to 1.2 | Diverse word choice and phrasing |
| Brainstorming, ideation | 1.0 to 1.3 | Maximum exploration of concept space |
Run an A/B test. For any production prompt, generate 10 outputs at temperature 0.3, 0.7, and 1.0. Score each for accuracy, tone, and variance. The right setting almost always sits in a narrow window you'd miss by sticking to defaults.
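The sweep can be sketched as a small harness. Everything here is an assumption for illustration: `fake_generate` is a stand-in for a real API call that samples from four canned replies, and the only metric computed is how many distinct outputs appear. Accuracy and tone still need a human or a separate grader.

```python
import math
import random
from collections import Counter

def fake_generate(prompt, temperature, rng):
    """Hypothetical stand-in for an API call: samples one of four canned replies."""
    replies = ["A", "B", "C", "D"]
    logits = [3.0, 1.0, 0.5, 0.0]
    if temperature == 0:
        return replies[0]  # greedy choice
    weights = [math.exp(z / temperature) for z in logits]
    return rng.choices(replies, weights=weights, k=1)[0]

def sweep(prompt, temperatures, n=10, seed=0, generate=fake_generate):
    """Generate n outputs per temperature and summarize their variety."""
    rng = random.Random(seed)
    report = {}
    for t in temperatures:
        outputs = [generate(prompt, t, rng) for _ in range(n)]
        report[t] = {
            "distinct": len(set(outputs)),
            "most_common": Counter(outputs).most_common(1)[0],
        }
    return report

for t, stats in sweep("example prompt", [0.3, 0.7, 1.0]).items():
    print(t, stats)
```

To use it against a real model, pass a `generate` callable with the same three arguments that wraps your client.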
Temperature vs Top_p vs Top_k: When to Use Which
These three parameters all control sampling, but they do different things.
Temperature reshapes the entire probability distribution. Every token's probability changes.
Top_k filters the distribution to only the k most probable tokens, then samples from those. A fixed-size candidate pool. Default is usually 40 or 50 where exposed; OpenAI doesn't expose it directly.
Top_p (nucleus sampling) filters to the smallest set of tokens whose cumulative probability is at least p (typically 0.9 or 0.95). Adaptive pool size — it shrinks when the model is confident, grows when it's uncertain.
The practical rule: use temperature alone for most cases. Temperature plus top_p works for creative tasks because top_p kills the long tail of nonsense tokens while temperature keeps diversity in the meaningful candidates. Temperature plus top_k works for structured tasks where you want predictability.
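To make the difference between the two filters concrete, here is a minimal sketch that applies each to a plain list of probabilities. The function names and the example distributions are made up for illustration; real inference stacks apply these filters to logits on the accelerator.

```python
def _renormalize(probs, keep):
    """Zero out tokens not in `keep` and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    """Keep the k most probable tokens: a fixed-size pool."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, order[:k])

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, keep)

confident = [0.92, 0.04, 0.03, 0.01]  # hypothetical: model is sure
uncertain = [0.35, 0.30, 0.20, 0.15]  # hypothetical: model is unsure
print(top_p_filter(confident, 0.9))
print(top_p_filter(uncertain, 0.9))
print(top_k_filter(uncertain, 2))
```

With the same `p`, the confident distribution keeps a single candidate while the uncertain one keeps all four, which is the adaptive behavior described above. Top_k keeps two candidates in both cases.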
What to avoid: setting both temperature above 1 and top_p above 0.95 on the same request. You're telling the model "be random" and "keep all candidates" simultaneously, which amplifies noise. Claude 4.5 and later models refuse to accept both at once for this reason.
The Most Common Misconception: Temperature 0 Is Not Fully Deterministic
Every tutorial says "set temperature to 0 for deterministic output." In practice, this is only partially true, and the gap matters if you're building eval pipelines or reproducibility-critical systems.
Temperature 0 makes the sampling step deterministic — it forces greedy selection of the highest-probability token. But the rest of the inference pipeline is not deterministic.
Three things make identical prompts produce different outputs even at temperature 0:
- GPU floating-point non-associativity. Floating-point arithmetic is not associative: adding three floats in different orders can give slightly different results. Parallel matrix multiplies on GPUs reorder operations based on load, so the same computation can yield slightly different logits run-to-run.
- Batching effects. Production inference servers batch requests together. Which requests get grouped affects how attention and layer norm behave at the edges, subtly shifting logits.
- Mixture-of-Experts routing. Modern MoE models (GPT-4, Gemini, and newer Claude models are reported to be among them) route tokens to experts based on the batch, not just the input. A paper from 2024 showed MoE models are "batch-deterministic" but not "sequence-deterministic."
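The floating-point point is easy to verify on any machine. This small demonstration uses addition on three ordinary Python floats; GPU kernels hit the same effect at much larger scale when they sum partial results in a different order.

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # add a and b first
right = a + (b + c)  # add b and c first

print(left, right, left == right)
# prints: 0.6000000000000001 0.6 False
```

A difference this small is usually harmless, but when two candidate tokens have nearly tied logits it can be enough to change which one wins the argmax.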
If you need true reproducibility, cache normalized prompt-plus-parameters hashes and serve the first completion for identical inputs. Don't rely on temperature 0.
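One way to sketch that cache, under stated assumptions: the normalization here is just whitespace collapsing plus sorted JSON keys, the store is an in-memory dict, and `generate` is a hypothetical callable standing in for your API client. A production version would need persistence and an eviction policy.

```python
import hashlib
import json

def cache_key(model, prompt, params):
    """Hash the model, normalized prompt, and sampling parameters."""
    normalized = {
        "model": model,
        "prompt": " ".join(prompt.split()),  # assumed normalization: collapse whitespace
        "params": params,
    }
    payload = json.dumps(normalized, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class CompletionCache:
    """Serve the first completion seen for each identical request."""

    def __init__(self, generate):
        self._generate = generate  # hypothetical: (model, prompt, params) -> str
        self._store = {}

    def complete(self, model, prompt, params):
        key = cache_key(model, prompt, params)
        if key not in self._store:
            self._store[key] = self._generate(model, prompt, params)
        return self._store[key]
```

Including the model name and every sampling parameter in the key matters: a model version change or a different top_p should miss the cache rather than return a stale completion.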
Common Mistakes Developers Make
Accepting defaults. The default is a compromise across all possible use cases. Your use case has a better setting. Test.
Confusing low temperature with high accuracy. Temperature 0 gives the most probable token, not the correct one. If the model was trained on bad data, low temperature just locks in the bad answer. Accuracy comes from grounding (RAG, tools, structured output), not temperature.
Setting temperature and top_p both aggressively. As covered above, this produces noise. Pick one.
Using high temperature for code. Output falls apart fast: syntax errors and hallucinated APIs explode past 1.0.
Porting temperature values across models. Temperature 0.7 on Claude is not the same distribution as 0.7 on GPT-4. Each model's native logits have different spreads. Recalibrate per model.
How to Tune Temperature for a Production Workflow
A repeatable process for picking the right temperature on any new prompt.
Step one: start at the default. Generate 5 outputs at temperature 1.0 and read them. If they look right, keep it and move on to step four. Many use cases need no further tuning.
Step two: if output feels chaotic, drop temperature. Move in 0.2 increments: 0.8, 0.6, 0.4. At each step, generate 5 outputs and check whether they're still useful. Stop when you lose desirable variation.
Step three: if output feels flat, raise temperature. Move up in 0.1 increments: 1.1, 1.2, 1.3. Watch for hallucination or degraded grammar. Stop the moment quality drops.
Step four: lock the value with an eval. Pick 20 representative inputs. Run each at your chosen temperature three times. Score for correctness, tone, and variance. If variance is unacceptably high, lower temperature by 0.1. If outputs are too repetitive, raise by 0.1.
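The variance half of step four can be sketched in a few lines. The thresholds (`max_varied`, `min_varied`) are invented for illustration and should be set per application, `generate` is a hypothetical callable taking an input and a temperature, and correctness and tone scoring are left out because they need a grader.

```python
def varied_fraction(inputs, temperature, generate, runs=3):
    """Fraction of inputs whose repeated runs did not all produce the same output."""
    varied = 0
    for item in inputs:
        outputs = {generate(item, temperature) for _ in range(runs)}
        if len(outputs) > 1:
            varied += 1
    return varied / len(inputs)

def adjust(temperature, fraction, max_varied=0.5, min_varied=0.1, step=0.1):
    """Nudge temperature by one step based on observed variance (thresholds are assumptions)."""
    if fraction > max_varied:
        return round(max(0.0, temperature - step), 2)  # too much variance: lower
    if fraction < min_varied:
        return round(temperature + step, 2)            # too repetitive: raise
    return temperature

print(adjust(0.7, 0.8))
print(adjust(0.7, 0.0))
print(adjust(0.7, 0.3))
```

Exact string comparison is a blunt measure of variance. For long outputs, a similarity score or a rubric-based grader is likely a better fit.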
This takes 20 minutes and saves weeks of "why is the model acting weird?" debugging later. Every production prompt deserves it.
Never tune temperature during a live incident. If your production output quality degrades, check for model version changes, prompt regressions, or upstream data issues first. Temperature rarely shifts on its own — if output behavior changed, something else did too.
What is a good temperature setting for ChatGPT or Claude API?
For most general use, 0.7 works well: balanced and conversational. For code, classification, or factual extraction, drop to 0.0 to 0.2. For creative writing, brainstorming, or ideation, go to 0.8 to 1.3 where the API allows it. OpenAI and Google Gemini default to 1.0 and allow up to 2.0; Anthropic Claude defaults to 1.0 and caps at 1.0.
Why does my LLM give different outputs with temperature 0?
Temperature 0 only makes the token sampling step deterministic. GPU floating-point non-associativity, request batching on inference servers, and mixture-of-experts routing all introduce variance that temperature cannot control. For true reproducibility, cache outputs keyed to the full prompt plus parameter set, rather than relying on temperature alone.
Should I use temperature or top_p to control LLM output?
Use temperature alone as your default control. Temperature reshapes the full probability distribution, while top_p filters the candidate set. If you need both diversity and structure, combine a moderate temperature (0.7 to 0.9) with top_p at 0.9 to 0.95. Avoid setting both to extreme values simultaneously — recent models like Claude 4.5 reject that configuration outright.
Why does Anthropic Claude cap temperature at 1.0 when OpenAI allows 2.0?
The two providers use different sampling implementations, and their models have different native logit distributions. Anthropic supports 0 to 1, and the likely reason is that outputs above 1 on their models degrade quickly with little creative benefit. OpenAI's 0 to 2 range gives more headroom, but most practitioners never go above 1.3 because of the same quality drop.
What's the difference between temperature 1.0 in OpenAI and 1.0 in Claude?
Temperature 1.0 means "sample from the native distribution" in both, but each model's native distribution is different because they were trained on different data with different architectures. Temperature 1.0 on Claude tends to feel tighter and more aligned with the training objective; temperature 1.0 on GPT-4 tends to feel more varied. Always recalibrate per model rather than porting settings directly.
