A chat completion is a model response for a given conversation. It represents the core functionality of generating a reply in a chat context.
Create a chat completion
Create a model response for a given chat conversation. This method accepts a sequence of messages (the conversation history) and returns a response generated by the model.
Conversation messages are not stored and need to be sent in each
/chat/completions API call.
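For example, a request can be made with the openai Python client (referenced later in this document). This is a minimal sketch: the base URL and API key shown are placeholders, not values from this reference.

from openai import OpenAI

# Placeholder endpoint and credentials: replace with your provider's values.
client = OpenAI(
    base_url="https://api.example.com/v1",
    api_key="YOUR_API_KEY",
)

# The full conversation is sent on every call, since messages are not stored.
response = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is a chat completion?"},
    ],
    max_completion_tokens=256,
)
print(response.choices[0].message.content)

The client object defined here is reused in the sketches below.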
Path parameters
project_id: The ID of the Project you want to target. If this value is not provided, your default Project will be used.
Specifying this value allows you to limit access through IAM policies, or to allocate consumption and billing to a specific project.
Create a chat completion › Request Body
model: Unique identifier of the model, such as llama-3.3-70b-instruct or mistral-small-3.2-24b-instruct-2506.
Refer to our supported models list or the /models endpoint for available models.
messages: Array of messages representing the conversation history.
max_completion_tokens: Maximum number of output tokens that can be generated for a completion.
Different default maximum values are enforced for each model, to avoid edge cases where tokens are generated indefinitely. These values are not enforced in Managed Inference.
frequency_penalty: Value which influences the likelihood of generating tokens based on their frequency in the existing text. When set to a positive value, it reduces the probability of repeating tokens that have already appeared.
logit_bias: List of token IDs with associated bias integer values ranging from -100 to 100. This parameter adjusts the probability of these tokens being generated during the model's output.
A JSON object must be provided in the following format: {"354": 80, "143": -50}, where 354 and 143 are token IDs from the tokenizer used with this model. Positive values increase the likelihood of a token being generated, while negative values reduce it.
Model qwen3.5-397b-a17b does not support this field.
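As a sketch, reusing the client from the first example (the token IDs are the illustrative ones from this reference and only make sense for the targeted model's tokenizer):

# Boost token 354 and suppress token 143 during generation.
response = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Write a short greeting."}],
    logit_bias={"354": 80, "143": -50},
)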
logprobs: Defines whether to return log probabilities of each output token. This allows you to see the likelihood of each token being generated.
n: Number of chat completion choices to generate for a given input. The value of n multiplies the number of generated tokens, resulting in n separate responses for each input.
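A sketch of requesting several choices at once, reusing the client from the first example:

# n=3 returns three independent completions (and roughly triples
# output token consumption).
response = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Suggest a name for a chess club."}],
    n=3,
)
for i, choice in enumerate(response.choices):
    print(i, choice.message.content)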
parallel_tool_calls: Defines whether the model can call multiple tools. Currently, even if set to false, this parameter is ignored and acts as if set to true. Only specific models can call multiple tools in a single response.
Default value: true
presence_penalty: Value which influences the probability of generating tokens that have already appeared in the text. Positive values reduce the likelihood of repeating a token, regardless of how many times it has already appeared.
reasoning_effort: Reasoning effort level to generate the response. minimal is currently not supported.
For the qwen3.5-397b-a17b model: the none value is supported; low and high values are similar to medium.
For the gpt-oss-120b model: the none value is not supported.
response_format: Output format specification.
Using { "type": "json_schema", "json_schema": {...} } enables the model to output only valid JSON following the provided schema specification.
Deprecated: { "type": "json_object" } enables JSON mode, which should no longer be used.
See How to use structured outputs for code snippets using the openai Python client, and the JSON Schema reference for documentation about the format.
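A sketch of the json_schema format, reusing the client from the first example; the schema itself is an illustrative assumption:

# Constrain the output to a JSON object with a single "city" string field.
response = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Extract the city from: I live in Paris."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "city_extraction",
            "schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    },
)
print(response.choices[0].message.content)  # e.g. {"city": "Paris"}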
seed: Value which controls the randomness of the output to ensure determinism. When using the same seed value along with identical input and parameters, you should receive the same model response each time. This holds true even when temperature is set above 0.
Note that fully deterministic output is not guaranteed over long periods of time (such as several months), as the inference model may be updated and optimized.
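A sketch of a reproducible request, reusing the client from the first example:

# Identical parameters plus a fixed seed should yield identical outputs,
# even with temperature above 0 (subject to the caveat above).
params = dict(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Pick a random animal."}],
    temperature=0.7,
    seed=42,
)
first = client.chat.completions.create(**params)
second = client.chat.completions.create(**params)
print(first.choices[0].message.content == second.choices[0].message.content)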
stop: String, or array of strings, that when encountered in the generated text will stop the model from generating further output tokens. The generated text will not return any of the specified stop sequences. A maximum of 4 sequences can be provided.
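For example, reusing the client from the first example:

# Generation halts before emitting either sequence; neither appears
# in the returned text.
response = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "List three colors."}],
    stop=["\n\n", "END"],  # up to 4 sequences
)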
stream: Defines whether the model's response can be streamed to the client using server-sent events.
The response will be streamed in chunks over HTTP, where each chunk except the last contains the following content:
data: {"id": ..., "model": ..., "choices":...}
The last chunk will contain data: [DONE].
Note that the object {"id": ..., "model": ..., "choices":...} follows the same format as
a non-stream HTTP request.
See How to query language models using streaming for examples, and server-sent events for reference documentation about the SSE format.
Default value: false
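A sketch of consuming the stream with the openai Python client, which parses the data: chunks for you; it reuses the client from the first example:

stream = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Tell me a short story."}],
    stream=True,
)
for chunk in stream:
    # Each chunk mirrors the non-stream object; deltas carry the new tokens.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)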
stream_options: An object containing parameters that modify the behavior of stream responses. Can only be used if stream is set to true.
temperature: Value between 0 and 2 which increases randomness in token generation (e.g. encourages content "creativity" instead of "predictability").
temperature:0 means the distribution learned by the model will be used directly, favoring a subset of the most probable tokens at each generation step.
temperature>0 means randomness is added to the learned distribution, so that tokens with a lower probability can also be generated.
temperature>=1 means added randomness will be so high, that almost all tokens are equally probable, leading the model to potentially mix languages.
The ideal temperature value depends on the use case and model. We recommend setting temperature to the recommended value for each model,
as shown in Console Playground (these values are used by default).
Note that temperature does not affect request reproducibility (only affected by the seed parameter).
With the same seed and temperature, two identical requests to a model will generate the same response.
tools: List of tools the model can call, such as functions. A maximum of 128 tools can be provided. See How to use function calling for code snippets using the openai Python client.
tool_choice: Defines whether a model can call tools, and if so, which ones.
none: model will not call any tools, and only generate a message.
auto: model can choose either to generate a message, or to call one or
multiple tools.
required: model must call one or multiple tools.
Default: none when no tools are present, otherwise auto.
An object can also be provided to specify a tool that the model must call. Object format must be:
{"type": "function", "function": {"name": "function_name_as_provided_in_tools"}}
top_logprobs: Number of most likely tokens to return for each token generated, along with their generation log probability.
Value must be between 0 and 20.
logprobs must be set to true to use this parameter.
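A sketch combining the two parameters, reusing the client from the first example:

response = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Say hello."}],
    logprobs=True,   # required for top_logprobs to take effect
    top_logprobs=5,  # 5 most likely alternatives per generated token
)
token_info = response.choices[0].logprobs.content[0]
print(token_info.token, token_info.logprob, token_info.top_logprobs)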
top_p: Value between 0 and 1 which increases the proportion of token vocabulary considered during generation (0 cannot be used).
top_p:0.9 means the next token will be chosen from the 90% most probable tokens at each generation step.
We recommend setting top_p to the recommended value for each model, as shown in Console Playground (these values are used by default).
max_tokens: Use max_completion_tokens instead. Maximum number of total tokens that can be generated for a completion (input and output).
Create a chat completion › Responses
id: UUID of the response.
object: Type of response object, always set to chat.completion.
created: Timestamp when the response was generated (Unix format, in seconds).
model: Unique identifier of the model.
choices: List of chat completion variations. Defaults to only 1 choice, but can be increased by setting a value for n in the request.
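A sketch of reading these fields from a response returned by the openai Python client, as in the first example:

print(response.id)            # UUID of the response
print(response.object)        # "chat.completion"
print(response.created)       # Unix timestamp, in seconds
print(response.model)
print(len(response.choices))  # 1 unless n was set higher in the request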