Fixing common issues with Generative APIs
Below are common issues that you may encounter when using Generative APIs, their causes, and recommended solutions.
400: Bad Request - You exceeded maximum context window for this model
Cause
- You provided an input exceeding the maximum context window (also known as context length) for the model you are using.
- You provided a long input and requested a long output (in the `max_completion_tokens` field) which, added together, exceed the maximum context window of the model you are using.
Solution
- Reduce your input size below what is supported by the model.
- Use a model supporting longer context window values.
- Use Managed Inference, where the context window can be increased for several configurations with additional GPU vRAM. For instance, the `llama-3.3-70b-instruct` model in `fp8` quantization can be served with:
  - a 15k-token context window on H100 Instances
  - a 128k-token context window on H100-2 Instances
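To check whether a request will fit before sending it, you can estimate the input token count locally. The sketch below uses the `tiktoken` library with the `cl100k_base` encoding as an approximation only; each model uses its own tokenizer, and the context window and completion values shown are examples to adjust to the model you use.

```python
# Rough token estimation before calling the API (sketch, assumes tiktoken is installed).
# The cl100k_base encoding is an approximation: each model uses its own tokenizer,
# so real token counts may differ slightly.
import tiktoken

MAX_CONTEXT_WINDOW = 128000   # example value: adjust to the model you use
MAX_COMPLETION_TOKENS = 4096  # value you plan to send in max_completion_tokens

encoding = tiktoken.get_encoding("cl100k_base")
prompt = "Summarize the following document: ..."

input_tokens = len(encoding.encode(prompt))
if input_tokens + MAX_COMPLETION_TOKENS > MAX_CONTEXT_WINDOW:
    print(f"Input ({input_tokens} tokens) plus requested completion exceeds the context window.")
    # Reduce the input (e.g. split or summarize it) or lower max_completion_tokens.
```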
400: Bad Request - Invalid JSON body
Cause
- You provided a JSON body that does not follow standard JSON syntax, so your request cannot be parsed.
Solution
- Verify that the JSON body you send is valid:
  - You can store your content in a file with the `.json` extension (e.g. named `file.json`) and open it with an IDE such as VSCode or Zed, which will display any syntax errors.
  - You can copy your content into an online JSON formatter or linter, which will identify errors.
- The most common errors include:
  - Missing or unnecessary quotes (`"`, `'`) or commas (`,`) around property names and string values.
  - Special characters that are not escaped, such as line breaks (`\n`) or backslashes (`\\`).
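A quick programmatic check can also catch syntax errors before the request is sent; a minimal sketch using Python's standard library:

```python
# Minimal JSON validation sketch: json.loads raises JSONDecodeError
# with the line and column of the first syntax error it finds.
import json

body = '{"model": "llama-3.1-8b-instruct", "messages": [{"role": "user", "content": "Hello"}]}'

try:
    json.loads(body)
    print("JSON body is valid")
except json.JSONDecodeError as err:
    print(f"Invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}")
```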
403: Forbidden - Insufficient permissions to access the resource
Cause
- You are not providing valid credentials
- You do not have the required IAM permissions sets
- You are not connecting to the right endpoint URL
Solution
- Ensure you provide an API secret key in your API request or third-party library configuration.
- Ensure the IAM user or application you are connecting with has the right IAM permission sets (either `GenerativeApisFullAccess` or a narrower one).
- If you only have access to a specific Project, ensure you are connecting to this Project's URL:
  - The URL format is `https://api.scaleway.ai/{project_id}/v1`.
  - If no `project_id` is specified in the URL (`https://api.scaleway.ai/v1`), your `default` Project is used.
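For reference, when using the OpenAI Python client (as in the examples later in this page), both the secret key and the Project-specific base URL are set when the client is created; `{project_id}` below is a placeholder for your own Project ID.

```python
# Sketch: point the OpenAI client at your Project-specific endpoint.
# SCW_SECRET_KEY must be an API secret key from the same Project,
# and {project_id} is a placeholder for your own Project ID.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.scaleway.ai/{project_id}/v1",  # or https://api.scaleway.ai/v1 for the default Project
    api_key=os.getenv("SCW_SECRET_KEY"),
)
```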
416: Range Not Satisfiable - max_completion_tokens is limited for this model
Cause
- You provided a value for `max_completion_tokens` that is too high and not supported by the model you are using.
Solution
- Remove the `max_completion_tokens` field from your request or client library, or reduce its value below what is supported by the model.
  - As an example, when using `init_chat_model` from Langchain, you should edit the `max_tokens` value in the following configuration: `llm = init_chat_model("llama-3.3-70b-instruct", max_tokens="8000", model_provider="openai", base_url="https://api.scaleway.ai/v1", temperature=0.7)`
- Use a model supporting a higher `max_completion_tokens` value.
- Use Managed Inference, where these limits on completion tokens do not apply (the number of completion tokens will still be limited by the maximum context window supported by the model).
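Outside Langchain, the completion-token limit can also be set directly on the request; a minimal sketch with the OpenAI Python client, where the value used is an example to adjust to your model:

```python
# Sketch: cap the completion length directly in the request.
# The value of 4096 is an example; reduce it below the maximum supported
# by the model you are using (or remove the parameter entirely).
import os
from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key=os.getenv("SCW_SECRET_KEY"))

response = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Write a short summary of the water cycle."}],
    max_tokens=4096,  # completion-token limit; keep below the model's supported maximum
)
print(response.choices[0].message.content)
```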
429: Too Many Requests - You exceeded your current quota of requests/tokens per minute
Cause
- You performed too many API requests within a given minute
- You consumed too many tokens (input and output) with your API requests over a given minute
Solution
- Smooth out your API request rate by limiting the number of API requests you perform over a given minute, so that you remain below your Organization quotas for Generative APIs.
- Add a payment method and validate your identity to automatically increase your quotas based on standard limits.
- Reduce the number of input and output tokens processed by your API requests.
- Use Managed Inference, where these quotas do not apply (your throughput will only be limited by the number of Inference Deployments you provision).
- Contact your assigned Scaleway account manager or our Sales team to discuss volume commitments for specific models, which will enable us to increase your quotas proportionally.
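One way to smooth out the request rate is to retry with exponential backoff when a 429 response is returned; a minimal sketch (retry counts and delays are arbitrary examples to tune against your quotas):

```python
# Sketch: retry a chat completion with exponential backoff on 429 responses.
# Delays and retry counts are arbitrary examples; tune them to your quotas.
import os
import time
from openai import OpenAI, RateLimitError

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key=os.getenv("SCW_SECRET_KEY"))

def chat_with_backoff(messages, retries=5):
    for attempt in range(retries):
        try:
            return client.chat.completions.create(
                model="llama-3.1-8b-instruct",
                messages=messages,
            )
        except RateLimitError:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s, 16s
    raise RuntimeError("Still rate limited after retries")
```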
429: Too Many Requests - You exceeded your current threshold of concurrent requests
Cause
- You kept too many API requests open at the same time (number of HTTP sessions open in parallel).
Solution
- Smooth out your API request rate by limiting the number of API requests you perform at the same time (i.e. requests that have not yet received a complete response and are still open), so that you remain below your Organization quotas for Generative APIs.
- Use Managed Inference, where concurrent request limits do not apply. Note that exceeding the number of concurrent requests your Inference Deployment can handle may impact performance metrics.
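To cap the number of requests open at the same time, you can gate calls behind a semaphore; a minimal sketch using the async OpenAI client, where the limit of 5 concurrent requests is an arbitrary example:

```python
# Sketch: cap concurrent in-flight requests with an asyncio.Semaphore.
# The limit of 5 is an arbitrary example; align it with your Organization quota.
import asyncio
import os
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="https://api.scaleway.ai/v1", api_key=os.getenv("SCW_SECRET_KEY"))
semaphore = asyncio.Semaphore(5)

async def ask(prompt: str):
    async with semaphore:  # at most 5 requests are open at any time
        response = await client.chat.completions.create(
            model="llama-3.1-8b-instruct",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

async def main():
    prompts = [f"Question {i}" for i in range(20)]
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    print(answers)

asyncio.run(main())
```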
504: Gateway Timeout
Cause
- The query takes too long to process (even if the context length stays within the supported context window and maximum token limits)
- The model goes into an infinite loop while processing the input (which is a known structural issue with several AI models)
Solution
For queries that take too long to process:
- Set a stricter maximum token limit to prevent overly long responses.
- Reduce the size of the input tokens, or split the input into multiple API requests.
- Use Managed Inference, where no query timeout is enforced.
For queries where the model enters an infinite loop (more frequent when using structured output):
- Set `temperature` to the default value recommended for the model. These values can be found in the Generative APIs Playground when selecting the model. Avoid using temperature `0`, as this can lock the model into outputting only the next (and same) most probable token repeatedly.
- Ensure the `top_p` parameter is not set too low (we recommend the default value of `1`).
- Add a `presence_penalty` value to your request (`0.5` is a good starting value). This option helps the model choose tokens different from the one it is looping on, although it might impact accuracy for some tasks requiring multiple similar outputs.
- Use more recent models, which are usually better optimized to avoid loops, especially when using structured output.
- Optimize the system prompt to provide clearer and simpler tasks. Currently, JSON output accuracy still relies on heuristics to constrain models to output only valid JSON tokens, and thus depends on the prompts given. As a counter-example, providing contradictory requirements to a model - such as `Never output JSON` in the system prompt and `response_format` set to `json_schema` in the query - may lead to the model never outputting closing JSON brackets `}`.
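Put together in a single request, these parameters look like the sketch below; the values shown are the indicative ones mentioned above, and the recommended defaults vary per model.

```python
# Sketch: parameters that help prevent infinite loops in generation.
# Values shown are indicative; check the recommended defaults for your model
# in the Generative APIs Playground.
import os
from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key=os.getenv("SCW_SECRET_KEY"))

response = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "List three facts about the Atlantic Ocean."}],
    temperature=0.6,        # model's recommended default; avoid 0
    top_p=1,                # avoid overly low values
    presence_penalty=0.5,   # discourages repeating the token the model is looping on
    max_tokens=512,         # stricter limit prevents overly long responses
)
print(response.choices[0].message.content)
```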
Structured output (e.g., JSON) is not working correctly
Description
- The structured output response contains invalid JSON.
- The structured output response is valid JSON, but its content is less relevant.
- The structured output response never ends (looping over characters such as `"`, `\t` or `\n`). For this issue, see the advice on infinite loops in 504 Gateway Timeout.
Causes
- Incorrect field naming in the request, such as using `"format"` instead of the correct `"response_format"` field.
- Lack of a JSON schema, which can lead to ambiguity in the output structure.
- The maximum number of tokens is lower than what the model needs to complete its response.
- The temperature is not set, or is set too high.
Solution
- Ensure the proper field `"response_format"` is used in the query.
- Provide a JSON schema in the request to guide the model's structured output.
- Ensure the `max_tokens` value is higher than the response's `completion_tokens` value. If this is not the case, the model answer may be cut off before it can finish the proper JSON structure (and lack closing JSON brackets `}`, for example). Note that if the `max_tokens` value is not set in the query, default values apply for each model.
- Ensure the `temperature` value is set in the lower range for the model. If this is not the case, the model answer may contain invalid JSON characters. Note that if the `temperature` value is not set in the query, default values apply for each model. As examples:
  - for `llama-3.3-70b-instruct`, `temperature` should be set lower than `0.6`
  - for `mistral-nemo-instruct-2407`, `temperature` should be set lower than `0.3`
- Refer to the documentation on structured outputs for examples and additional guidance.
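A request combining these recommendations might look like the following sketch; the schema and parameter values are illustrative only, and the structured outputs documentation remains the reference.

```python
# Sketch: structured output request with an explicit JSON schema,
# a low temperature, and enough max_tokens for the full JSON answer.
# The schema below is purely illustrative.
import os
from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key=os.getenv("SCW_SECRET_KEY"))

response = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Extract the city and country from: 'Marseille is in France.'"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "location",
            "schema": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "country": {"type": "string"},
                },
                "required": ["city", "country"],
            },
        },
    },
    temperature=0.2,   # below the recommended threshold for llama-3.3-70b-instruct
    max_tokens=256,    # leave room for the complete JSON structure
)
print(response.choices[0].message.content)
```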
Multiple "role": "user" successive messages
Cause
- Successive messages with `"role": "user"` are sent in the API request, instead of alternating between `"role": "user"` and `"role": "assistant"`.
Solution
- Ensure the `"messages"` array alternates between `"role": "user"` and `"role": "assistant"`.
- If multiple `"role": "user"` messages need to be sent, concatenate them into one `"role": "user"` message, or intersperse them with appropriate `"role": "assistant"` responses.
Example error message (for Mistral models)
```json
{
  "object": "error",
  "message": "After the optional system message, conversation roles must alternate user/assistant/user/assistant/...",
  "type": "BadRequestError",
  "param": null,
  "code": 400
}
```
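If your application collects several user messages before calling the API, one option is to merge consecutive messages sharing the same role before sending them; a minimal sketch:

```python
# Sketch: merge consecutive messages that share the same role so the
# conversation alternates user/assistant as the API expects.
def merge_consecutive_roles(messages):
    merged = []
    for message in messages:
        if merged and merged[-1]["role"] == message["role"]:
            merged[-1]["content"] += "\n" + message["content"]
        else:
            merged.append(dict(message))
    return merged

messages = [
    {"role": "user", "content": "Here is my first question."},
    {"role": "user", "content": "And a follow-up detail."},
    {"role": "assistant", "content": "Here is my answer."},
    {"role": "user", "content": "Thanks, one more question."},
]
# The result alternates roles and can be sent as the "messages" field of the request.
print(merge_consecutive_roles(messages))
```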
Tokens consumption is not displayed in Cockpit metrics
Causes
- Cockpit is isolated by `project_id` and only displays token consumption related to one Project.
- Cockpit `Tokens Processed` graphs over time can take up to an hour to update (to provide more accurate average consumption over time). The overall `Tokens Processed` counter is updated in real time.
Solution
- Ensure you are connecting to the Cockpit corresponding to your Project. Cockpits are currently isolated by `project_id`, which you can see in their URL: `https://PROJECT_ID.dashboard.obs.fr-par.scw.cloud/`. This Project should correspond to the one used in the URL you used to perform Generative APIs requests, such as `https://api.scaleway.ai/{PROJECT_ID}/v1/chat/completions`. You can list your Projects and their IDs in your Organization dashboard.
Example error behavior
- When displaying the wrong Cockpit for the Project:
- Counter for Tokens Processed or API Requests should display a value of 0
- Graph across time should be empty
- When displaying the Cockpit of a specific Project, but waiting for average token consumption to display:
- Counter for Tokens Processed or API Requests should display a correct value (different from 0)
- Graph across time should be empty
Embedding vectors cannot be stored in a database or used with a third-party library
Cause
The embedding model you are using generates vector representations with a fixed number of dimensions, which is too high for your database or third-party library.
- For example, the embedding model `bge-multilingual-gemma2` generates vector representations with `3584` dimensions. However, when storing vectors with the PostgreSQL `pgvector` extension, indexes (in `hnsw` or `ivfflat` formats) only support up to `2000` dimensions.
Solution
- Use a vector store supporting higher numbers of dimensions, such as Qdrant.
- Do not use indexes for vectors, or disable them in your third-party library. This may limit performance in vector similarity search for significant volumes.
  - When using the Langchain PGVector method, no index is created by default, so no error should be raised.
  - When using the Mastra library with `vectorStoreName: "pgvector"`, specify the `indexConfig` type as `flat` to avoid creating any index on vector dimensions:
```js
await vectorStore.createIndex({ indexName: 'papers', dimension: 3584, indexConfig: {"type":"flat"}, });
```
- Use a model with a lower number of dimensions. Using Managed Inference, you can for instance deploy the `sentence-t5-xxl` model, which represents vectors with `768` dimensions.
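For example, with PostgreSQL and pgvector you can store 3584-dimension vectors as long as no `hnsw` or `ivfflat` index is created on the column; a minimal sketch using the `psycopg` driver (connection details are placeholders, and similarity searches will fall back to sequential scans):

```python
# Sketch: store high-dimension embeddings in pgvector without an index.
# Similarity search still works (sequential scan) but may be slower on large tables.
# Connection parameters are placeholders.
import psycopg

with psycopg.connect("postgresql://user:password@localhost:5432/mydb") as conn:
    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
        # 3584 dimensions matches bge-multilingual-gemma2; no index is created,
        # since hnsw/ivfflat indexes only support up to 2000 dimensions.
        cur.execute(
            "CREATE TABLE IF NOT EXISTS documents ("
            "id bigserial PRIMARY KEY, embedding vector(3584));"
        )
    conn.commit()
```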
Previous messages are not taken into account by the model
Causes
- Previous messages are not sent to the model
- The content sent exceeds the maximum context window for this model
Solution
- LLMs are completely "stateless" and thus do not store previous messages or conversations. For example, when building a chatbot application, each time the user sends a new message, all preceding messages in the conversation need to be sent in the API payload. Example payload for a multi-turn conversation:
```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.scaleway.ai/v1",
    api_key=os.getenv("SCW_SECRET_KEY")
)
response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[
        {
            "role": "user",
            "content": "What is the solution to 1+1= ?"
        },
        {
            "role": "assistant",
            "content": "2"
        },
        {
            "role": "user",
            "content": "Double this number"
        }
    ]
)
print(response.choices[0].message.content)
```
This snippet will output the model response, which is `4`.
- When exceeding the maximum context window, you should receive a `400 - BadRequestError` detailing the context length value you exceeded. In this case, reduce the size of the content you send to the API.
Best practices for optimizing model performance
Input size management
- Avoid overly long input sequences; break them into smaller chunks if needed.
- Use summarization techniques for large inputs to reduce token count while maintaining relevance.
Use proper parameter configuration
- Double-check parameters like `"temperature"`, `"max_tokens"`, and `"top_p"` to ensure they align with your use case.
- For structured output, always include a `"response_format"` and, if possible, a detailed JSON schema.
Debugging silent errors
- For cases where no explicit error is returned:
- Verify all fields in the API request are correctly named and formatted.
- Test the request with smaller and simpler inputs to isolate potential issues.
Still need help? Create a support ticket.