Fixing common issues with Generative APIs
Below are common issues that you may encounter when using Generative APIs, their causes, and recommended solutions.
400: Bad Request - You exceeded maximum context window for this model
Cause
- You provided an input exceeding the maximum context window (also known as context length) for the model you are using.
- You provided a long input and requested a long output (in the `max_completion_tokens` field) which, added together, exceed the maximum context window of the model you are using.
Solution
- Reduce your input size below what is supported by the model.
- If you are using a third-party tool such as an IDE, edit its configuration to set an appropriate maximum context window for the model. More information for VS Code (Continue), IntelliJ (Continue) and Zed.
- Use a model supporting longer context window values.
- Use Managed Inference, where the context window can be increased for several configurations with additional GPU vRAM. For instance, the `llama-3.3-70b-instruct` model in `fp8` quantization can be served with:
  - a `15k` tokens context window on `H100` Instances
  - a `128k` tokens context window on `H100-2` Instances
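As a quick pre-flight check, you can roughly estimate your token usage before sending a request. The sketch below uses a simple characters-to-tokens heuristic (roughly 4 characters per token); the `CONTEXT_WINDOW` and `MAX_COMPLETION_TOKENS` values are placeholders to replace with the actual limits of the model you use:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token on average.
    # Use the model's actual tokenizer if you need exact counts.
    return len(text) // 4

CONTEXT_WINDOW = 128_000       # placeholder: check the actual context window of your model
MAX_COMPLETION_TOKENS = 4_096  # the value you plan to send in max_completion_tokens

prompt = open("large_input.txt").read()
if estimate_tokens(prompt) + MAX_COMPLETION_TOKENS > CONTEXT_WINDOW:
    print("Input likely too large: truncate it, split it, or use a model with a larger context window.")
```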
400: Bad Request - Invalid JSON body
Cause
- You provided a JSON body that does not follow the standard JSON structure, therefore your request cannot be analyzed.
Solution
- Verify the JSON body you send is valid:
  - You can store your content in a file with the `.json` extension (e.g. named `file.json`) and open it with an IDE such as VSCode or Zed. Syntax errors will be displayed if there are any.
  - You can copy your content into a JSON formatter or linter tool available online, which will identify errors.
- The most common errors include:
  - Missing or unnecessary quotes (`"`, `'`) or commas (`,`) on property names and string values.
  - Special characters that are not escaped, such as line breaks (`\n`) or backslashes (`\\`).
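You can also validate the body programmatically before sending it. This minimal sketch uses Python's standard `json` module, which reports the line and column of the first syntax error it finds:

```python
import json

body = '{"model": "llama-3.1-8b-instruct", "messages": [{"role": "user", "content": "Hello"}]}'

try:
    json.loads(body)
    print("JSON body is valid")
except json.JSONDecodeError as e:
    print(f"Invalid JSON at line {e.lineno}, column {e.colno}: {e.msg}")
```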
403: Forbidden - Insufficient permissions to access the resource
Cause
- You are not providing valid credentials
- You do not have the required IAM permission set
- You are not connecting to the right endpoint URL
Solution
- Ensure you provide an API secret key in your API request or third-party library configuration
- Ensure the IAM user or application you are connecting with has the right IAM permission set (either `GenerativeApisFullAccess` or a narrower one)
- If you have access only to a specific Project, ensure you are connecting to this Project's URL:
  - The URL format is: `https://api.scaleway.ai/{project_id}/v1`
  - If no `project_id` is specified in the URL (`https://api.scaleway.ai/v1`), your `default` Project will be used.
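For reference, a minimal client configuration using the OpenAI Python SDK could look like the sketch below. The `SCW_PROJECT_ID` and `SCW_SECRET_KEY` environment variable names are only examples; use whatever mechanism you prefer to supply your Project ID and IAM API secret key:

```python
import os

from openai import OpenAI

client = OpenAI(
    # Project-scoped endpoint; omit the Project ID segment to target your default Project.
    base_url=f"https://api.scaleway.ai/{os.getenv('SCW_PROJECT_ID')}/v1",
    # IAM API secret key attached to a principal with GenerativeApisFullAccess (or a narrower permission set).
    api_key=os.getenv("SCW_SECRET_KEY"),
)
```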
416: Range Not Satisfiable - max_completion_tokens is limited for this model
Cause
- You provided a value for `max_completion_tokens` which is too high and not supported by the model you are using.
Solution
- Remove the `max_completion_tokens` field from your request or client library, or reduce its value below what is supported by the model.
  - As an example, when using `init_chat_model` from Langchain, you should edit the `max_tokens` value in the following configuration:

    ```python
    from langchain.chat_models import init_chat_model

    llm = init_chat_model("llama-3.3-70b-instruct", max_tokens=8000, model_provider="openai", base_url="https://api.scaleway.ai/v1", temperature=0.7)
    ```

- Use a model supporting a higher `max_completion_tokens` value.
- Use Managed Inference, where these limits on completion tokens do not apply (your completion token amount will still be limited by the maximum context window supported by the model).
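When calling the API directly through the OpenAI Python SDK, the equivalent adjustment is made on the request itself. This is a minimal sketch; depending on your SDK version, the parameter may be exposed as `max_tokens` instead of `max_completion_tokens`:

```python
import os

from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key=os.getenv("SCW_SECRET_KEY"))

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Write a short poem about the sea."}],
    max_completion_tokens=512,  # keep this below the limit supported by the model
)
print(response.choices[0].message.content)
```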
429: Too Many Requests - You exceeded your current quota of requests/tokens per minute
Cause
- You performed too many API requests within a given minute
- You consumed too many tokens (input and output) with your API requests over a given minute
Solution
- Smooth out your API request rate by limiting the number of API requests you perform over a given minute, so that you remain below your Organization quotas for Generative APIs.
- Add a payment method and validate your identity to automatically increase your quotas based on standard limits.
- Reduce the size of the input or output tokens processed by your API requests.
- Use Managed Inference, where these quotas do not apply (your throughput will only be limited by the number of Inference Deployments you provision)
- Contact your assigned Scaleway account manager or our Sales team to discuss volume commitments for specific models, which will enable us to increase your quota proportionally.
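One common way to smooth out your request rate is to retry with exponential backoff when a 429 is returned. The sketch below uses the OpenAI Python SDK's `RateLimitError`; the model name and retry parameters are illustrative only:

```python
import os
import time

from openai import OpenAI, RateLimitError

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key=os.getenv("SCW_SECRET_KEY"))

def chat_with_backoff(messages, max_retries=5):
    # Retry on 429 responses, doubling the wait time after each failed attempt.
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="llama-3.1-8b-instruct",
                messages=messages,
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(delay)
            delay *= 2

response = chat_with_backoff([{"role": "user", "content": "Hello"}])
print(response.choices[0].message.content)
```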
429: Too Many Requests - You exceeded your current threshold of concurrent requests
Cause
- You kept too many API requests open at the same time (number of HTTP sessions opened in parallel)
Solution
- Smooth out your API request rate by limiting the number of API requests you perform at the same time (e.g. requests which did not receive a complete response and are still open) so that you remain below your Organization quotas for Generative APIs.
- Use Managed Inference, where concurrent request limits do not apply. Note that exceeding the number of concurrent requests your Inference deployment can handle may impact performance metrics.
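A simple way to cap the number of requests open at the same time is a semaphore around your API calls. This is a minimal asyncio sketch assuming the OpenAI Python SDK; the concurrency value of 5 is only an example and should match your Organization quota:

```python
import asyncio
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="https://api.scaleway.ai/v1", api_key=os.getenv("SCW_SECRET_KEY"))
semaphore = asyncio.Semaphore(5)  # at most 5 requests in flight at once

async def ask(prompt: str) -> str:
    # The semaphore blocks new requests until a previous one has completed.
    async with semaphore:
        response = await client.chat.completions.create(
            model="llama-3.1-8b-instruct",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

async def main():
    prompts = [f"Summarize item {i}" for i in range(20)]
    results = await asyncio.gather(*(ask(p) for p in prompts))
    print(results[0])

asyncio.run(main())
```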
504: Gateway Timeout
Cause
- The query takes too long to process (even if the total context length stays within the supported context window and maximum token limits)
- The model goes into an infinite loop while processing the input (which is a known structural issue with several AI models)
Solution
For queries that are too long to process:
- Set a stricter maximum token limit to prevent overly long responses.
- Reduce the size of the input tokens, or split the input into multiple API requests.
- Use Managed Inference, where no query timeout is enforced.
For queries where the model enters an infinite loop (more frequent when using structured output):
- Set `temperature` to the default value recommended for the model. These values can be found in the Generative APIs Playground when selecting the model. Avoid using a `temperature` of `0`, as this can lock the model into outputting only the next (and same) most probable token repeatedly.
- Ensure the `top_p` parameter is not set too low (we recommend the default value of `1`).
- Add a `presence_penalty` value in your request (`0.5` is a good starting value). This option will help the model choose different tokens than the one it is looping on, although it might impact accuracy for tasks requiring multiple similar outputs.
- Use more recent models, which are usually better optimized to avoid loops, especially when using structured output.
- Optimize the system prompt to provide clearer and simpler tasks. Currently, JSON output accuracy still relies on heuristics to constrain models to output only valid JSON tokens, and thus depends on the prompts given. As a counter-example, providing contradictory requirements to a model - such as `Never output JSON` in the system prompt and `response_format` set to `json_schema` in the query - may lead to the model never outputting the closing JSON bracket `}`.
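Put together, a request tuned to avoid loops could look like the sketch below. The exact values are illustrative; use the defaults recommended for your model in the Generative APIs Playground:

```python
import os

from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key=os.getenv("SCW_SECRET_KEY"))

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "List three facts about the Atlantic Ocean."}],
    temperature=0.7,       # model default rather than 0
    top_p=1,               # avoid setting this too low
    presence_penalty=0.5,  # discourages repeating the same token endlessly
    max_tokens=512,        # caps response length in case a loop still occurs
)
print(response.choices[0].message.content)
```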
Structured output (e.g., JSON) is not working correctly
Description
- Structured output response contains invalid JSON
- Structured output response is valid JSON but content is less relevant
- Structured output response never ends (looping over characters such as `"`, `\t` or `\n`). For this issue, see the advice on infinite loops in 504 Gateway Timeout.
Causes
- Incorrect field naming in the request, such as using `"format"` instead of the correct `"response_format"` field.
- Lack of a JSON schema, which can lead to ambiguity in the output structure.
- The maximum token limit is lower than what the model needs to complete its response.
- Temperature is not set or set too high.
Solution
- Ensure the proper `"response_format"` field is used in the query.
- Provide a JSON schema in the request to guide the model's structured output.
- Ensure the `max_tokens` value is higher than the response `completion_tokens` value. If this is not the case, the model answer may be cut off before it can finish the proper JSON structure (and lack the closing JSON bracket `}`, for example). Note that if the `max_tokens` value is not set in the query, default values apply for each model.
- Ensure the `temperature` value is set within the lower range recommended for the model. If this is not the case, the model may output invalid JSON characters. Note that if the `temperature` value is not set in the query, default values apply for each model. As examples:
  - for `llama-3.3-70b-instruct`, `temperature` should be set lower than `0.6`
  - for `mistral-nemo-instruct-2407`, `temperature` should be set lower than `0.3`
- Refer to the documentation on structured outputs for examples and additional guidance.
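As an illustration, a request combining a JSON schema, a bounded `max_tokens` and a low `temperature` could look like the sketch below. The schema and model name are only examples:

```python
import json
import os

from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key=os.getenv("SCW_SECRET_KEY"))

schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["city", "population"],
}

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Give the largest French city and its population."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "city_info", "schema": schema},
    },
    temperature=0.2,  # keep temperature low for structured output
    max_tokens=200,   # leave enough room to close the JSON object
)
print(json.loads(response.choices[0].message.content))
```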
Multiple successive "role": "user" messages
Cause
- Successive messages with `"role": "user"` are sent in the API request instead of alternating between `"role": "user"` and `"role": "assistant"`.
Solution
- Ensure the `"messages"` array alternates between `"role": "user"` and `"role": "assistant"`.
- If multiple `"role": "user"` messages need to be sent, concatenate them into one `"role": "user"` message or intersperse them with appropriate `"role": "assistant"` responses.
Example error message (for Mistral models)
```json
{
  "object": "error",
  "message": "After the optional system message, conversation roles must alternate user/assistant/user/assistant/...",
  "type": "BadRequestError",
  "param": null,
  "code": 400
}
```

Token consumption is not displayed in Cockpit metrics
Causes
- Cockpit is isolated by `project_id` and only displays token consumption related to one Project.
- Cockpit `Tokens Processed` graphs over time can take up to an hour to update (to provide more accurate average consumption over time). The overall `Tokens Processed` counter is updated in real time.
Solution
- Ensure you are connecting to the Cockpit corresponding to your Project. Cockpits are currently isolated by `project_id`, which you can see in their URL: `https://PROJECT_ID.dashboard.obs.fr-par.scw.cloud/`. This Project should correspond to the one used in the URL you used to perform Generative APIs requests, such as `https://api.scaleway.ai/{PROJECT_ID}/v1/chat/completions`. You can list your Projects and their IDs in your Organization dashboard.
Example error behavior
- When displaying the wrong Cockpit for the Project:
- Counter for Tokens Processed or API Requests should display a value of 0
- Graph across time should be empty
- When displaying the Cockpit of a specific Project, but waiting for average token consumption to display:
- Counter for Tokens Processed or API Requests should display a correct value (different from 0)
- Graph across time should be empty
Embedding vectors cannot be stored in a database or used with a third-party library
Cause
The embedding model you are using generates vector representations with a fixed number of dimensions, which is too high for your database or third-party library.
- For example, the embedding model `bge-multilingual-gemma2` generates vector representations with `3584` dimensions. However, when storing vectors using the PostgreSQL `pgvector` extension, indexes (in `hnsw` or `ivfflat` formats) only support up to `2000` dimensions.
Solution
- Use a vector store supporting a higher number of dimensions, such as Qdrant.
- Do not use indexes for vectors or disable them from your third-party library. This may limit performance in vector similarity search for significant volumes.
- When using the Langchain `PGVector` method, note that it does not create an index by default and should not raise errors.
- When using the Mastra library with `vectorStoreName: "pgvector"`, specify the `indexConfig` type as `flat` to avoid creating any index on vector dimensions:

  ```ts
  await vectorStore.createIndex({
    indexName: 'papers',
    dimension: 3584,
    indexConfig: { "type": "flat" },
  });
  ```

- Use a model with a lower number of dimensions. Using Managed Inference, you can deploy for instance the `sentence-t5-xxl` model, which represents vectors with `768` dimensions.
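If you are unsure how many dimensions your embeddings have, you can check the length of a returned vector directly. This is a minimal sketch using the OpenAI Python SDK against the Generative APIs embeddings endpoint:

```python
import os

from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key=os.getenv("SCW_SECRET_KEY"))

response = client.embeddings.create(
    model="bge-multilingual-gemma2",
    input="A sample sentence to embed.",
)
vector = response.data[0].embedding
print(len(vector))  # 3584, above the 2000-dimension limit of pgvector hnsw/ivfflat indexes
```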
Previous messages are not taken into account by the model
Causes
- Previous messages are not sent to the model
- The content sent exceeds the maximum context window for this model
Solution
- LLMs are completely "stateless" and thus do not store previous messages or conversations. For example, when building a chatbot application, each time a new message is sent by the user, all preceding messages in the conversation need to be sent through the API payload. Example payload for a multi-turn conversation:
```python
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.scaleway.ai/v1",
    api_key=os.getenv("SCW_SECRET_KEY")
)

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[
        {
            "role": "user",
            "content": "What is the solution to 1+1= ?"
        },
        {
            "role": "assistant",
            "content": "2"
        },
        {
            "role": "user",
            "content": "Double this number"
        }
    ]
)
print(response.choices[0].message.content)
```

This snippet will output the model response, which is 4.
- When exceeding the maximum context window, you should receive a `400 - BadRequestError` detailing the context length value you exceeded. In this case, you should reduce the size of the content you send to the API.
Best practices for optimizing model performance
Input size management
- Avoid overly long input sequences; break them into smaller chunks if needed.
- Use summarization techniques for large inputs to reduce token count while maintaining relevance.
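For instance, a large document can be split on paragraph boundaries and summarized chunk by chunk. This is a rough sketch assuming the OpenAI Python SDK and a simple character-based budget (roughly 4 characters per token); adjust the chunk size to the context window of your model:

```python
import os

from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key=os.getenv("SCW_SECRET_KEY"))

def split_into_chunks(text: str, max_chars: int = 8000) -> list[str]:
    # Split on paragraph boundaries so each chunk stays well below the context window.
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        if current and len(current) + len(paragraph) > max_chars:
            chunks.append(current)
            current = ""
        current += paragraph + "\n\n"
    if current:
        chunks.append(current)
    return chunks

document = open("large_document.txt").read()
summaries = []
for chunk in split_into_chunks(document):
    response = client.chat.completions.create(
        model="llama-3.1-8b-instruct",
        messages=[{"role": "user", "content": f"Summarize the following text:\n\n{chunk}"}],
        max_tokens=300,
    )
    summaries.append(response.choices[0].message.content)
```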
Use proper parameter configuration
- Double-check parameters like `"temperature"`, `"max_tokens"`, and `"top_p"` to ensure they align with your use case.
- For structured output, always include a `"response_format"` and, if possible, a detailed JSON schema.
Debugging silent errors
- For cases where no explicit error is returned:
- Verify all fields in the API request are correctly named and formatted.
- Test the request with smaller and simpler inputs to isolate potential issues.
Still need help? Create a support ticket