Fixing common issues with Generative APIs
Below are common issues that you may encounter when using Generative APIs, their causes, and recommended solutions.
400: Bad Request - You exceeded maximum context window for this model
Cause
- You provided an input exceeding the maximum context window (also known as context length) for the model you are using.
- You provided a long input and requested a long output (in the `max_completion_tokens` field) which, added together, exceed the maximum context window of the model you are using.
Solution
- Reduce your input size below what is supported by the model.
- If you are using a third-party tool such as an IDE, edit its configuration to set an appropriate maximum context window for the model. More information for VS Code (Continue), IntelliJ (Continue) and Zed.
- Use a model supporting longer context window values.
- Use Managed Inference, where the context window can be increased for several configurations with additional GPU vRAM. For instance, the `llama-3.3-70b-instruct` model in `fp8` quantization can be served with:
  - a `15k` tokens context window on `H100` Instances
  - a `128k` tokens context window on `H100-2` Instances
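As a quick pre-flight check, you can roughly estimate your token usage before sending a request. The sketch below uses a simple characters-to-tokens heuristic (roughly 4 characters per token); the `CONTEXT_WINDOW` and `MAX_COMPLETION_TOKENS` values are placeholders to replace with the actual limits of the model you use:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token on average.
    # Use the model's actual tokenizer if you need exact counts.
    return len(text) // 4

CONTEXT_WINDOW = 128_000       # placeholder: check the actual context window of your model
MAX_COMPLETION_TOKENS = 4_096  # the value you plan to send in max_completion_tokens

prompt = open("large_input.txt").read()
if estimate_tokens(prompt) + MAX_COMPLETION_TOKENS > CONTEXT_WINDOW:
    print("Input likely too large: truncate it, split it, or use a model with a larger context window.")
```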
400: Bad Request - Invalid JSON body
Cause
- You provided a JSON body that does not follow the standard JSON structure, therefore your request cannot be analyzed.
Solution
- Verify the JSON body you send is valid:
  - You can store your content in a file with the `.json` extension (e.g. named `file.json`) and open it with an IDE such as VSCode or Zed. Syntax errors will be displayed if there are any.
  - You can copy your content into a JSON formatter or linter tool available online, which will identify errors.
- The most common errors include:
  - Missing or unnecessary quotes (`"`, `'`) or commas (`,`) on property names and string values.
  - Special characters that are not escaped, such as line breaks (`\n`) or backslashes (`\\`).
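You can also validate the body programmatically before sending it. This minimal sketch uses Python's standard `json` module, which reports the line and column of the first syntax error it finds:

```python
import json

body = '{"model": "llama-3.1-8b-instruct", "messages": [{"role": "user", "content": "Hello"}]}'

try:
    json.loads(body)
    print("JSON body is valid")
except json.JSONDecodeError as e:
    print(f"Invalid JSON at line {e.lineno}, column {e.colno}: {e.msg}")
```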
403: Forbidden - Insufficient permissions to access the resource
Cause
- You are not providing valid credentials
- You do not have the required IAM permission set
- You are not connecting to the right endpoint URL
Solution
- Ensure you provide an API secret key in your API request or third-party library configuration
- Ensure the IAM user or application you are connecting with has the right IAM permission set (either `GenerativeApisFullAccess` or a narrower one)
- If you have access only to a specific Project, ensure you are connecting to this Project's URL:
  - The URL format is: `https://api.scaleway.ai/{project_id}/v1`
  - If no `project_id` is specified in the URL (`https://api.scaleway.ai/v1`), your `default` Project will be used.
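For reference, a minimal client configuration using the OpenAI Python SDK could look like the sketch below. The `SCW_PROJECT_ID` and `SCW_SECRET_KEY` environment variable names are only examples; use whatever mechanism you prefer to supply your Project ID and IAM API secret key:

```python
import os

from openai import OpenAI

client = OpenAI(
    # Project-scoped endpoint; omit the Project ID segment to target your default Project.
    base_url=f"https://api.scaleway.ai/{os.getenv('SCW_PROJECT_ID')}/v1",
    # IAM API secret key attached to a principal with GenerativeApisFullAccess (or a narrower permission set).
    api_key=os.getenv("SCW_SECRET_KEY"),
)
```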
416: Range Not Satisfiable - max_completion_tokens is limited for this model
Cause
- You provided a value for `max_completion_tokens` which is too high and not supported by the model you are using.
Solution
- Remove the `max_completion_tokens` field from your request or client library, or reduce its value below what is supported by the model.
  - As an example, when using `init_chat_model` from Langchain, you should edit the `max_tokens` value in the following configuration:

    ```python
    from langchain.chat_models import init_chat_model

    llm = init_chat_model("llama-3.3-70b-instruct", max_tokens=8000, model_provider="openai", base_url="https://api.scaleway.ai/v1", temperature=0.7)
    ```

- Use a model supporting a higher `max_completion_tokens` value.
- Use Managed Inference, where these limits on completion tokens do not apply (your completion token amount will still be limited by the maximum context window supported by the model).
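When calling the API directly through the OpenAI Python SDK, the equivalent adjustment is made on the request itself. This is a minimal sketch; depending on your SDK version, the parameter may be exposed as `max_tokens` instead of `max_completion_tokens`:

```python
import os

from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key=os.getenv("SCW_SECRET_KEY"))

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Write a short poem about the sea."}],
    max_completion_tokens=512,  # keep this below the limit supported by the model
)
print(response.choices[0].message.content)
```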
429: Too Many Requests - You exceeded your current quota of requests/tokens per minute
Cause
- You performed too many API requests within a given minute
- You consumed too many tokens (input and output) with your API requests over a given minute
Solution
- Smooth out your API request rate by limiting the number of API requests you perform over a given minute, so that you remain below your Organization quotas for Generative APIs.
- Add a payment method and validate your identity to automatically increase your quotas based on standard limits.
- Reduce the size of the input or output tokens processed by your API requests.
- Use Managed Inference, where these quotas do not apply (your throughput will only be limited by the number of Inference Deployments you provision)
- Contact your assigned Scaleway account manager or our Sales team to discuss volume commitments for specific models, which will enable us to increase your quota proportionally.
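One common way to smooth out your request rate is to retry with exponential backoff when a 429 is returned. The sketch below uses the OpenAI Python SDK's `RateLimitError`; the model name and retry parameters are illustrative only:

```python
import os
import time

from openai import OpenAI, RateLimitError

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key=os.getenv("SCW_SECRET_KEY"))

def chat_with_backoff(messages, max_retries=5):
    # Retry on 429 responses, doubling the wait time after each failed attempt.
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="llama-3.1-8b-instruct",
                messages=messages,
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(delay)
            delay *= 2

response = chat_with_backoff([{"role": "user", "content": "Hello"}])
print(response.choices[0].message.content)
```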
429: Too Many Requests - You exceeded your current threshold of concurrent requests
Cause
- You kept too many API requests open at the same time (number of HTTP sessions opened in parallel)
Solution
- Smooth out your API request rate by limiting the number of API requests you perform at the same time (e.g. requests which did not receive a complete response and are still open) so that you remain below your Organization quotas for Generative APIs.
- Use Managed Inference, where concurrent request limits do not apply. Note that exceeding the number of concurrent requests your Inference deployment can handle may impact performance metrics.
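A simple way to cap the number of requests open at the same time is a semaphore around your API calls. This is a minimal asyncio sketch assuming the OpenAI Python SDK; the concurrency value of 5 is only an example and should match your Organization quota:

```python
import asyncio
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="https://api.scaleway.ai/v1", api_key=os.getenv("SCW_SECRET_KEY"))
semaphore = asyncio.Semaphore(5)  # at most 5 requests in flight at once

async def ask(prompt: str) -> str:
    # The semaphore blocks new requests until a previous one has completed.
    async with semaphore:
        response = await client.chat.completions.create(
            model="llama-3.1-8b-instruct",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

async def main():
    prompts = [f"Summarize item {i}" for i in range(20)]
    results = await asyncio.gather(*(ask(p) for p in prompts))
    print(results[0])

asyncio.run(main())
```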
504: Gateway Timeout
Cause
- The query takes too long to process (even if the total context length stays within the supported context window and maximum token limits)
- The model goes into an infinite loop while processing the input (which is a known structural issue with several AI models)
Solution
For queries that are too long to process:
- Set a stricter maximum token limit to prevent overly long responses.
- Reduce the size of the input tokens, or split the input into multiple API requests.
- Use Managed Inference, where no query timeout is enforced.
For queries where the model enters an infinite loop (more frequent when using structured output):
- Set `temperature` to the default value recommended for the model. These values can be found in the Generative APIs Playground when selecting the model. Avoid using a `temperature` of `0`, as this can lock the model into outputting only the next (and same) most probable token repeatedly.
- Ensure the `top_p` parameter is not set too low (we recommend the default value of `1`).
- Add a `presence_penalty` value in your request (`0.5` is a good starting value). This option will help the model choose different tokens than the one it is looping on, although it might impact accuracy for tasks requiring multiple similar outputs.
- Use more recent models, which are usually better optimized to avoid loops, especially when using structured output.
- Optimize the system prompt to provide clearer and simpler tasks. Currently, JSON output accuracy still relies on heuristics to constrain models to output only valid JSON tokens, and thus depends on the prompts given. As a counter-example, providing contradictory requirements to a model - such as `Never output JSON` in the system prompt and `response_format` set to `json_schema` in the query - may lead to the model never outputting the closing JSON bracket `}`.
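Put together, a request tuned to avoid loops could look like the sketch below. The exact values are illustrative; use the defaults recommended for your model in the Generative APIs Playground:

```python
import os

from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key=os.getenv("SCW_SECRET_KEY"))

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "List three facts about the Atlantic Ocean."}],
    temperature=0.7,       # model default rather than 0
    top_p=1,               # avoid setting this too low
    presence_penalty=0.5,  # discourages repeating the same token endlessly
    max_tokens=512,        # caps response length in case a loop still occurs
)
print(response.choices[0].message.content)
```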
Structured output (e.g., JSON) is not working correctly
Description
- Structured output response contains invalid JSON
- Structured output response is valid JSON but content is less relevant
- Structured output response never ends (looping over characters such as `"`, `\t` or `\n`). For this issue, see the advice on infinite loops in 504 Gateway Timeout.
Causes
- Incorrect field naming in the request, such as using `"format"` instead of the correct `"response_format"` field.
- Lack of a JSON schema, which can lead to ambiguity in the output structure.
- The maximum token limit is lower than what the model needs to complete its response.
- Temperature is not set or set too high.
Solution
- Ensure the proper `"response_format"` field is used in the query.
- Provide a JSON schema in the request to guide the model's structured output.
- Ensure the `max_tokens` value is higher than the response `completion_tokens` value. If this is not the case, the model answer may be cut off before it can finish the proper JSON structure (and lack the closing JSON bracket `}`, for example). Note that if the `max_tokens` value is not set in the query, default values apply for each model.
- Ensure the `temperature` value is set within the lower range recommended for the model. If this is not the case, the model may output invalid JSON characters. Note that if the `temperature` value is not set in the query, default values apply for each model. As examples:
  - for `llama-3.3-70b-instruct`, `temperature` should be set lower than `0.6`
  - for `mistral-nemo-instruct-2407`, `temperature` should be set lower than `0.3`
- Refer to the documentation on structured outputs for examples and additional guidance.
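As an illustration, a request combining a JSON schema, a bounded `max_tokens` and a low `temperature` could look like the sketch below. The schema and model name are only examples:

```python
import json
import os

from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key=os.getenv("SCW_SECRET_KEY"))

schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["city", "population"],
}

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Give the largest French city and its population."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "city_info", "schema": schema},
    },
    temperature=0.2,  # keep temperature low for structured output
    max_tokens=200,   # leave enough room to close the JSON object
)
print(json.loads(response.choices[0].message.content))
```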
Multiple successive "role": "user" messages
Cause
- Successive messages with `"role": "user"` are sent in the API request instead of alternating between `"role": "user"` and `"role": "assistant"`.
Solution
- Ensure the `"messages"` array alternates between `"role": "user"` and `"role": "assistant"`.
- If multiple `"role": "user"` messages need to be sent, concatenate them into one `"role": "user"` message or intersperse them with appropriate `"role": "assistant"` responses.
Example error message (for Mistral models)
```json
{
  "object": "error",
  "message": "After the optional system message, conversation roles must alternate user/assistant/user/assistant/...",
  "type": "BadRequestError",
  "param": null,
  "code": 400
}
```

Token consumption is not displayed in Cockpit metrics
Causes
- Cockpit is isolated by `project_id` and only displays token consumption related to one Project.
- Cockpit `Tokens Processed` graphs over time can take up to an hour to update (to provide more accurate average consumption over time). The overall `Tokens Processed` counter is updated in real time.
Solution
- Ensure you are connecting to the Cockpit corresponding to your Project. Cockpits are currently isolated by `project_id`, which you can see in their URL: `https://PROJECT_ID.dashboard.obs.fr-par.scw.cloud/`. This Project should correspond to the one used in the URL you used to perform Generative APIs requests, such as `https://api.scaleway.ai/{PROJECT_ID}/v1/chat/completions`. You can list your Projects and their IDs in your Organization dashboard.
Example error behavior
- When displaying the wrong Cockpit for the Project:
- Counter for Tokens Processed or API Requests should display a value of 0
- Graph across time should be empty
- When displaying the Cockpit of a specific Project, but waiting for average token consumption to display:
- Counter for Tokens Processed or API Requests should display a correct value (different from 0)
- Graph across time should be empty
Embedding vectors cannot be stored in a database or used with a third-party library
Cause
The embedding model you are using generates vector representations with a fixed number of dimensions, which is too high for your database or third-party library.
- For example, the embedding model `bge-multilingual-gemma2` generates vector representations with `3584` dimensions. However, when storing vectors using the PostgreSQL `pgvector` extension, indexes (in `hnsw` or `ivfflat` formats) only support up to `2000` dimensions.
Solution
- Use a vector store supporting a higher number of dimensions, such as Qdrant.
- Do not use indexes for vectors or disable them from your third-party library. This may limit performance in vector similarity search for significant volumes.
- When using the Langchain `PGVector` method, note that it does not create an index by default and should not raise errors.
- When using the Mastra library with `vectorStoreName: "pgvector"`, specify the `indexConfig` type as `flat` to avoid creating any index on vector dimensions:

  ```ts
  await vectorStore.createIndex({
    indexName: 'papers',
    dimension: 3584,
    indexConfig: { "type": "flat" },
  });
  ```

- Use a model with a lower number of dimensions. Using Managed Inference, you can deploy for instance the `sentence-t5-xxl` model, which represents vectors with `768` dimensions.
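If you are unsure how many dimensions your embeddings have, you can check the length of a returned vector directly. This is a minimal sketch using the OpenAI Python SDK against the Generative APIs embeddings endpoint:

```python
import os

from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key=os.getenv("SCW_SECRET_KEY"))

response = client.embeddings.create(
    model="bge-multilingual-gemma2",
    input="A sample sentence to embed.",
)
vector = response.data[0].embedding
print(len(vector))  # 3584, above the 2000-dimension limit of pgvector hnsw/ivfflat indexes
```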
Previous messages are not taken into account by the model
Causes
- Previous messages are not sent to the model
- The content sent exceeds the maximum context window for this model
Solution
- LLMs are completely "stateless" and thus do not store previous messages or conversations. For example, when building a chatbot application, each time a new message is sent by the user, all preceding messages in the conversation need to be sent through the API payload. Example payload for a multi-turn conversation:
```python
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.scaleway.ai/v1",
    api_key=os.getenv("SCW_SECRET_KEY")
)

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[
        {
            "role": "user",
            "content": "What is the solution to 1+1= ?"
        },
        {
            "role": "assistant",
            "content": "2"
        },
        {
            "role": "user",
            "content": "Double this number"
        }
    ]
)
print(response.choices[0].message.content)
```

This snippet will output the model response, which is 4.
- When exceeding the maximum context window, you should receive a `400 - BadRequestError` detailing the context length value you exceeded. In this case, you should reduce the size of the content you send to the API.
Best practices for optimizing model performance
Input size management
- Avoid overly long input sequences; break them into smaller chunks if needed.
- Use summarization techniques for large inputs to reduce token count while maintaining relevance.
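For instance, a large document can be split on paragraph boundaries and summarized chunk by chunk. This is a rough sketch assuming the OpenAI Python SDK and a simple character-based budget (roughly 4 characters per token); adjust the chunk size to the context window of your model:

```python
import os

from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key=os.getenv("SCW_SECRET_KEY"))

def split_into_chunks(text: str, max_chars: int = 8000) -> list[str]:
    # Split on paragraph boundaries so each chunk stays well below the context window.
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        if current and len(current) + len(paragraph) > max_chars:
            chunks.append(current)
            current = ""
        current += paragraph + "\n\n"
    if current:
        chunks.append(current)
    return chunks

document = open("large_document.txt").read()
summaries = []
for chunk in split_into_chunks(document):
    response = client.chat.completions.create(
        model="llama-3.1-8b-instruct",
        messages=[{"role": "user", "content": f"Summarize the following text:\n\n{chunk}"}],
        max_tokens=300,
    )
    summaries.append(response.choices[0].message.content)
```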
Use proper parameter configuration
- Double-check parameters like `"temperature"`, `"max_tokens"`, and `"top_p"` to ensure they align with your use case.
- For structured output, always include a `"response_format"` and, if possible, a detailed JSON schema.
Debugging silent errors
- For cases where no explicit error is returned:
- Verify all fields in the API request are correctly named and formatted.
- Test the request with smaller and simpler inputs to isolate potential issues.
Still need help? Create a support ticket