Generative APIs FAQ
Overview
What are Scaleway Generative APIs?
Scaleway's Generative APIs provide access to pre-configured, serverless endpoints of leading AI models, hosted in European data centers. This allows you to integrate advanced AI capabilities into your applications without managing underlying infrastructure.
What is the difference between Generative APIs and Managed Inference?
- Generative APIs: A serverless service providing access to pre-configured AI models via API, billed per token usage.
- Managed Inference: Allows deployment of curated or custom models with chosen quantization and Instances, offering predictable throughput and enhanced security features like private network isolation and access control. Managed Inference is billed by hourly usage, whether provisioned capacity is receiving traffic or not.
How do I get started with Generative APIs?
To get started, explore the Generative APIs Playground in the Scaleway console. For application integration, refer to our Quickstart guide, which provides step-by-step instructions on accessing, configuring, and using a Generative APIs endpoint.
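For a first request from code, here is a minimal sketch using the OpenAI Python client. The base URL and API key placeholder below are assumptions for illustration; check the Quickstart and your console for the exact endpoint and credentials of your Project.

```python
# Minimal sketch: a chat completion against a Generative APIs endpoint
# using the OpenAI Python client. The base_url is an assumption — check
# the console/Quickstart for the exact endpoint of your Project.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.scaleway.ai/v1",  # assumed Generative APIs endpoint
    api_key="SCW_SECRET_KEY",               # replace with your Scaleway API secret key
)

response = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```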
Offering and availability
Where are the inference servers located?
All models are currently hosted in a secure data center located in Paris, France, operated by OPCORE. This ensures low latency for European users and compliance with European data privacy regulations.
Which models are supported by Generative APIs?
Our Generative APIs support a range of popular models, including:
- Chat / Text Generation models: Refer to our dedicated documentation for a list of supported chat models.
- Vision models: Refer to our dedicated documentation for a list of supported vision models.
- Embedding models: Refer to our dedicated documentation for a list of supported embedding models.
What is the model lifecycle for Generative APIs?
Scaleway is committed to offering the latest versions of generative AI models while keeping older models accessible for a significant period, so that your production applications remain reliable. Learn more in our model lifecycle policy.
Pricing and billing
How does the free tier work?
The free tier allows you to process up to 1,000,000 tokens without incurring any costs. After reaching this limit, you will be charged per million tokens processed. Free tier usage is calculated by adding all input and output tokens consumed across all models used. For more information, refer to our pricing page, or access your bills, detailed by token type and model, in the billing section of the Scaleway console (past bills and the provisional bill for the current month).
Note that when your consumption exceeds the free tier, you will be billed for each additional token consumed, per model and token type. The minimum billing unit is 1 million tokens. Here are two examples of low volume consumption:
Example 1: Free Tier only
| Model | Token type | Tokens consumed | Price | Bill |
|---|---|---|---|---|
| llama-3.3-70b-instruct | Input | 500k | €0.90/million tokens | €0.00 |
| llama-3.3-70b-instruct | Output | 200k | €0.90/million tokens | €0.00 |
| mistral-small-3.1-24b-instruct-2503 | Input | 100k | €0.15/million tokens | €0.00 |
| mistral-small-3.1-24b-instruct-2503 | Output | 100k | €0.35/million tokens | €0.00 |
Total tokens consumed: 900k
Total bill: €0.00
Example 2: Exceeding Free Tier
| Model | Token type | Tokens consumed | Price | Billed consumption | Bill |
|---|---|---|---|---|---|
| llama-3.3-70b-instruct | Input | 800k | €0.90/million tokens | 1 million tokens | €0.00 (Free Tier application) |
| llama-3.3-70b-instruct | Output | 2,500k | €0.90/million tokens | 3 million tokens | €2.70 |
| mistral-small-3.1-24b-instruct-2503 | Input | 100k | €0.15/million tokens | 1 million tokens | €0.15 |
| mistral-small-3.1-24b-instruct-2503 | Output | 100k | €0.35/million tokens | 1 million tokens | €0.35 |
Total tokens consumed: 3,500k (3.5 million)
Total billed consumption: 6 million tokens
Total bill: €3.20
Note that in this example, the first line where the free tier applies will not display in your current Scaleway bills by model, but will instead be listed under Generative APIs Free Tier - First 1M tokens for free.
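To make the rounding explicit, here is a small illustration of the arithmetic from Example 2, assuming the free tier is applied to the first billed line as shown above. This is only a sketch of the examples, not an official billing formula.

```python
# Sketch: reproduce the billed consumption and total of Example 2.
# The free tier allocation (applied to the first line) is an illustrative
# assumption; actual bills are computed by Scaleway's billing system.
import math

PRICES = {  # €/million tokens, from the tables above
    ("llama-3.3-70b-instruct", "input"): 0.90,
    ("llama-3.3-70b-instruct", "output"): 0.90,
    ("mistral-small-3.1-24b-instruct-2503", "input"): 0.15,
    ("mistral-small-3.1-24b-instruct-2503", "output"): 0.35,
}

usage = [  # (model, token type, tokens consumed)
    ("llama-3.3-70b-instruct", "input", 800_000),
    ("llama-3.3-70b-instruct", "output", 2_500_000),
    ("mistral-small-3.1-24b-instruct-2503", "input", 100_000),
    ("mistral-small-3.1-24b-instruct-2503", "output", 100_000),
]

free_tier_millions = 1  # first 1M tokens are free
total = 0.0
for model, token_type, tokens in usage:
    billed_millions = math.ceil(tokens / 1_000_000)  # 1M token minimum billing unit
    if free_tier_millions > 0:
        discount = min(billed_millions, free_tier_millions)
        billed_millions -= discount
        free_tier_millions -= discount
    total += billed_millions * PRICES[(model, token_type)]

print(f"Total bill: €{total:.2f}")  # €3.20 for Example 2
```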
What are tokens, and how are they counted?
A token is the minimum unit of content that is seen and processed by a model. Hence, token definitions depend on input types:
- For text, on average, 1 token corresponds to ~4 characters, and thus 0.75 words (as words are on average five characters long).
- For images, 1 token corresponds to a square of pixels. For example, `mistral-small-3.1-24b-instruct-2503` image tokens are 28x28 pixels (28 pixels high and 28 pixels wide, hence 784 pixels in total).
- For audio, 1 token corresponds to a duration of time. For example, `voxtral-small-24b-2507` audio tokens are 80 milliseconds.
  - Some models process audio in chunks with a minimum duration. For example, `voxtral-small-24b-2507` processes audio in 30-second chunks. This means audio lasting 13 seconds will be counted as 375 tokens (30 seconds / 0.08 seconds), and audio lasting 178 seconds will be counted as 2,250 tokens (30 seconds * 6 / 0.08 seconds).
The exact token count and definition depend on the tokenizer used by each model. When this difference is significant (such as for image processing), you can find detailed information in each model's documentation (for instance in mistral-small-3.1-24b-instruct-2503 size limit documentation). When the model is open, you can also find this information in the model files on platforms such as Hugging Face, usually in the tokenizer_config.json file.
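If you want to estimate text token counts before sending a request, and the model is open, you can load its tokenizer from Hugging Face and count tokens locally. Here is a minimal sketch, assuming the `transformers` library is installed and that the repository ID below matches the published model (check the model card for the exact name and any access conditions).

```python
# Sketch: estimate a text token count locally with a Hugging Face tokenizer.
# The repository ID is an assumption — check the model card for the exact name.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-Small-3.1-24B-Instruct-2503")

text = "Scaleway Generative APIs provide serverless access to AI models."
tokens = tokenizer.encode(text, add_special_tokens=False)
print(f"{len(tokens)} tokens for {len(text)} characters")
```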
How can I monitor my token consumption?
You can see your token consumption in Scaleway Cockpit. You can access it from the Scaleway console under the Metrics tab. Note that:
- Cockpits are isolated by Project, hence you first need to select the right Project in the Scaleway console before accessing Cockpit to see your token consumption for this Project (you can see the `project_id` in the Cockpit URL: `https://{project_id}.dashboard.obs.fr-par.scw.cloud/`).
- Cockpit graphs can take up to 5 minutes to update token consumption. See Troubleshooting for further details.
Can I configure a maximum billing threshold?
Currently, you cannot configure a specific threshold after which your usage will be blocked. However:
- You can configure billing alerts to ensure you are warned when you hit specific budget thresholds.
- Your total billing remains limited by the amount of tokens you can consume within rate limits.
- If you want to ensure fixed billing, you can use Managed Inference, which provides the same set of OpenAI-compatible APIs and a wider range of models.
How can I give access to token consumption to my users outside Scaleway?
If your users do not have a Scaleway account, you can still give them access to their Generative API usage consumption by either:
- Collecting consumption data from the Billing API and exposing it to your users. Consumption can be detailed by Projects.
- Collecting consumption data from Cockpit data sources and exposing it to your users. As an example, you can query consumption using the following query:
```
curl -G 'https://{data-source-id}.metrics.cockpit.fr-par.scw.cloud/prometheus/api/v1/query_range' \
  --data-urlencode 'query=generative_apis_tokens_total{resource_name=~".*",type=~"(input_tokens|output_tokens)"}' \
  --data-urlencode 'start=2025-03-15T20:10:51.781Z' \
  --data-urlencode 'end=2025-03-20T20:10:51.781Z' \
  --data-urlencode 'step=1h' \
  -H "Authorization: Bearer $COCKPIT_TOKEN" | jq
```
Make sure that you replace the following values:
- `data-source-id`: the ID of your Scaleway metrics data source
- `$COCKPIT_TOKEN`: your Cockpit token
- `start` and `end` time properties: your specific time range
Specifications
What are the SLAs applicable to Generative APIs?
Generative APIs targets a 99.9% monthly availability rate, as detailed in the Service Level Agreement for Generative APIs.
What are the performance guarantees (vs Managed Inference)?
Generative APIs is optimized and monitored to provide reliable performance in most use cases, but does not strictly guarantee performance as it depends on many client-side parameters. We recommend using Managed Inference (dedicated deployment capacity) for applications with critical performance requirements.
As an order of magnitude, for Chat models, when performing requests with streaming activated:
- Time to first token should be less than 1 second for most standard queries (with less than 1,000 input tokens)
- Output token generation speed should be above 100 tokens per second for recent small to medium size models (such as `gpt-oss-120b` or `mistral-small-3.2-24b-instruct-2506`)

Exact performance will still vary based mainly on the following factors:
- Model size and architecture: Smaller and more recent models usually provide better performance.
- Model type:
  - Chat models' time to first token increases proportionally to the input context size after a certain threshold (usually above 1,000 tokens).
  - Audio transcription models' time to first token remains mostly constant, as they only need to process a small number of input tokens (a 30-second audio chunk) to generate a first output.
- Input and output size: In rough terms, total processing time is proportional to input and output size. However, for larger queries (usually above 10,000 tokens), processing speed may degrade with query size. For optimal performance, we recommend splitting queries into the smallest meaningful parts (10 queries with 1,000 input tokens and 100 output tokens each will be processed faster than 1 query with 10,000 input tokens and 1,000 output tokens).
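To check these orders of magnitude against your own workload, you can measure time to first token and streaming speed client-side. Below is a rough sketch with the OpenAI Python client; the base URL is an assumption, each streamed chunk is treated as roughly one token, and measured values include your network latency.

```python
# Rough sketch: measure time to first token and output speed client-side
# with a streaming chat completion. Each chunk is counted as ~1 token.
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://api.scaleway.ai/v1",  # assumed Generative APIs endpoint
    api_key="SCW_SECRET_KEY",
)

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="mistral-small-3.2-24b-instruct-2506",
    messages=[{"role": "user", "content": "List 20 European capitals."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1

if first_token_at is None:
    raise RuntimeError("No content received")

elapsed = time.perf_counter() - first_token_at
print(f"Time to first token: {first_token_at - start:.2f}s")
print(f"~{chunks / elapsed:.0f} output chunks (≈ tokens) per second")
```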
Quotas and limitations
Are there any rate limits for API usage?
Yes, API rate limits define the maximum number of requests a user can make within a specific time frame to ensure fair access and resource allocation between users. If you require increased rate limits, we recommend either:
- Using Managed Inference, which provides dedicated capacity and doesn't enforce rate limits (you remain limited by the total provisioned capacity)
- Contacting your existing Scaleway account manager or our Sales team to discuss a volume commitment for specific models, which will allow us to increase your quotas proportionally.
Refer to our dedicated documentation for more information on rate limits.
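On the client side, a common way to stay within rate limits is to retry rejected requests with exponential backoff. Here is a hedged sketch with the OpenAI Python client; it assumes rate-limited requests surface as `openai.RateLimitError` (HTTP 429), and the backoff schedule is arbitrary.

```python
# Sketch: retry a request with exponential backoff when hitting rate limits.
# The backoff schedule is arbitrary; adjust it to your own traffic pattern.
import time
import openai
from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key="SCW_SECRET_KEY")

def chat_with_retry(messages, model="llama-3.3-70b-instruct", max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except openai.RateLimitError:
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ... between attempts
    raise RuntimeError("Still rate limited after retries")
```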
Can I increase maximum output (completion) tokens for a model?
No, you cannot increase maximum output tokens above the limits set for each model in Generative APIs. These limits are in place to protect you against:
- Long generations, which may end in an HTTP timeout. Limits are designed to ensure a model sends its HTTP response in less than 5 minutes.
- Uncontrolled billing, as several models are known to enter infinite generation loops (specific prompts can make the model generate the same sentence over and over, without ever stopping).

If you require higher maximum output tokens, you can use Managed Inference, where these limits do not apply (as your bill is limited by the size of your deployment).
Compatibility and integration
Can I use OpenAI libraries and APIs with Scaleway's Generative APIs?
Yes, Scaleway's Generative APIs are designed to be compatible with OpenAI libraries and SDKs, including the OpenAI Python client library and LangChain SDKs. This allows for seamless integration with existing workflows.
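For example, pointing LangChain's OpenAI chat integration at a Generative APIs endpoint only requires overriding the base URL. A minimal sketch follows; the base URL is an assumption to be checked against your console.

```python
# Sketch: using LangChain's OpenAI chat integration against Generative APIs.
# The base_url is an assumption — check your console for the exact endpoint.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="llama-3.3-70b-instruct",
    api_key="SCW_SECRET_KEY",
    base_url="https://api.scaleway.ai/v1",
)
print(llm.invoke("Give me one fun fact about Paris.").content)
```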
How can I convert audio files to a supported format?
For audio transcription, supported formats are: flac, mp3, mpeg, mpga, oga, ogg, wav.
For unsupported formats such as m4a, we recommend using third-party libraries or tools, such as ffmpeg or VLC, to convert them to a supported format.
For example, you can convert an m4a file to mp3 using ffmpeg with:
```
ffmpeg -i audio-file.m4a audio-file.mp3
```
where `audio-file.m4a` is your original file.
Which vector database can I use to store embeddings?
Since the /embeddings API returns a raw list of vector coordinates, all vector databases are compatible with this format by default.
However, some vector databases may only support a maximum number of dimensions that is below the number of dimensions returned by a model.
In this case, we recommend using models that support a custom number of dimensions (also known as Matryoshka embeddings).
As an example, when using the PostgreSQL pgvector extension, we recommend using the qwen3-embedding-8b embedding model with 2000 dimensions, to ensure compatibility with vector indexes such as hnsw or ivfflat.
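As a sketch of this approach, assuming the model honours the OpenAI-style `dimensions` parameter (check the model's documentation to confirm Matryoshka support), you could request 2000-dimension vectors before inserting them into a pgvector column:

```python
# Sketch: request reduced-dimension embeddings so they fit pgvector index limits.
# Whether a given model honours the `dimensions` parameter is an assumption —
# check the model's documentation. The base_url is also an assumption.
from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key="SCW_SECRET_KEY")

response = client.embeddings.create(
    model="qwen3-embedding-8b",
    input="Scaleway data centers are located in Paris, France.",
    dimensions=2000,  # stay within pgvector's hnsw/ivfflat index limits
)
vector = response.data[0].embedding
print(len(vector))  # expected: 2000
```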
Usage and management
Do model licenses apply when using Generative APIs?
Yes, you need to comply with model licenses when using Generative APIs. Applicable licenses are available for each model in our documentation and in the console Playground.
Privacy and security
Where can I find the privacy policy regarding Generative APIs?
You can find the privacy policy applicable to all use of Generative APIs here.