Generative APIs FAQ
Overview
What is Scaleway Generative APIs - Serverless?
Scaleway's Generative APIs - Serverless (formerly known as Scaleway Generative APIs) provides access to pre-configured, serverless endpoints of leading AI models, hosted in European data centers. This allows you to integrate advanced AI capabilities into your applications without managing underlying infrastructure.
What is Scaleway Generative APIs - Dedicated Deployment?
Scaleway's Generative APIs - Dedicated Deployment (formerly known as Scaleway Managed Inference) is a fully managed service that allows you to deploy, run, and scale AI models in a dedicated environment. It provides optimized infrastructure, customizable deployment options, and secure access controls to meet the needs of enterprises and developers looking for high-performance inference solutions.
What is the difference between Serverless and Dedicated Deployment?
- Generative APIs - Serverless (formerly known as Scaleway Generative APIs): A serverless service providing access to pre-configured AI models via API, billed per token usage.
- Generative APIs - Dedicated Deployment (formerly known as Scaleway Managed Inference): Allows deployment of curated or custom models with chosen quantization and Instances, offering predictable throughput and enhanced security features such as private network isolation and access control. The service is billed based on hourly usage, regardless of whether the provisioned capacity is receiving traffic or not.
I'm looking for Scaleway Managed Inference. Has it been discontinued?
Managed Inference has been renamed to Generative APIs - Dedicated Deployment. It is the same product, just with a new name. All features and functionality remain unchanged.
Is Generative APIs - Dedicated Deployment suitable for real-time applications?
Yes, Generative APIs - Dedicated Deployment is designed for low-latency, high-throughput applications, making it suitable for real-time use cases such as chatbots, recommendation systems, fraud detection, and live video processing.
Can I fine-tune or retrain my models within Generative APIs - Dedicated Deployment?
Generative APIs - Dedicated Deployment is primarily designed for deploying and running inference workloads. If you need to fine-tune or retrain models, you may need to use a separate training environment, such as Scaleway’s GPU Instances, and then deploy the trained model in Generative APIs - Dedicated Deployment.
Getting started
How do I get started with Generative APIs - Serverless?
To get started, explore the Generative APIs Playground in the Scaleway console. For application integration, refer to our Quickstart guide, which provides step-by-step instructions on accessing, configuring, and using a Generative APIs endpoint.
How do I deploy a model using Generative APIs - Dedicated Deployment?
Deployment is done through Scaleway's console or API. You can choose a model from Scaleway’s selection or import your own directly from Hugging Face's repositories, configure Instance types, set up networking options, and start inference with minimal setup. For details, see the document about how to create a deployment.
Can I run inference on private models?
Yes, Generative APIs - Dedicated Deployment allows you to deploy private models with access control settings. You can restrict access to specific users, teams, or networks.
Offering and availability
Which models are supported by Generative APIs?
Our Generative APIs support a range of popular models, including:
- Large Language Models (LLMs)
- Chat / Text Generation models
- Vision models
- Embedding models
- Audio recognition models
- Custom AI models (currently available through the API only)
For details, refer to our Supported models catalog.
Generative APIs - Dedicated Deployment supports both open-source models and your own uploaded proprietary models.
What is the model lifecycle for Generative APIs - Serverless?
Scaleway is dedicated to updating and offering the latest versions of generative AI models, while keeping older models accessible for a significant period so that your production applications remain reliable. Learn more in our model lifecycle policy.
Where are the inference servers located?
All models are currently hosted in a secure data center located in Paris, France, operated by OPCORE. This ensures low latency for European users and compliance with European data privacy regulations.
What Instance types are available for inference?
Generative APIs - Dedicated Deployment offers different Instance types optimized for various workloads from Scaleway's GPU Instances range. You can select the Instance type based on your model’s computational needs and compatibility.
Pricing and billing
How does the Free Tier work?
There is a Free Tier available for Generative APIs - Serverless. The Free Tier allows you to process, without incurring any costs, up to:
- 1,000,000 tokens for models billed by tokens
- 60 minutes of audio transcription for models billed by audio minutes
After reaching this limit, you will be charged per million tokens processed and per minute of audio processed. Free Tier usage is calculated by adding all input/output tokens and audio minutes consumed across all the models you use. For more information, refer to our pricing page, or access your bills by model in the billing section of the Scaleway console (past and provisional bills for the current month).
When your consumption exceeds the Free Tier, you will be billed for each additional token consumed, per model and token type. The minimum billing unit is 1,000 tokens. Here are two examples of low-volume consumption:
Example 1: Free Tier only
| Model | Token type | Tokens consumed | Price | Bill |
|---|---|---|---|---|
| llama-3.3-70b-instruct | Input | 500k | €0.90/million tokens | €0.00 |
| llama-3.3-70b-instruct | Output | 200k | €0.90/million tokens | €0.00 |
| mistral-small-3.2-24b-instruct-2506 | Input | 100k | €0.15/million tokens | €0.00 |
| mistral-small-3.2-24b-instruct-2506 | Output | 100k | €0.35/million tokens | €0.00 |

Total tokens consumed: 900k
Total bill: €0.00
Example 2: Exceeding the Free Tier
| Model | Token type | Tokens consumed | Price | Billed consumption | Bill |
|---|---|---|---|---|---|
| llama-3.3-70b-instruct | Input | 800k | €0.90/million tokens | 800k tokens | €0.72 |
| llama-3.3-70b-instruct | Output | 2,500k | €0.90/million tokens | 1.5 million tokens | €1.35 (€2.25 - €0.90 from Free Tier application) |
| mistral-small-3.1-24b-instruct-2503 | Input | 100k | €0.15/million tokens | 100k tokens | €0.02 (€0.015 rounded up to a euro cent) |
| mistral-small-3.1-24b-instruct-2503 | Output | 100k | €0.35/million tokens | 100k tokens | €0.04 (€0.035 rounded up to a euro cent) |

Total tokens consumed: 3.5 million tokens
Total billed consumption: 2.5 million tokens
Total bill: €2.13
Note that in this example, the first line where the Free Tier applies will not be displayed in your current Scaleway bills by model, but will instead be listed under Offer deducted - Generative APIs Free Tier. If you are using multiple Projects (configuring a URL such as `api.scaleway.ai/{project_id}/v1/chat/completions`), the Free Tier is applied to each Project proportionally to its Generative APIs consumption. Assuming project A consumed €1, project B consumed €3, and the Free Tier amount is €2: project A's Free Tier will be (1/(3+1))*2 = €0.50 and project B's Free Tier will be (3/(3+1))*2 = €1.50.
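As an illustration of this proportional split, here is a minimal sketch in plain Python reproducing the calculation above (the Project names and amounts are the hypothetical figures from the example):

```python
# Minimal sketch: split a Free Tier amount across Projects proportionally
# to their Generative APIs consumption (hypothetical figures from the example above).
free_tier_eur = 2.00
consumption_eur = {"project-a": 1.00, "project-b": 3.00}

total = sum(consumption_eur.values())
free_tier_share = {
    project: round(free_tier_eur * amount / total, 2)
    for project, amount in consumption_eur.items()
}

print(free_tier_share)  # {'project-a': 0.5, 'project-b': 1.5}
```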
Free Tier for audio transcription models is applied similarly for models billed by audio minutes.
What are tokens, and how are they counted?
A token is the minimum unit of content that is seen and processed by a model. Hence, token definitions depend on input types:
- For text, on average, 1 token corresponds to ~4 characters, and thus ~0.75 words (as words are on average five characters long).
- For images, 1 token corresponds to a square of pixels. For example, `mistral-small-3.1-24b-instruct-2503` image tokens are 28x28 pixels (28 pixels high and 28 pixels wide, hence 784 pixels in total).
- For audio, 1 token corresponds to a duration of time. For example, `voxtral-small-24b-2507` audio tokens are 80 milliseconds.
  - Some models process audio in chunks with a minimum duration. For example, `voxtral-small-24b-2507` processes audio in 30-second chunks. This means audio lasting 13 seconds counts as 375 tokens (30 seconds / 0.08 seconds), and audio lasting 178 seconds counts as 2,250 tokens (30 seconds * 6 / 0.08 seconds).
The exact token count and definition depend on the tokenizer used by each model. When this difference is significant (such as for image processing), you can find detailed information in each model's documentation (for instance, in mistral-small-3.1-24b-instruct-2503 size limit documentation). When the model is open, you can also find this information in the model files on platforms such as Hugging Face, usually in the tokenizer_config.json file.
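If you want to estimate text token counts locally before sending a request, a minimal sketch using the Hugging Face `transformers` library is shown below. The tokenizer repository name is an assumption for illustration; use the tokenizer that matches the model you actually call.

```python
# Minimal sketch: estimate the text token count for a prompt locally.
# Assumes the `transformers` package is installed; the repository name below
# is a hypothetical placeholder to replace with your target model's tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-Small-3.1-24B-Instruct-2503")

prompt = "Scaleway Generative APIs provide serverless access to AI models."
token_ids = tokenizer.encode(prompt)

print(len(token_ids), "tokens")  # roughly len(prompt) / 4 for typical English text
```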
How can I monitor my token consumption?
You can see your token consumption in Scaleway Cockpit. You can access it from the Scaleway console under the Metrics tab. Note that:
- Cockpits are isolated by Project, hence you first need to select the right Project in the Scaleway console before accessing Cockpit to see your token consumption for this Project (you can see the `project_id` in the Cockpit URL: `https://{project_id}.dashboard.obs.fr-par.scw.cloud/`).
- Cockpit graphs can take up to 5 minutes to update token consumption. See Troubleshooting for further details.
In these dashboards, you can view consumption by IAM principal (users or applications) by filtering on the IAM principal label or by opening the Usage details per IAM Principal ID panel.
Note that for low consumption volumes (less than a few million tokens per hour), the displayed values over long time ranges (e.g., several days) may be inaccurate by several percent. This is a known limitation of Grafana queries using PromQL for discrete values: while exact consumption is stored in metrics, aggregated values in graphs are extrapolated from instantaneous samples.
To minimize this effect, you can:
- Select a narrower time range (e.g., 1 hour) to ensure sufficient data samples for accurate reconstruction of the consumption graph.
- Query the data directly using PromQL on the data source.
Can I configure a maximum billing threshold?
Currently, you cannot configure a specific threshold after which your usage will be blocked. However:
- You can configure billing alerts to ensure you are warned when you hit specific budget thresholds.
- Your total billing remains limited by the amount of tokens you can consume within rate limits.
- To ensure fixed billing, you can use Generative APIs - Dedicated Deployment, which provides the same set of OpenAI-compatible APIs and a wider range of models.
How can I give access to token consumption to my users outside Scaleway?
If your users do not have a Scaleway account, you can still give them access to their Generative APIs consumption by either:
- Collecting consumption data from the Billing API and exposing it to your users. Consumption can be detailed by Projects.
- Collecting consumption data from Cockpit data sources and exposing it to your users. As an example, you can query consumption using the following query:
```
curl -G 'https://{data-source-id}.metrics.cockpit.fr-par.scw.cloud/prometheus/api/v1/query_range' \
  --data-urlencode 'query=generative_apis_tokens_total{resource_name=~".*",type=~"(input_tokens|output_tokens)"}' \
  --data-urlencode 'start=2025-03-15T20:10:51.781Z' \
  --data-urlencode 'end=2025-03-20T20:10:51.781Z' \
  --data-urlencode 'step=1h' \
  -H "Authorization: Bearer $COCKPIT_TOKEN" | jq
```
Make sure that you replace the following values:
- `data-source-id`: the ID of your Scaleway metrics data source
- `$COCKPIT_TOKEN`: your Cockpit token
- `start` and `end` time properties: your specific time range
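If you prefer querying from code rather than curl, the sketch below runs the same Prometheus range query with Python's `requests` library and prints the latest sample per series. The data source ID, token, and time range are placeholders to replace with your own values.

```python
# Minimal sketch: run the same Prometheus range query from Python and print
# the most recent sample per series. Placeholders must be replaced.
import os
import requests

DATA_SOURCE_ID = "your-data-source-id"  # placeholder
COCKPIT_TOKEN = os.environ["COCKPIT_TOKEN"]

resp = requests.get(
    f"https://{DATA_SOURCE_ID}.metrics.cockpit.fr-par.scw.cloud/prometheus/api/v1/query_range",
    params={
        "query": 'generative_apis_tokens_total{resource_name=~".*",type=~"(input_tokens|output_tokens)"}',
        "start": "2025-03-15T20:10:51.781Z",
        "end": "2025-03-20T20:10:51.781Z",
        "step": "1h",
    },
    headers={"Authorization": f"Bearer {COCKPIT_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    labels = series["metric"]
    last_timestamp, last_value = series["values"][-1]  # last sample in the range
    print(labels.get("resource_name"), labels.get("type"), last_value)
```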
How is Generative APIs - Dedicated Deployment billed?
Billing is based on the Instance type and usage duration (in minutes). Unlike Generative APIs - Serverless, which are billed per token, Generative APIs - Dedicated Deployment provides predictable costs based on the allocated infrastructure. Billing only starts when a deployment is ready and can be queried. Pricing details can be found on the Scaleway pricing page.
Can I pause Generative APIs - Dedicated Deployment billing when the Instance is not in use?
When a dedicated Generative APIs deployment is running, corresponding resources are provisioned and thus billed. Resources can therefore not be paused. However, you can still optimize your Generative APIs dedicated deployment to fit within specific time ranges (such as during working hours). To do so, you can automate deployment creation and deletion using the Generative APIs - Dedicated Deployment API, Terraform, or Scaleway SDKs. These actions can be programmed using Serverless Jobs to be automatically carried out periodically.
Specifications
What are the SLAs applicable to Generative APIs - Serverless?
Generative APIs - Serverless targets a 99.9% monthly availability rate, as detailed in Service Level Agreement - Generative APIs.
What are the SLAs applicable to Generative APIs - Dedicated Deployment?
We are currently working on defining our SLAs for Generative APIs - Dedicated Deployment. We will provide more information on this topic soon.
What are the performance guarantees for Serverless (vs Dedicated Deployment)?
Generative APIs - Serverless is optimized and monitored to provide reliable performance in most use cases, but does not strictly guarantee performance, as it depends on many client-side parameters. We recommend using Generative APIs - Dedicated Deployment for applications with critical performance requirements.
As an order of magnitude, for Chat models, when performing requests with streaming activated:
- Time to first token should be less than 1 second for most standard queries (with fewer than 1,000 input tokens).
- Output token generation speed should be above 100 tokens per second for recent small to medium-sized models (such as `gpt-oss-120b` or `mistral-small-3.2-24b-instruct-2506`).
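A minimal way to check these figures for your own workload is sketched below, using the OpenAI Python client against the Generative APIs endpoint with streaming enabled. The base URL and model name follow the examples elsewhere on this page; counting one streamed chunk as one token is a rough approximation.

```python
# Minimal sketch: measure time to first token and output speed for a streamed
# chat completion. Assumes the `openai` package and a Scaleway secret key.
import os
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://api.scaleway.ai/v1",   # Generative APIs endpoint
    api_key=os.environ["SCW_SECRET_KEY"],     # your Scaleway secret key
)

start = time.perf_counter()
first_token_at = None
chunks_received = 0

stream = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Give me three facts about Paris."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks_received += 1  # rough approximation: one streamed chunk ~ one token

if first_token_at is not None:
    print(f"time to first token: {first_token_at - start:.2f}s")
    generation_time = time.perf_counter() - first_token_at
    if generation_time > 0:
        print(f"~{chunks_received / generation_time:.0f} tokens/s after the first token")
```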
Exact performance will still vary based mainly on the following factors:
- Model size and architecture: Smaller and more recent models usually provide better performance.
- Model type:
  - Chat models' time to first token increases proportionally to the input context size after a certain threshold (usually above 1,000 tokens).
  - Audio transcription models' time to first token remains mostly constant, as they only need to process a small number of input tokens (a 30-second audio chunk) to generate a first output.
- Input and output size: In rough terms, total processing time is proportional to input and output size. However, for larger queries (usually above 10,000 tokens), processing speed may degrade with query size. For optimal performance, we recommend splitting queries into the smallest meaningful parts (10 queries with 1,000 input tokens and 100 output tokens each will be processed faster than 1 query with 10,000 input tokens and 1,000 output tokens).
How long does a batch take to be processed using the Serverless Batches API endpoint, and how do I optimize this time?
We aim to process any batch within 24 hours. After this delay, batch processing is stopped, and any remaining unprocessed queries are not billed. Batches are processed in the order they were created. You can reduce the time before receiving batch output by splitting a batch into multiple smaller ones (see the sketch after this list). For example:
- Assuming a batch of 10,000 requests takes 10 hours to be processed.
- Splitting this batch into 10 batches of 1,000 requests each still takes about the same total time (e.g., 10 hours) to process all batches. However, the first batch output will be provided after 1 hour, the second one after 2 hours, and so on.
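As an illustration, here is a minimal sketch that splits a large batch input file (in JSONL format, one request per line) into smaller files before submission. The file names are hypothetical placeholders; each resulting file can then be submitted as its own batch through the Batches API, so the first results arrive after the first batch completes rather than after the whole workload.

```python
# Minimal sketch: split a large batch input file (one JSON request per line)
# into smaller batches so earlier batches return results sooner.
# File names are hypothetical placeholders.
from pathlib import Path

BATCH_SIZE = 1_000
lines = Path("all_requests.jsonl").read_text().splitlines()

for i in range(0, len(lines), BATCH_SIZE):
    part = Path(f"batch_{i // BATCH_SIZE:03d}.jsonl")
    part.write_text("\n".join(lines[i:i + BATCH_SIZE]) + "\n")
    print(f"wrote {part} with {len(lines[i:i + BATCH_SIZE])} requests")
```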
What are the performance guarantees for Dedicated Deployment (vs Serverless)?
Generative APIs - Dedicated Deployment provides dedicated resources, ensuring predictable performance and lower latency compared to Generative APIs - Serverless, which is a shared, serverless offering optimized for infrequent traffic with moderate peak loads. Generative APIs - Dedicated Deployment is ideal for workloads that require consistent response times, high availability, custom hardware configurations, or generate extreme peak loads during a narrow period. Compared to Generative APIs - Serverless, no usage quota is applied to the number of tokens per second generated, since the output is limited by the GPU Instance size and the number of your dedicated Scaleway Generative APIs deployments.
Quotas and limitations
Are there any rate limits for Serverless API usage?
Yes, API rate limits define the maximum number of requests a user can make within a specific time frame to ensure fair access and resource allocation between users. If you require increased rate limits, we recommend either:
- Using the Batches API for non-real time workloads. Requests performed through the Batches API do not have a rate limit and are billed with a -50% discount compared to standard model prices.
- Using Generative APIs - Dedicated Deployment, which provides dedicated capacity and doesn't enforce rate limits (you remain limited by the total provisioned capacity)
- Contacting your existing Scaleway account manager or our Sales team to discuss volume commitment for specific models that will allow us to increase your quota proportionally.
Refer to our dedicated documentation for more information on rate limits.
Can I increase maximum output (completion) tokens for a model when using Serverless?
No, you cannot increase the maximum number of output tokens beyond the limits set for each model. These limits are in place to protect you against:
- Long generations, which may end with an HTTP timeout. Limits are designed to ensure a model sends its HTTP response in less than 5 minutes.
- Uncontrolled billing, as several models are known to be able to enter infinite generation loops (specific prompts can make the model generate the same sentence over and over, without stopping).

If you require a higher maximum number of output tokens, you can use Generative APIs - Dedicated Deployment, where these limits do not apply (as your bill is limited by the size of your deployment).
Can I increase the maximum number of concurrent requests when using Serverless?
By default, you cannot increase the maximum number of concurrent requests beyond the limits for all models.
However, for embedding models, you can batch multiple inputs by providing an array of strings in a single request. For example, with qwen3-embedding-8b, you can send up to 2,048 strings of 32,000 input tokens each, in a single query.
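For example, a single embeddings request can carry an array of inputs. A minimal sketch with the OpenAI Python client is shown below, using the base URL and key conventions from the other examples on this page:

```python
# Minimal sketch: embed several strings in one request instead of one request
# per string, staying within per-request input limits.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.scaleway.ai/v1",
    api_key=os.environ["SCW_SECRET_KEY"],
)

documents = [
    "Generative APIs provide serverless access to AI models.",
    "Dedicated Deployment runs models on dedicated GPU Instances.",
    "Embeddings turn text into vectors for semantic search.",
]

response = client.embeddings.create(
    model="qwen3-embedding-8b",
    input=documents,  # one request, multiple inputs
)

for document, item in zip(documents, response.data):
    print(document[:40], "->", len(item.embedding), "dimensions")
```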
If you have a specific use case that requires higher concurrency limits, we recommend using Generative APIs - Dedicated Deployment, where these limits do not apply, or contacting our support team with details about your use case and expected concurrency requirements.
Do model licenses apply when using Serverless?
Yes, you need to comply with model licenses when using Generative APIs - Serverless. Applicable licenses are available for each model in our documentation and in the console Playground.
Do model licenses apply when using Dedicated Deployment?
Yes, model licenses need to be complied with when using Generative APIs - Dedicated Deployment. Applicable licenses are available for each model in our documentation.
- For models provided in the Scaleway catalog, you need to accept licenses (including potential EULA) before creating any dedicated Generative APIs deployment.
- For custom models you choose to import on Scaleway, you are responsible for complying with model licenses (as with any software you choose to install on a GPU Instance, for example).
Compatibility and integration
Can I use OpenAI libraries and APIs with Scaleway's Generative APIs?
Yes, Scaleway's Generative APIs - Serverless is designed to be compatible with OpenAI libraries and SDKs, including the OpenAI Python client library and LangChain SDKs. This allows for seamless integration with existing workflows. For detailed information, see OpenAI API compatibility documentation.
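As in the sketches earlier on this page, pointing the OpenAI Python client at the Generative APIs endpoint only requires setting the base URL and API key (the model name is taken from the pricing examples above):

```python
# Minimal sketch: call a Scaleway-hosted chat model through the OpenAI Python client.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.scaleway.ai/v1",   # Scaleway Generative APIs endpoint
    api_key=os.environ["SCW_SECRET_KEY"],     # IAM secret key used as the API key
)

completion = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Summarize what a serverless API is in one sentence."}],
)

print(completion.choices[0].message.content)
```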
How can I convert audio files to a supported format?
For audio transcription, supported formats are: flac, mp3, mpeg, mpga, oga, ogg, wav.
For unsupported formats such as m4a, we recommend using third-party libraries or tools to convert them to a supported format, such as ffmpeg or VLC.
For example, you can convert an m4a file to mp3 using ffmpeg with:
```
ffmpeg -i audio-file.m4a audio-file.mp3
```
Where `audio-file.m4a` is your original file.
Can I transcribe audio streams?
Streaming is currently supported only for the transcription output, not for the audio input. As a workaround, you can send small chunks of audio lasting a few seconds each, and activate output streaming with:
```
curl https://api.scaleway.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SCW_SECRET_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F file="@path/to/audio.mp3" \
  -F model="whisper-large-v3" \
  -F stream=true
```
Audio streaming will start as soon as the first 30-second chunk is processed, i.e., after only a few seconds. This is close enough to real time for many user / audio agent interactions.
If you need to stitch together audio transcriptions and avoid word duplication between two segments, you can provide the last few words of a chunk's transcription as a prompt for the next chunk. This will guide model decoding and provide a better output.
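A minimal sketch of this stitching approach is shown below, assuming the audio has already been split into 30-second chunk files (for example with `ffmpeg -f segment`). The chunk file names are hypothetical placeholders; passing the tail of the previous transcription as `prompt` follows the guidance above.

```python
# Minimal sketch: transcribe audio chunk files in order, passing the tail of the
# previous transcription as a prompt to reduce duplicated or broken words at
# chunk boundaries. Chunk file names are hypothetical placeholders.
import os
from pathlib import Path
from openai import OpenAI

client = OpenAI(
    base_url="https://api.scaleway.ai/v1",
    api_key=os.environ["SCW_SECRET_KEY"],
)

chunks = sorted(Path(".").glob("chunk_*.mp3"))  # e.g. produced with `ffmpeg -f segment`
transcript_parts = []
previous_tail = ""

for chunk in chunks:
    with chunk.open("rb") as audio_file:
        result = client.audio.transcriptions.create(
            model="whisper-large-v3",
            file=audio_file,
            prompt=previous_tail,  # guide decoding with the end of the previous chunk
        )
    transcript_parts.append(result.text)
    previous_tail = " ".join(result.text.split()[-10:])  # keep the last ~10 words

print(" ".join(transcript_parts))
```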
Since billing is done per second of audio input for models such as whisper-large-v3, this method does not incur additional costs. Billing is based on the length (in seconds) of each audio file.
Which vector database can I use to store embeddings?
Since the /embeddings API returns a raw list of vector coordinates, all vector databases are by default compatible with this format.
However, some vector databases may only support a maximum number of dimensions below the dimensions returned by a model.
In this case, we recommend using models which support custom numbers of dimensions (also known as Matryoshka embeddings).
As an example, when using the PostgreSQL pgvector extension, we recommend using the qwen3-embedding-8b embedding model with 2,000 dimensions, to ensure compatibility with vector indexes such as hnsw or ivfflat.
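For example, assuming the embeddings endpoint accepts the OpenAI-style `dimensions` parameter for models supporting Matryoshka embeddings (an assumption to verify against the model documentation), a request for 2,000-dimension vectors could look like this:

```python
# Minimal sketch: request reduced-dimension (Matryoshka) embeddings so the
# vectors fit pgvector index limits (e.g. a `vector(2000)` column with hnsw).
# The `dimensions` parameter is an assumption; check the model documentation.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.scaleway.ai/v1",
    api_key=os.environ["SCW_SECRET_KEY"],
)

response = client.embeddings.create(
    model="qwen3-embedding-8b",
    input="Vector databases store embeddings for semantic search.",
    dimensions=2000,  # truncate to 2,000 dimensions for index compatibility
)

vector = response.data[0].embedding
print(len(vector))  # expected: 2000
```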
Can I use Generative APIs - Dedicated Deployment with other Scaleway services?
Absolutely. Generative APIs - Dedicated Deployment integrates seamlessly with other Scaleway services, such as Object Storage for model hosting, Kubernetes for containerized applications, and Scaleway IAM for access management.
Does Generative APIs - Dedicated Deployment support model quantization?
Yes, Scaleway Generative APIs - Dedicated Deployment supports model quantization to optimize performance and reduce inference latency. You can select different quantization options depending on your accuracy and efficiency requirements.
Usage and management
How can I monitor performance?
Generative APIs - Dedicated Deployment metrics and logs are available in Scaleway Cockpit. You can follow your deployment metrics in real-time, such as token throughput, request latency, GPU power usage, and GPU VRAM usage.
Privacy and safety
Where can I find information regarding the data, privacy, and security policies applied to Scaleway's AI services?
You can find detailed information regarding the policies applied to Scaleway's AI services in our dedicated documentation: