Generative APIs

Serve the latest AI models via API, pay by million token

OpenAI-compatible APIs

Easily integrate with existing tools like OpenAI libraries and LangChain SDKs. Our APIs are designed to work out-of-the-box with your existing workflows, including adapters for Retrieval-Augmented Generation (RAG).

Cost-effective usage

Optimize your budget with a pay-per-use model, billed per million tokens. No upfront infrastructure costs or long-term commitments—just flexible pricing ideal for varying workloads or exploratory projects.

Quick model testing

Start serving and testing AI models in just a few minutes. Our streamlined onboarding process and serverless architecture let you deploy endpoints instantly, enabling rapid iteration and minimal setup time.

Towards a sovereign AI where your data remains yours, and only in Europe.

Everything you need to create apps with Generative AI

: With Retrieval-Augmented Generation (RAG) – a technique that involves retrieving data from enterprise data sources– you can enrich your AI model with private and up-to-date information for more relevant and accurate answers.

RAG is easy with Scaleway: embeddings, vector database, Langchain: here's your step by step guide.

: An agent actively performs tasks to achieve a specified outcome. When connected via APIs, it can interact with systems to execute actions. Generative APIs enable models to handle multi-step tasks using organizational data, like answering customer inquiries or processing bookings (thanks to Serverless Functions). An autonomous agent interprets user requests and autonomously triggers APIs and databases to complete tasks.

: Create LLM-based, multimodal assistants (copilot, chatbot etc), that understand user requests, automatically break down tasks, engage in dialogue to gather information, and boost productivity over so many tasks. Your virtual assistant can now translate languages, summarize content, analyze sentiment, answer questions, ect…

: Traditional OCR models struggle with tasks that require understanding both text and visuals, but the multimodal vision-language models (VLMs) available through Scaleway Generative APIs bridge this gap. VLMs are ideal for real-world applications like scanned documents and technical diagrams, making them a powerful toolkit for mixed-content processing.

: Analyze call/video recordings securely in order to identify sentiment, mood, risks, needs. Combined with powerful LLMs the upcoming speech-to-text capability will enable telecom giants to improve quality of services while providing agents with highly valuable insights.

Models' prices

Enjoy a free tier of 1,000,000 tokens. Every new customer gets 1,000,000 free tokens—start paying only from the 1,000,001st token.

Model	Type	Input tokens	Output tokens
llama-3.1-8b-instruct	Text generation	€0.20^{/million tokens}	€0.20^{/million tokens}
llama-3.1-70b-instruct	Text generation	€0.90^{/million tokens}	€0.90^{/million tokens}
llama-3.3-70b-instruct	Text generation	€0.90^{/million tokens}	€0.90^{/million tokens}
mistral-nemo-instruct-2407	Text generation	€0.20^{/million tokens}	€0.20^{/million tokens}
qwen2.5-coder-32b-instruct	Code Generation	€0.90^{/million tokens}	€0.90^{/million tokens}
pixtral-12b-2409	Image analysis	€0.20^{/million tokens}	€0.20^{/million tokens}
bge-multilingual-gemma2	Embedding	€0.20^{/million tokens}	N/A
deepseek-r1-distill-llama-70b	Text Generation	€0.90^{/million tokens}	€0.90^{/million tokens}

Instantly access popular models from leading AI labs

Get started with Generative APIs

Exceptional developer experience meets best-in-class AI

Competitive pricing

Scaleway offers a competitive playground allowing you to quickly experiment with different AI models. Once satisfied with the responses, simply export the payload and replicate at scale!

Check prices

Open weight FTW

Scaleway supports the distribution of cutting-edge open-weight models, whose performance in reasoning and features now rivals that of proprietary models like GPTx or Claude.

Find supported models

Low latency

End-users in Europe will benefit from response time below 200ms to get the first tokens streamed, ideal for interactive dialog and agentic workflows even at high context lengths.

Send your first API request

Structured outputs

Our built-in JSON mode or JSON schema can distill and transform the diverse unstructured outputs of LLMs into actionable, reliable, machine-readable structured data.

How to use structured outputs

Native function calling

Generative AI models served at Scaleway can connect to external tools through Serverless Functions. Integrate LLMs with custom functions or APIs, and easily build applications able to interface with external systems.

How to use function calling

Secured for production

Scaleway's inference stack runs on highly secure, reliable infrastructure in Europe. Designed to enable your prototypes and run your production, this complete stack Managed Inference complements Generative APIs for use cases requiring guaranteed throughput as it offers a dedicated infrastructure.

Read our security measures

Find more details in the docs

Designed as drop-in replacement for the OpenAI APIs

# Import modules
from openai import OpenAI
import os

# Initialize the OpenAI client using Scaleway
client = OpenAI(
    api_key=os.environ.get("SCW_API_KEY"),
    base_url='https://api.scaleway.ai/v1' 
)

# Create a chat completion request
completion = client.chat.completions.create(
    messages=[
        {
            'role': 'user',
            'content': 'Sing me a song about Xavier Niel'
        }
    ],
    model='mistral-nemo-instruct-2407'
)

Get started with tutorials

How to query text modelsUsing the Chat API for generating and manipulating conversations.
How to use embeddingsUsing the Embeddings API for generating vector representations based on your data.
How to build your first RAG applicationStep by step Retrieval-Augmented Generation (RAG) with LangChain and Scaleway Generative APIs
How to process images with a vision modelGetting structured outputs with Pixtral
How to implement function calling in your applicationsBuilding a flight assistant with function calling on open-weight Llama 3.1
User-friendly interface to put an end to shadow ITDeploying Open WebUI to leverage powerful AI models in a user-friendly, self-hosted interface

Tutorials

Frequently asked questions

What is Scaleway Generative APIs?

Generative APIs is Scaleway's fully managed service that makes frontier AI models from leading research labs available via a simple API call.

How can I get access to Scaleway Generative APIs?

Access to this service is open to all Scaleway customers. You can begin using it via Scaleway's console playground or via API right away, see the quickstart guide here.
If you need support, don't hesitate to reach out to us through the dedicated slack community #ai

What is the pricing of Scaleway Generative APIs?

This service is totally free while in beta. Once in general availability stage, Generative APIs will be with a "pay-as-you-go" pricing, or "pay per tokens" since your consumption will be charged per 1M tokens in/out.

Where are Scaleway's inference servers located?

We currently host all models in a secure datacenter located in France, Paris only. This may change in the future.

Can I use the OpenAI libraries and APIs?

Scaleway lets you seamlessly transition applications already utilizing OpenAI. You can use any of the OpenAI official libraries, for example the OpenAI Python client library or Azure OpenAI sdk, to interact with your Scaleway Generative APIs. Find here the APIs and parameters supported.

What is the difference with Scaleway Managed Inference?

Scaleway Generative APIs is a serverless service. This is most likely the easiest way to get started: We have set up the hardware, so you only pay per token/file and don’t wait for boot-ups.
Scaleway Managed Inference on the other hand is meant to deploy curated models or your own models, with the quantization and instances of your choice. You will get predictable throughput, as as well as custom security: isolation in your private network, access control…

Both AI services offer text and multi-modal (image understanding) models, OpenAI compatibility and important capabilities like structured outputs.

What are the rate limit and the quotas?

Any model served through Scaleway Generative APIs gets limited by:

Tokens per minute
Queries per minute

Set up your credit card and pass the KYC process to benefit from the official rate limits.
Read the dedicated documentation to know more.

RAG

Autonomous Agents

LLM-based Assistant

Boosted OCR

Audio transcription (soon)