Choose from ready-to-serve AI models
What makes inference fast? Among many things, model optimization. That's why Scaleway provides an evolving Model Library, offering curated and quantized models, including LLMs and embedding models.
Serve generative AI models and answer prompts from European end users securely. A drop-in replacement for apps already using OpenAI APIs.
No matter how heavy your usage, you pay the same predictable price for unlimited tokens. That price depends on the dedicated infrastructure serving your model, which is billed per hour.
Maintain complete data control: your prompts and responses are not stored and cannot be accessed by Scaleway or any third parties, keeping your data exclusively yours and within Europe.
A vision-language model that analyzes your images and offers insights without compromising on instruction following. Another fantastic model from Mistral AI, distributed under the Apache 2.0 license.
Llama 3.1 by Meta is the latest iteration of the open-access Llama family, designed for efficient deployment and development on smaller GPUs. Tailored for dynamic dialogue and creative text generation, it excels at complex reasoning and coding tasks. Its Grouped-Query Attention mechanism ensures fast, efficient processing, making it a strong choice for chat applications and beyond.
Llama-3.1-70b-Instruct by Meta is the most powerful Llama model in this library, with 70 billion parameters for state-of-the-art text generation. Ideal for advanced AI applications, it excels in multilingual dialogue, complex reasoning, and intricate coding tasks. Its Grouped-Query Attention mechanism ensures efficient processing, making it the top choice for high-end deployments.
A 12B model with a very large context window of up to 128k tokens, particularly useful for RAG applications. Built by Mistral in collaboration with NVIDIA, this model is distributed under the Apache 2.0 license.
Trained in 2023 on Scaleway's Nabuchodonosor supercomputer, Mixtral-8x7B is a state-of-the-art pretrained generative model built on a Sparse Mixture of Experts architecture. It has been benchmarked to surpass the Llama 2 70B model across a variety of tests.
An advanced embedding model that translates your data into vectors, capturing complex relationships for enhanced information processing. Perfect for setting up a Retrieval-Augmented Generation (RAG) system, as sketched below.
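As an illustration, here is a minimal sketch of embedding documents for a RAG pipeline through a deployment's OpenAI-compatible endpoint, using the official OpenAI Python client. The base URL, model name, and key below are placeholders, not confirmed values; substitute the ones shown in your deployment's console.

```python
# Minimal sketch: embedding documents for a RAG pipeline through the
# OpenAI-compatible endpoint of a Managed Inference deployment.
# Angle-bracket values are placeholders for your own deployment's settings.
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-deployment-endpoint>/v1",  # placeholder: your deployment's endpoint URL
    api_key="<SCW_SECRET_KEY>",                        # placeholder: your IAM API secret key
)

documents = [
    "Managed Inference serves models on dedicated GPU infrastructure.",
    "Embeddings map text to vectors for similarity search.",
]

# Assumed model name for a BGE-Multilingual-Gemma2 deployment.
result = client.embeddings.create(model="bge-multilingual-gemma2", input=documents)

vectors = [item.embedding for item in result.data]
print(f"{len(vectors)} vectors of dimension {len(vectors[0])}")
```

Stored in a vector database, these vectors let you retrieve the nearest neighbors of a query embedding to build the context for your RAG prompts.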
You are charged by the hour for the GPU type you choose.
| Model | Quantization | GPU | Price per hour | Approx. per month |
|---|---|---|---|---|
| Llama3.1-8b-instruct | BF16, FP8 | L4-1-24G | €0.93 | ~€679 |
| Llama3.1-70b-instruct | FP8 | H100-1-80G | €3.40 | ~€2482 |
| Llama3.1-Nemotron-70b-instruct | FP8 | H100-1-80G | €3.40 | ~€2482 |
| Mistral-7b-instruct-v0.3 | BF16 | L4-1-24G | €0.93 | ~€679 |
| Pixtral-12b-2409 | BF16 | H100-1-80G | €3.40 | ~€2482 |
| Mistral-nemo-instruct-2407 | FP8 | H100-1-80G | €3.40 | ~€2482 |
| Mixtral-8x7b-instruct-v0.1 | FP8 | H100-1-80G | €3.40 | ~€2482 |
| BGE-Multilingual-Gemma2 | FP32 | L4-1-24G | €0.93 | ~€679 |
| Qwen2.5-coder-32b-instruct | INT8 | H100-1-80G | €3.40 | ~€2482 |
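For reference, the monthly estimates correspond to the hourly price over roughly 730 hours, the average length of a month: €0.93/hour × 730 hours ≈ €679, and €3.40/hour × 730 hours ≈ €2482.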
More models and conditions available on this page.
Your AI endpoints are accessible through a low-latency, secure connection to your resources hosted at Scaleway, thanks to a resilient regional Private Network.
We make generative AI endpoints compatible with Scaleway's Identity and Access Management, so that your deployments are compliant with your enterprise architecture requirements.
Identify bottlenecks in your deployments, view inference requests in real time, and even report your energy consumption with a fully managed observability solution.
You'll find a comprehensive getting-started guide here, including details on deployment, security, and billing.
If you need support, don't hesitate to reach out to us through the dedicated Slack community channel #inference-beta.
Scaleway's AI services implement robust security measures to ensure customer data privacy and integrity. Our measures and policies are published in our documentation.
Scaleway lets you seamlessly transition applications already built on OpenAI. You can use any of the official OpenAI libraries, for example the OpenAI Python client library, to interact with your Scaleway Managed Inference deployments, as the sketch below illustrates. Find the supported APIs and parameters here.
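Here is a minimal sketch of a chat completion against a deployment's OpenAI-compatible endpoint. The base URL, model name, and key are placeholders, not confirmed values; substitute the endpoint URL and model shown in your deployment's console, and authenticate with your Scaleway IAM secret key.

```python
# Minimal sketch: chat completion against a Managed Inference deployment
# through its OpenAI-compatible endpoint.
# Angle-bracket values are placeholders for your own deployment's settings.
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-deployment-endpoint>/v1",  # placeholder: your deployment's endpoint URL
    api_key="<SCW_SECRET_KEY>",                        # placeholder: your IAM API secret key
)

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # assumed: the model name served by your deployment
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain dedicated inference endpoints in one sentence."},
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

Because only the base URL and API key change, existing OpenAI-based code can typically be pointed at a Managed Inference deployment without further modification.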
Managed Inference deploys AI models and creates dedicated endpoints on a secure production infrastructure.
Alternatively, Scaleway has a selection of hosted models in its datacenters, priced per million tokens consumed, available via API. Find all details on the Generative APIs page.
Tell us the good and the bad about your experience here. Thank you for your time!