Managed Inference

Serve Generative AI models and answer prompts from European end-consumers securely. A drop-in replacement for all apps using OpenAI APIs.

Choose among ready-to-be-served AI models

What makes inference fast? Among many things, model optimization. That's why Scaleway is providing an evolutionary Model Library, offering curated and optimized models, including LLMs and embeddings.

Enjoy unlimited tokens at predictable price

No matter how big your usage is, you pay the same —predictable— price for unlimited tokens. This price depends on the dedicated infrastructure that serves your model, which is billed per hour.

Run on a fully secured European Cloud

Maintain complete data control: your prompts and responses are not stored and cannot be accessed by Scaleway or any third parties, keeping your data exclusively yours and within Europe.

Available zones:
Paris:PAR 2

Open-weights language and embedding models

Llama-3-8b-instruct

Llama 3 by Meta is the latest iteration of the open-access Llama family, for efficient deployment and development on smaller GPUs. Llama models are tailored for dynamic dialogues and creative text generation. Engineered with the latest in efficiency and scalability, it excels in complex reasoning and coding tasks. Its advanced Grouped-Query Attention mechanism ensures unparalleled processing prowess, making it the ultimate tool for chat applications and beyond.

Predictable pricing

Pick among on-the-shelf optimized models, and get a dedicated inference endpoint right away.

You are charged for usage of the GPU type you choose.


ModelQuantizationGPUPriceApprox. per month
Llama3-8b-instructBF16L4-1-24G€0.93/hour~€679/month
Llama3-70b-instructINT8H100-1-80G€3.40/hour~€2482/month
Mistral-7b-instruct-v0.3BF16L4-1-24G€0.93/hour~€679/month
Mixtral-8x7b-instruct-v0.1INT8H100-1-80G€3.40/hour~€2482/month
Sentence-t5-xxlFP32L4-1-24G€0.93/hour~€679/month



More models and conditions available on this page.

Benefit from a secured European Cloud ecosystem

Virtual Private Cloud

Your AI endpoints are accessible through low-latency and secure connection to your resources hosted at Scaleway, thanks to a resilient regional Private Network.

Learn more

Access Management

We make generative AI endpoints compatible with Scaleway's Identity and Access Management, so that your deployments are compliant with your enterprise architecture requirements.

Learn more

Cockpit

Identify bottlenecks on your deployments, view inference requests in real time and even report your energy consumption with a fully managed observability solution.

Learn more