Choose from ready-to-serve AI models
What makes inference fast? Among many things, model optimization. That's why Scaleway provides an evolving Model Library, offering curated and optimized models, including LLMs and embedding models.
Serve generative AI models and answer prompts from European end users securely. A drop-in replacement for apps built on OpenAI APIs.
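Because the endpoints follow the OpenAI API shape, switching an existing app over is mostly a matter of changing the base URL and the API key. The sketch below builds such a chat-completion request with the standard library only; the endpoint URL, API key, and model name are placeholders, not real values from the source, so substitute the ones from your own deployment.

```python
import json
from urllib import request

# Placeholder values: replace with your own deployment's endpoint and key.
ENDPOINT = "https://example-deployment.example.com/v1/chat/completions"
API_KEY = "YOUR_SCALEWAY_API_KEY"

def build_chat_request(prompt: str, model: str = "llama-3-8b-instruct") -> request.Request:
    """Build an OpenAI-style chat-completion request for the endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request("Hello!")
# To actually send it (requires a live deployment):
# response = request.urlopen(req)
```

The same request could be issued with any OpenAI-compatible client library by pointing its base URL at the deployment instead of api.openai.com.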
However large your usage, you pay the same predictable price for unlimited tokens. This price depends on the dedicated infrastructure that serves your model, which is billed per hour.
Maintain complete data control: your prompts and responses are not stored and cannot be accessed by Scaleway or any third parties, keeping your data exclusively yours and within Europe.
Llama 3 by Meta is the latest iteration of the open-access Llama family, designed for efficient deployment and development on smaller GPUs. Llama models are tailored for dynamic dialogues and creative text generation. Engineered for efficiency and scalability, Llama 3 excels at complex reasoning and coding tasks, and its Grouped-Query Attention mechanism ensures efficient processing, making it a strong choice for chat applications and beyond.
Llama-3-70b-Instruct by Meta is the most powerful model in the Llama family, boasting 70 billion parameters for unmatched text generation. Ideal for advanced AI applications, it excels in dynamic dialogues, complex reasoning, and intricate coding tasks. Its Grouped-Query Attention mechanism ensures efficient processing, making it the top choice for high-end deployment.
A lightweight dense model released by Mistral AI, matching the capabilities of models up to 30B parameters. Version 0.3 has an extended context window and is distributed under the Apache 2.0 license.
Trained on Scaleway's Nabuchodonosor 2023, Mixtral-8x7B is a state-of-the-art pretrained generative model built on a Sparse Mixture of Experts architecture. It has been benchmarked to surpass the performance of the Llama 2 70B model across a variety of tests.
An advanced embedding model that translates data into vectors, capturing complex relationships for enhanced information processing. Perfect to set up your Retrieval-Augmented Generation (RAG) system.
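In a RAG setup, the embedding model maps both documents and queries to vectors, and retrieval reduces to a nearest-neighbor search, typically by cosine similarity. The toy vectors below stand in for real sentence-t5-xxl embeddings (which are much higher-dimensional); the document names and values are illustrative only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors standing in for real embeddings.
docs = {
    "gpu pricing": [0.9, 0.1, 0.0],
    "private network": [0.1, 0.8, 0.3],
}
query = [0.85, 0.15, 0.05]

# Retrieve the document whose embedding is closest to the query's.
best = max(docs, key=lambda name: cosine_similarity(query, docs[name]))
# best -> "gpu pricing"
```

The retrieved document would then be injected into the LLM prompt as grounding context, which is the core of the RAG pattern.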
You are charged for usage of the GPU type you choose.
| Model | Quantization | GPU | Price | Approx. per month |
|---|---|---|---|---|
| Llama3-8b-instruct | BF16 | L4-1-24G | €0.93/hour | ~€679/month |
| Llama3-70b-instruct | INT8 | H100-1-80G | €3.40/hour | ~€2482/month |
| Mistral-7b-instruct-v0.3 | BF16 | L4-1-24G | €0.93/hour | ~€679/month |
| Mixtral-8x7b-instruct-v0.1 | INT8 | H100-1-80G | €3.40/hour | ~€2482/month |
| Sentence-t5-xxl | FP32 | L4-1-24G | €0.93/hour | ~€679/month |
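The monthly figures in the table follow from the hourly rates, assuming an average month of about 730 hours (8760 hours per year divided by 12); that averaging convention is an assumption, not stated in the source.

```python
# Assumed average hours in a month: 8760 hours per year / 12 months.
HOURS_PER_MONTH = 730

hourly_prices = {"L4-1-24G": 0.93, "H100-1-80G": 3.40}
monthly = {gpu: round(price * HOURS_PER_MONTH) for gpu, price in hourly_prices.items()}
# monthly -> {"L4-1-24G": 679, "H100-1-80G": 2482}
```

This matches the ~€679/month and ~€2482/month figures quoted above.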
More models and conditions available on this page.
Your AI endpoints are accessible through a low-latency, secure connection to your resources hosted at Scaleway, thanks to a resilient regional Private Network.
We make generative AI endpoints compatible with Scaleway's Identity and Access Management, so that your deployments comply with your enterprise architecture requirements.
Identify bottlenecks on your deployments, view inference requests in real time and even report your energy consumption with a fully managed observability solution.