
Choosing the right GPU Instance type

Reviewed on 22 April 2025 • Published on 31 August 2022

A GPU Instance refers to a virtual computing environment provided by Scaleway that offers access to powerful Graphics Processing Units (GPUs) over the internet. GPUs are specialized hardware originally designed for rendering graphics in video games and other 3D applications. However, their massively parallel architecture makes them ideal for various high-performance computing tasks, such as deep learning, massive machine learning, data processing, scientific simulations, and more.

The availability of Scaleway GPU Instances lets researchers, developers, and organizations train complex machine-learning models faster and more efficiently. It gives European AI startups the tools to build products that change how we work and live, without the need for a huge CAPEX investment.

How to choose the right GPU Instance type

Scaleway provides a range of GPU Instance offers, from GPU RENDER Instances and H100 SXM Instances to custom-built clusters. There are several factors to consider when choosing a GPU Instance type so that it meets your performance, budget, and scalability requirements. The guide below will help you make an informed decision:

Workload requirements: Identify the nature of your workload. Are you running machine learning, deep learning, high-performance computing (HPC), data analytics, or graphics-intensive applications? Different Instance types are optimized for different types of workloads. For example, the H100 is not designed for graphics rendering, whereas other models are. As stated by Tim Dettmers, “Tensor Cores are most important, followed by the memory bandwidth of a GPU, the cache hierarchy, and only then FLOPS of a GPU.” For more information, refer to the NVIDIA GPU portfolio.
Performance requirements: Evaluate the performance specifications you need, such as the number of GPUs, GPU memory, processing power, and network bandwidth. Demanding tasks, like training large deep learning models, need a lot of memory and fast storage.
GPU type: Scaleway offers different GPU types, such as various NVIDIA GPUs. Each GPU has varying levels of performance, memory, and capabilities. Choose a GPU that aligns with your specific workload requirements.
GPU memory: GPU memory bandwidth is an important criterion influencing overall performance. Beyond bandwidth, a larger GPU memory (VRAM) is crucial for memory-intensive tasks like training larger deep learning models, especially when using larger batch sizes. Modern GPUs offer specialized data formats designed to optimize deep learning performance. These formats, including Bfloat16, FP8, int8 and int4, enable the storage of more data in memory and can enhance performance (for example, moving from FP16 to FP8 can double the number of TFLOPS). Because support for these formats depends on the GPU generation, it is crucial to select the appropriate architecture: options range from Pascal and Ampere to Ada Lovelace and Hopper. Make sure the GPU has enough memory capacity for your specific workload to prevent memory-related bottlenecks. Equally important is matching the GPU’s memory type to the nature of your workload.
CPU and RAM: A powerful CPU can be beneficial for tasks that involve preprocessing or post-processing. Sufficient system memory is also crucial to prevent memory-related bottlenecks or to cache your data in RAM.
GPU driver and software compatibility: Ensure that the GPU Instance type you choose supports the GPU drivers and software frameworks you need for your workload. This includes CUDA libraries, machine learning frameworks (TensorFlow, PyTorch, etc.), and other specific software tools. For all Scaleway GPU OS images, we offer a driver version that enables the use of all GPUs, from the oldest to the latest models. Both the NGC CLI and nvidia-docker are preinstalled, enabling containers to be used with CUDA, cuDNN, and the main deep learning frameworks.
Scaling: Consider the scalability requirements of your workload. The most efficient way to scale up your workload is by using:
- A bigger GPU
- Up to 2 PCIe GPUs with H100 Instances, or up to 8 PCIe GPUs with L4 or L40S Instances
- Or better, an HGX-based server setup with up to 8 NVLink-connected GPUs with H100-SXM Instances
- A supercomputer architecture for larger setups and the most demanding workloads
- Another way to scale your workload is to use Kubernetes and MIG: you can divide a single H100 or H100-SXM GPU into as many as 7 MIG partitions. This means that instead of employing seven P100 GPUs to set up seven Kubernetes pods, you could opt for a single H100 GPU with MIG to deploy all seven pods.
Online resources: Check for online resources, forums, and community discussions related to the specific GPU type you are considering. This can provide insights into common issues, best practices, and optimizations.
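
The GPU memory criterion above can be turned into a quick back-of-the-envelope check. The sketch below is a minimal illustration, not a sizing tool: it estimates the memory taken by model weights alone for the data formats mentioned above, and it assumes a 20% headroom for activations and caches. The function names and the headroom value are hypothetical choices made for this example, and real usage also depends on batch size, context length, and optimizer state.

```python
# Bytes needed to store one parameter in each data format.
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "bf16": 2.0,
    "fp8": 1.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weights_size_gb(params_billions: float, fmt: str) -> float:
    """Return the memory taken by the model weights alone, in GB (10^9 bytes)."""
    # One billion parameters at one byte each is 1 GB, so the units cancel.
    return params_billions * BYTES_PER_PARAM[fmt]


def fits_in_vram(params_billions: float, fmt: str, vram_gb: float,
                 headroom: float = 0.2) -> bool:
    """Check whether the weights fit while keeping a fraction of VRAM free.

    `headroom` is an assumed safety margin, not a measured value.
    """
    return weights_size_gb(params_billions, fmt) <= vram_gb * (1.0 - headroom)


if __name__ == "__main__":
    for fmt in ("fp16", "fp8", "int4"):
        print(f"7B model in {fmt}: {weights_size_gb(7, fmt):.1f} GB of weights")
    # 14 GB of FP16 weights against a 24 GB GPU with 20% kept free.
    print(fits_in_vram(7, "fp16", 24))
    # 140 GB of FP16 weights against a single 80 GB GPU.
    print(fits_in_vram(70, "fp16", 80))
```

Halving the bytes per parameter halves the weight footprint, which is why a lower-precision format can move a model onto a smaller GPU or a smaller number of GPUs.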

Remember that there is no one-size-fits-all answer, and the right GPU Instance type will depend on your workload’s unique requirements and budget. It is important that you regularly reassess your choice as your workload evolves. Depending on which type best fits your evolving tasks, you can easily migrate from one GPU Instance type to another.
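
When reassessing or migrating, it can help to confirm which GPUs an Instance actually exposes to your software. The sketch below is one possible way to do so, assuming the `nvidia-smi` tool that ships with the NVIDIA driver is on the `PATH`. The helper names `parse_gpu_list` and `list_gpus` are hypothetical and not part of any Scaleway or NVIDIA API.

```python
import shutil
import subprocess


def parse_gpu_list(csv_text: str) -> list[tuple[str, int]]:
    """Parse `name, memory.total` CSV rows into (name, memory in MiB) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        # Split on the last comma so a GPU name containing commas stays whole.
        name, memory_mib = (field.strip() for field in line.rsplit(",", 1))
        gpus.append((name, int(memory_mib)))
    return gpus


def list_gpus() -> list[tuple[str, int]]:
    """Return the GPUs reported by nvidia-smi, or an empty list if it is absent."""
    if shutil.which("nvidia-smi") is None:
        return []  # No NVIDIA driver tools found on this machine.
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_list(result.stdout)


if __name__ == "__main__":
    for name, memory_mib in list_gpus():
        print(f"{name}: {memory_mib} MiB")
```

Keeping the parsing separate from the `nvidia-smi` call makes the logic easy to test on a machine without a GPU.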

GPU Instances and AI Supercomputer comparison table

Scaleway GPU Instances types overview

	RENDER-S	H100-1-80G	H100-2-80G
GPU Type	1x P100 PCIe3	1x H100 PCIe5	2x H100 PCIe5
NVIDIA architecture	Pascal 2016	Hopper 2022	Hopper 2022
Tensor Cores	N/A	Yes	Yes
Performance (training in FP16 Tensor Cores)	(No Tensor Cores: 9.3 TFLOPS FP32)	1513 TFLOPS	2x 1513 TFLOPS
VRAM	16 GB CoWoS HBM2 (Memory bandwidth: 732 GB/s)	80 GB HBM2E (Memory bandwidth: 2 TB/s)	2x 80 GB HBM2E (Memory bandwidth: 2 TB/s)
CPU Type	Intel Xeon Gold 6148 (2.4 GHz)	AMD EPYC™ 9334 (2.7GHz)	AMD EPYC™ 9334 (2.7GHz)
vCPUs	10	24	48
RAM	42 GB DDR3	240 GB DDR5	480 GB DDR5
Storage	Block/Local	Block	Block
Scratch Storage	No	Yes (3 TB NVMe)	Yes (6 TB NVMe)
MIG compatibility	No	Yes	Yes
Bandwidth	1 Gbps	10 Gbps	20 Gbps
Better used for	Image / Video encoding (4K)	7B LLM Fine-Tuning / Inference	70B LLM Fine-Tuning / Inference
What they are not made for	Large models (especially LLM)	Graphic or video encoding use cases	Graphic or video encoding use cases

	H100-SXM-2-80G	H100-SXM-4-80G	H100-SXM-8-80G
GPU Type	2x H100 SXM	4x H100 SXM	8x H100 SXM
NVIDIA architecture	Hopper 2022	Hopper 2022	Hopper 2022
Tensor Cores	Yes	Yes	Yes
Performance (training in FP16 Tensor Cores)	2x 1979 TFLOPS	4x 1979 TFLOPS	8x 1979 TFLOPS
VRAM	2x 80 GB HBM3 (Memory bandwidth: 3.35 TB/s)	4x 80 GB HBM3 (Memory bandwidth: 3.35 TB/s)	8x 80 GB HBM3 (Memory bandwidth: 3.35 TB/s)
CPU Type	Xeon Platinum 8452Y (2.0 GHz)	Xeon Platinum 8452Y (2.0 GHz)	Xeon Platinum 8452Y (2.0 GHz)
vCPUs	32	64	128
RAM	240 GB DDR5	480 GB DDR5	960 GB DDR5
Storage	Boot on Block 5K	Boot on Block 5K	Boot on Block 5K
Scratch Storage	Yes (~3 TB)	Yes (~6 TB)	Yes (~12 TB)
MIG compatibility	Yes	Yes	Yes
Bandwidth	20 Gbps	20 Gbps	20 Gbps
Network technology	NVLink	NVLink	NVLink
Better used for	LLM fine-tuning, LLM inference with lower quantization and/or larger parameter counts, fast computer vision model training	LLM fine-tuning, LLM inference with lower quantization and/or larger parameter counts, fast computer vision model training	Llama 4 or DeepSeek R1 inference
What they are not made for	Training of LLM (single node), graphic or video encoding use cases	Training of LLM (single node), graphic or video encoding use cases	Training of LLM (single node), graphic or video encoding use cases

	L4-1-24G	L4-2-24G	L4-4-24G	L4-8-24G
GPU Type	1x L4 PCIe4	2x L4 PCIe4	4x L4 PCIe4	8x L4 PCIe4
NVIDIA architecture	Lovelace 2022	Lovelace 2022	Lovelace 2022	Lovelace 2022
Tensor Cores	Yes	Yes	Yes	Yes
Performance (training in FP16 Tensor Cores)	242 TFLOPS	2x 242 TFLOPS	4x 242 TFLOPS	8x 242 TFLOPS
VRAM	24 GB RAM GDDR6 (Memory bandwidth: 300 GB/s)	2x 24 GB RAM GDDR6 (Memory bandwidth: 300 GB/s)	4x 24 GB RAM GDDR6 (Memory bandwidth: 300 GB/s)	8x 24 GB RAM GDDR6 (Memory bandwidth: 300 GB/s)
CPU Type	AMD EPYC™ 7413 (2.65GHz)	AMD EPYC™ 7413 (2.65GHz)	AMD EPYC™ 7413 (2.65GHz)	AMD EPYC™ 7413 (2.65GHz)
vCPUs	8	16	32	64
RAM	48 GB DDR4	96 GB DDR4	192 GB DDR4	384 GB DDR4
Storage	Block	Block	Block	Block
Scratch Storage	No	No	No	No
MIG compatibility	No	No	No	No
Bandwidth	2.5 Gbps	5 Gbps	10 Gbps	20 Gbps
Better used for	Image Encoding (8K)	Video Encoding (8K)	7B LLM Inference	70B LLM Inference
What they are not made for	Training of LLM	Training of LLM	Training of LLM	Training of LLM

	L40S-1-48G	L40S-2-48G	L40S-4-48G	L40S-8-48G
GPU Type	1x L40S 48GB PCIe	2x L40S 48GB PCIe	4x L40S 48GB PCIe	8x L40S 48GB PCIe
NVIDIA architecture	Lovelace 2022	Lovelace 2022	Lovelace 2022	Lovelace 2022
Tensor Cores	Yes	Yes	Yes	Yes
Performance (training in FP16 Tensor Cores)	362 TFLOPS	724 TFLOPS	1448 TFLOPS	2896 TFLOPS
VRAM	48 GB GDDR6 (Memory bandwidth: 864 GB/s)	2x 48 GB = 96 GB GDDR6 (Memory bandwidth: 864 GB/s)	4x 48 GB = 192 GB GDDR6 (Memory bandwidth: 864 GB/s)	8x 48 GB = 384 GB GDDR6 (Memory bandwidth: 864 GB/s)
CPU Type	AMD EPYC™ 7413 (2.65GHz)	AMD EPYC™ 7413 (2.65GHz)	AMD EPYC™ 7413 (2.65GHz)	AMD EPYC™ 7413 (2.65GHz)
vCPUs	8	16	32	64
RAM	96 GB DDR4	192 GB DDR4	384 GB DDR4	768 GB DDR4
Storage	Block	Block	Block	Block
Scratch Storage	Yes (~1.6 TB NVMe)	Yes (~3.2 TB NVMe)	Yes (~6.4 TB NVMe)	Yes (~12.8 TB NVMe)
MIG compatibility	No	No	No	No
Bandwidth	2.5 Gbps	5 Gbps	10 Gbps	20 Gbps
Use cases	GenAI (Image/Video)	GenAI (Image/Video)	7B Text-to-image model fine-tuning / Inference	70B text-to-image model fine-tuning / Inference
What they are not made for
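
The tables above can also be read programmatically. As a rough illustration, the sketch below copies the GPU count and per-GPU VRAM of each offer from the tables and returns the offer with the least total VRAM that still covers a given requirement. It deliberately ignores every other criterion discussed in this guide (bandwidth, Tensor Cores, interconnect, price, intended use), so treat its answer as a first filter only. The function name `smallest_offer` is hypothetical.

```python
from typing import Optional

# (offer name, number of GPUs, VRAM per GPU in GB), copied from the tables above.
OFFERS = [
    ("RENDER-S", 1, 16),
    ("H100-1-80G", 1, 80),
    ("H100-2-80G", 2, 80),
    ("H100-SXM-2-80G", 2, 80),
    ("H100-SXM-4-80G", 4, 80),
    ("H100-SXM-8-80G", 8, 80),
    ("L4-1-24G", 1, 24),
    ("L4-2-24G", 2, 24),
    ("L4-4-24G", 4, 24),
    ("L4-8-24G", 8, 24),
    ("L40S-1-48G", 1, 48),
    ("L40S-2-48G", 2, 48),
    ("L40S-4-48G", 4, 48),
    ("L40S-8-48G", 8, 48),
]


def smallest_offer(required_vram_gb: float) -> Optional[str]:
    """Return the offer with the least total VRAM that covers the requirement."""
    candidates = [
        (count * per_gpu_gb, count, name)
        for name, count, per_gpu_gb in OFFERS
        if count * per_gpu_gb >= required_vram_gb
    ]
    if not candidates:
        return None  # No single Instance offers that much VRAM.
    # Tuples compare by total VRAM first, then GPU count, then name, so an
    # offer with fewer, larger GPUs wins when totals are equal.
    return min(candidates)[2]


if __name__ == "__main__":
    for need_gb in (20, 90, 700):
        print(need_gb, "GB ->", smallest_offer(need_gb))
```

Preferring fewer, larger GPUs on a tie is a design choice of this sketch: it avoids splitting a model across more devices than necessary.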

Scaleway AI Supercomputer

	Custom-built clusters (2 DGX H100, 16 H100 GPUs)	Custom-built clusters (127 DGX H100, 1016 H100 GPUs)
GPU Type	16x H100 (SXM5)	1,016x H100 (SXM5)
NVIDIA architecture	Hopper 2022	Hopper 2022
Tensor Cores	Yes	Yes
Performance in PFLOPS (FP8 Tensor Core)	Up to 63.2 PFLOPS	Up to 4,021.3 PFLOPS
VRAM	1,280 GB (total cluster)	81,280 GB (total cluster)
CPU Type	Dual Intel® Xeon® Platinum 8480C Processors (3.8 GHz)	Dual Intel® Xeon® Platinum 8480C Processors (3.8 GHz)
Total CPU cores	224 cores (total cluster)	14,224 cores (total cluster)
RAM	4 TB (total cluster)	254 TB (total cluster)
Storage	64 TB of a3i DDN low latency storage	1.8 PB of a3i DDN low latency storage
MIG compatibility	Yes	Yes
Inter-GPU bandwidth	InfiniBand 400 Gb/s	InfiniBand 400 Gb/s
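
The cluster totals in this table scale linearly with the number of DGX H100 nodes. The sketch below reproduces them from per-node figures that are inferred by dividing the totals of the 2-node cluster by two. These per-node constants are derived for illustration and are not stated as such in the table.

```python
# Per-node figures inferred from the 2-node column of the table above.
GPUS_PER_NODE = 8          # 16 GPUs / 2 nodes
VRAM_PER_GPU_GB = 80       # 1,280 GB / 16 GPUs
CPU_CORES_PER_NODE = 112   # 224 cores / 2 nodes
RAM_PER_NODE_TB = 2        # 4 TB / 2 nodes


def cluster_totals(nodes: int) -> dict[str, int]:
    """Return aggregate GPU, VRAM, CPU core and RAM figures for a cluster."""
    gpus = nodes * GPUS_PER_NODE
    return {
        "gpus": gpus,
        "vram_gb": gpus * VRAM_PER_GPU_GB,
        "cpu_cores": nodes * CPU_CORES_PER_NODE,
        "ram_tb": nodes * RAM_PER_NODE_TB,
    }


if __name__ == "__main__":
    for nodes in (2, 127):
        print(nodes, "nodes:", cluster_totals(nodes))
```

For 127 nodes this gives 1,016 GPUs, 81,280 GB of VRAM, 14,224 CPU cores and 254 TB of RAM, which matches the second column.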

NVIDIA GH200 Superchip

	GH200 Grace Hopper™
GPU Type	NVIDIA GH200 Grace Hopper™ Superchip
NVIDIA architecture	GH200 Grace Hopper™ Architecture
Performance	990 TFLOPS (in FP16 Tensor Core)
Specifications	- GH200 Superchip with 72 ARM Neoverse V2 cores - 480 GB of LPDDR5X DRAM - 96 GB of HBM3 GPU memory (Memory is fully merged for up to 576 GB of global usable memory)
MIG compatibility	Yes
Inter-GPU bandwidth (for clusters up to 256 GH200)	NVLink Switch System 900 GB/s
Format & Features	Single chip up to GH200 clusters. (For larger setup needs, contact us)
Use cases	- Extra large LLM and DL model inference - HPC
What they are not made for	- Graphics - (Training)
