Choosing the right GPU Instance type
A GPU Instance refers to a virtual computing environment provided by Scaleway that offers access to powerful Graphics Processing Units (GPUs) over the internet. GPUs are specialized hardware originally designed for rendering graphics in video games and other 3D applications. However, their massively parallel architecture makes them ideal for various high-performance computing tasks, such as deep learning, massive machine learning, data processing, scientific simulations, and more.
Scaleway GPU Instances’ availability has revolutionized how researchers, developers, and organizations train complex machine-learning models faster and more efficiently. It empowers European AI startups, giving them the tools (without the need for a huge CAPEX investment) to create products that revolutionize how we work and live.
How to choose the right GPU Instance type
Scaleway provides a range of GPU Instance offers, from GPU RENDER Instances and H100 PCIe Instances to custom-built clusters. There are several factors to consider when choosing a GPU Instance type so that it meets your performance, budget, and scalability requirements. The guide below will help you make an informed decision:
- Workload requirements: Identify the nature of your workload. Are you running machine learning, deep learning, high-performance computing (HPC), data analytics, or graphics-intensive applications? Different Instance types are optimized for different types of workloads. For example, the H100 is not designed for graphics rendering, whereas other models are. As stated by Tim Dettmers, “Tensor Cores are most important, followed by the memory bandwidth of a GPU, the cache hierarchy, and only then FLOPS of a GPU.” For more information, refer to the NVIDIA GPU portfolio.
- Performance requirements: Evaluate the performance specifications you need, such as the number of GPUs, GPU memory, processing power, and network bandwidth. Demanding tasks such as training large deep learning models need a lot of memory and fast storage.
- GPU type: Scaleway offers different GPU types, such as various NVIDIA GPUs. Each GPU has varying levels of performance, memory, and capabilities. Choose a GPU that aligns with your specific workload requirements.
- GPU memory: GPU memory bandwidth is an important criterion influencing overall performance. Beyond bandwidth, larger GPU memory (VRAM) is crucial for memory-intensive tasks like training larger deep learning models, especially when using larger batch sizes. Modern GPUs offer specialized data formats designed to optimize deep learning performance. These formats, including Bfloat16, FP8, int8, and int4, store more data in the same amount of memory and can enhance performance (for example, moving from FP16 to FP8 can double the number of TFLOPS). Because format support depends on the GPU generation, it is crucial to select the appropriate architecture. Options range from Pascal and Ampere to Ada Lovelace and Hopper. Make sure the GPU has enough memory capacity for your specific workload to prevent memory-related bottlenecks. It is equally important to match the GPU’s memory type to the nature of your workload.
- CPU and RAM: A powerful CPU can be beneficial for tasks that involve preprocessing or post-processing. Sufficient system memory is also crucial to prevent memory-related bottlenecks or to cache your data in RAM.
- GPU driver and software compatibility: Ensure that the GPU Instance type you choose supports the GPU drivers and software frameworks you need for your workload. This includes CUDA libraries, machine learning frameworks (TensorFlow, PyTorch, etc.), and other specific software tools. For all Scaleway GPU OS images, we offer a driver version that enables the use of all GPUs, from the oldest to the latest models. As with the NGC CLI, nvidia-docker is preinstalled, enabling containers to be used with CUDA, cuDNN, and the main deep learning frameworks.
- Scaling: Consider the scalability requirements of your workload. The most efficient way to scale up your workload is by using:
- Bigger GPU
- Up to 2 PCIe GPUs with H100 Instances, or 8 PCIe GPUs with L4 or L40S Instances
- An HGX-based server setup with 8x NVLink GPUs
- A supercomputer architecture for a larger setup for workload-intensive tasks
- Another way to scale your workload is to use Kubernetes and MIG: You can divide a single H100 GPU into as many as 7 MIG partitions. This means that instead of employing seven P100 GPUs to set up seven K8S pods, you could opt for a single H100 GPU with MIG to effectively deploy all seven K8S pods.
- Online resources: Check for online resources, forums, and community discussions related to the specific GPU type you are considering. This can provide insights into common issues, best practices, and optimizations.
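The GPU memory point above can be turned into a rough back-of-the-envelope check. The following is a minimal sketch, not a sizing tool: it only counts model weights (activations, optimizer state, and caches are ignored), and the function names and the 1.2 overhead factor are illustrative assumptions rather than Scaleway or NVIDIA figures.

```python
# Bytes needed to store one parameter in each numeric format.
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "bf16": 2.0,
    "fp8": 1.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weights_gb(num_params: float, dtype: str) -> float:
    """Return the size of the weights alone, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def fits_in_vram(num_params: float, dtype: str, vram_gb: float,
                 overhead: float = 1.2) -> bool:
    """Rough check: weights times an assumed overhead factor vs. VRAM.

    The overhead factor is a placeholder assumption; real usage depends
    on batch size, sequence length, and the framework.
    """
    return weights_gb(num_params, dtype) * overhead <= vram_gb


if __name__ == "__main__":
    # Weight footprint of a 7-billion-parameter model per format.
    for dtype in ("fp16", "fp8", "int4"):
        print(dtype, weights_gb(7e9, dtype), "GB")
```

Halving the bytes per parameter halves the weight footprint, which is why the lower-precision formats listed above let a larger model fit on the same GPU.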
Remember that there is no one-size-fits-all answer, and the right GPU Instance type will depend on your workload’s unique requirements and budget. It is important that you regularly reassess your choice as your workload evolves. Depending on which type best fits your evolving tasks, you can easily migrate from one GPU Instance type to another.
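When checking driver and software compatibility on a running Instance, it can help to list the GPUs and driver version the system actually exposes. The sketch below assumes the `nvidia-smi` tool that ships with the NVIDIA driver is on the PATH; the function names are hypothetical, and the parser assumes GPU names contain no commas and that memory is reported as an integer number of MiB.

```python
import shutil
import subprocess

# Query name, total memory (MiB), and driver version as header-less CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,driver_version",
    "--format=csv,noheader,nounits",
]


def parse_gpu_lines(text: str) -> list:
    """Parse the CSV output of the query above into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, mem_mib, driver = [field.strip() for field in line.split(",")]
        gpus.append({"name": name, "memory_mib": int(mem_mib), "driver": driver})
    return gpus


def list_gpus() -> list:
    """Return the GPUs reported by nvidia-smi, or [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return []
    return parse_gpu_lines(result.stdout)


if __name__ == "__main__":
    for gpu in list_gpus():
        print(gpu)
```

An empty result means either that no NVIDIA driver is installed or that the query failed, so it is worth running `nvidia-smi` by hand in that case.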
GPU Instances and AI Supercomputer comparison table
Scaleway GPU Instance types overview
Specification | RENDER-S | H100-1-80G | H100-2-80G |
---|---|---|---|
GPU Type | 1x P100 PCIe3 | 1x H100 PCIe5 | 2x H100 PCIe5 |
NVIDIA architecture | Pascal 2016 | Hopper 2022 | Hopper 2022 |
Tensor Cores | N/A | Yes | Yes |
Performance (training in FP16 Tensor Cores) | No Tensor Cores (9.3 TFLOPS FP32) | 1513 TFLOPS | 2x 1513 TFLOPS |
VRAM | 16 GB CoWoS HBM2 (Memory bandwidth: 732 GB/s) | 80 GB HBM2E (Memory bandwidth: 2 TB/s) | 2x 80 GB HBM2E (Memory bandwidth: 2 TB/s) |
CPU Type | Intel Xeon Gold 6148 (2.4 GHz) | AMD EPYC™ 9334 (2.7GHz) | AMD EPYC™ 9334 (2.7GHz) |
vCPUs | 10 | 24 | 48 |
RAM | 42 GB DDR3 | 240 GB DDR5 | 480 GB DDR5 |
Storage | Block/Local | Block | Block |
Scratch Storage | No | Yes (3 TB NVMe) | Yes (6 TB NVMe) |
MIG compatibility | No | Yes | Yes |
Bandwidth | 1 Gbps | 10 Gbps | 20 Gbps |
Better used for | Image / Video encoding (4K) | 7B LLM Fine-Tuning / Inference | 70B LLM Fine-Tuning / Inference |
What they are not made for | Large models (especially LLM) | Graphic or video encoding use cases | Graphic or video encoding use cases |
Specification | L4-1-24G | L4-2-24G | L4-4-24G | L4-8-24G |
---|---|---|---|---|
GPU Type | 1x L4 PCIe4 | 2x L4 PCIe4 | 4x L4 PCIe4 | 8x L4 PCIe4 |
NVIDIA architecture | Lovelace 2022 | Lovelace 2022 | Lovelace 2022 | Lovelace 2022 |
Tensor Cores | Yes | Yes | Yes | Yes |
Performance (training in FP16 Tensor Cores) | 242 TFLOPS | 2x 242 TFLOPS | 4x 242 TFLOPS | 8x 242 TFLOPS |
VRAM | 24 GB RAM GDDR6 (Memory bandwidth: 300 GB/s) | 2x 24 GB RAM GDDR6 (Memory bandwidth: 300 GB/s) | 4x 24 GB RAM GDDR6 (Memory bandwidth: 300 GB/s) | 8x 24 GB RAM GDDR6 (Memory bandwidth: 300 GB/s) |
CPU Type | AMD EPYC™ 7413 (2.65GHz) | AMD EPYC™ 7413 (2.65GHz) | AMD EPYC™ 7413 (2.65GHz) | AMD EPYC™ 7413 (2.65GHz) |
vCPUs | 8 | 16 | 32 | 64 |
RAM | 48 GB DDR4 | 96 GB DDR4 | 192 GB DDR4 | 384 GB DDR4 |
Storage | Block | Block | Block | Block |
Scratch Storage | No | No | No | No |
MIG compatibility | No | No | No | No |
Bandwidth | 2.5 Gbps | 5 Gbps | 10 Gbps | 20 Gbps |
Better used for | Image Encoding (8K) | Video Encoding (8K) | 7B LLM Inference | 70B LLM Inference |
What they are not made for | Training of LLM | Training of LLM | Training of LLM | Training of LLM |
Specification | L40S-1-48G | L40S-2-48G | L40S-4-48G | L40S-8-48G |
---|---|---|---|---|
GPU Type | 1x L40S 48GB PCIe | 2x L40S 48GB PCIe | 4x L40S 48GB PCIe | 8x L40S 48GB PCIe |
NVIDIA architecture | Lovelace 2022 | Lovelace 2022 | Lovelace 2022 | Lovelace 2022 |
Tensor Cores | Yes | Yes | Yes | Yes |
Performance (training in FP16 Tensor Cores) | 362 TFLOPS | 724 TFLOPS | 1448 TFLOPS | 2896 TFLOPS |
VRAM | 48 GB GDDR6 (Memory bandwidth: 864 GB/s) | 2x 48 GB = 96 GB GDDR6 (Memory bandwidth: 864 GB/s) | 4x 48 GB = 192 GB GDDR6 (Memory bandwidth: 864 GB/s) | 8x 48 GB = 384 GB GDDR6 (Memory bandwidth: 864 GB/s) |
CPU Type | AMD EPYC™ 7413 (2.65GHz) | AMD EPYC™ 7413 (2.65GHz) | AMD EPYC™ 7413 (2.65GHz) | AMD EPYC™ 7413 (2.65GHz) |
vCPUs | 8 | 16 | 32 | 64 |
RAM | 96 GB DDR4 | 192 GB DDR4 | 384 GB DDR4 | 768 GB DDR4 |
Storage | Block | Block | Block | Block |
Scratch Storage | Yes (~1.6 TB NVMe) | Yes (~3.2 TB NVMe) | Yes (~6.4 TB NVMe) | Yes (~12.8 TB NVMe) |
MIG compatibility | No | No | No | No |
Bandwidth | 2.5 Gbps | 5 Gbps | 10 Gbps | 20 Gbps |
Use cases | GenAI (Image/Video) | GenAI (Image/Video) | 7B text-to-image model fine-tuning / Inference | 70B text-to-image model fine-tuning / Inference |
What they are not made for |
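The VRAM rows of the Instance tables above can be used to shortlist Instance types for a given memory requirement. The sketch below is illustrative only: the `candidates` helper is hypothetical, the figures are the per-Instance totals copied from those tables, and on multi-GPU Instances the total is spread across several GPUs, so a model has to be sharded to use it.

```python
# Total VRAM in GB per Instance type (GPU count x VRAM per GPU),
# copied from the VRAM rows of the tables above.
TOTAL_VRAM_GB = {
    "RENDER-S": 16,
    "H100-1-80G": 80,
    "H100-2-80G": 160,
    "L4-1-24G": 24,
    "L4-2-24G": 48,
    "L4-4-24G": 96,
    "L4-8-24G": 192,
    "L40S-1-48G": 48,
    "L40S-2-48G": 96,
    "L40S-4-48G": 192,
    "L40S-8-48G": 384,
}


def candidates(min_vram_gb: float) -> list:
    """Return Instance types with enough total VRAM, smallest first.

    Ties on VRAM are broken alphabetically by Instance name.
    """
    fits = [
        (vram, name)
        for name, vram in TOTAL_VRAM_GB.items()
        if vram >= min_vram_gb
    ]
    return [name for _, name in sorted(fits)]


if __name__ == "__main__":
    # Instance types offering at least 100 GB of total VRAM.
    print(candidates(100))
```

VRAM is only one criterion. The "Better used for" and "What they are not made for" rows still apply: for example, the tables list L4 Instances as unsuitable for LLM training regardless of total memory.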
Scaleway AI Supercomputer
Specification | Custom-built clusters (2 DGX H100, 16 H100 GPUs) | Custom-built clusters (127 DGX H100, 1016 H100 GPUs) |
---|---|---|
GPU Type | 16x H100 (SXM5) | 1,016x H100 (SXM5) |
NVIDIA architecture | Hopper 2022 | Hopper 2022 |
Tensor Cores | Yes | Yes |
Performance in PFLOPs FP8 Tensor Core | Up to 63.2 PFLOPS | Up to 4,021.3 PFLOPS |
VRAM | 1,280 GB (total cluster) | 81,280 GB (total cluster) |
CPU Type | Dual Intel® Xeon® Platinum 8480C Processors (3.8 GHz) | Dual Intel® Xeon® Platinum 8480C Processors (3.8 GHz) |
Total CPU cores | 224 cores (total cluster) | 14,224 cores (total cluster) |
RAM | 4 TB (total cluster) | 254 TB (total cluster) |
Storage | 64 TB of a3i DDN low latency storage | 1.8 PB of a3i DDN low latency storage |
MIG compatibility | Yes | Yes |
Inter-GPU bandwidth | InfiniBand 400 Gb/s | InfiniBand 400 Gb/s |
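The cluster-wide totals in this table scale linearly with the number of DGX H100 nodes. As a minimal sketch, the per-node constants below are derived by dividing the 2-node column by two (they are not taken from a separate datasheet), and the `cluster_totals` helper is hypothetical:

```python
# Per-node figures derived from the 2-node column of the table above.
GPUS_PER_NODE = 8       # 16 GPUs / 2 nodes
VRAM_PER_GPU_GB = 80    # 1280 GB / 16 GPUs
CORES_PER_NODE = 112    # 224 cores / 2 nodes
RAM_PER_NODE_TB = 2     # 4 TB / 2 nodes


def cluster_totals(nodes: int) -> dict:
    """Scale the per-node figures to a cluster of `nodes` DGX H100 systems."""
    gpus = nodes * GPUS_PER_NODE
    return {
        "gpus": gpus,
        "vram_gb": gpus * VRAM_PER_GPU_GB,
        "cpu_cores": nodes * CORES_PER_NODE,
        "ram_tb": nodes * RAM_PER_NODE_TB,
    }


if __name__ == "__main__":
    print(cluster_totals(2))
    print(cluster_totals(127))
```

For 127 nodes this gives 1016 GPUs, 81,280 GB of VRAM, 14,224 CPU cores, and 254 TB of RAM, which matches the second column of the table.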
NVIDIA GH200 Superchip
Specification | GH200 Grace Hopper™ |
---|---|
GPU Type | NVIDIA GH200 Grace Hopper™ Superchip |
NVIDIA architecture | GH200 Grace Hopper™ Architecture |
Performance | 990 TFLOPS (in FP16 Tensor Core) |
Specifications | - GH200 Superchip with 72 ARM Neoverse V2 cores - 480 GB of LPDDR5X DRAM - 96 GB of HBM3 GPU memory (Memory is fully merged for up to 576 GB of global usable memory) |
MIG compatibility | Yes |
Inter-GPU bandwidth (for clusters up to 256 GH200) | NVLink Switch System 900 GB/s |
Format & Features | Single chip up to GH200 clusters. (For larger setup needs, contact us) |
Use cases | - Extra large LLM and DL model inference - HPC |
What they are not made for | - Graphics - (Training) |