
Choosing the right GPU Instance type

Reviewed on 14 October 2024 • Published on 31 August 2022

A GPU Instance is a virtual computing environment provided by Scaleway that offers access to powerful Graphics Processing Units (GPUs) over the internet. GPUs are specialized hardware originally designed for rendering graphics in video games and other 3D applications, but their massively parallel architecture makes them ideal for many high-performance computing tasks, such as deep learning, large-scale machine learning, data processing, and scientific simulations.

The availability of Scaleway GPU Instances has revolutionized how researchers, developers, and organizations train complex machine-learning models, making training faster and more efficient. It empowers European AI startups by giving them the tools to create products that transform how we work and live, without the need for a huge CAPEX investment.

How to choose the right GPU Instance type

Scaleway provides a range of GPU Instance offers, from GPU RENDER Instances and H100 PCIe Instances to custom-built clusters. There are several factors to consider when choosing the right GPU Instance type to ensure that it meets your performance, budget, and scalability requirements. Below is a guide to help you make an informed decision:

  • Workload requirements: Identify the nature of your workload. Are you running machine learning, deep learning, high-performance computing (HPC), data analytics, or graphics-intensive applications? Different Instance types are optimized for different types of workloads: the H100, for example, is not designed for graphics rendering, while other models are. As Tim Dettmers puts it, “Tensor Cores are most important, followed by the memory bandwidth of a GPU, the cache hierarchy, and only then FLOPS of a GPU.” For more information, refer to the NVIDIA GPU portfolio.
  • Performance requirements: Evaluate the performance specifications you need, such as the number of GPUs, GPU memory, processing power, and network bandwidth. Demanding tasks, like training large deep learning models, require plenty of memory and fast storage.
  • GPU type: Scaleway offers different GPU types, such as various NVIDIA GPUs. Each GPU has varying levels of performance, memory, and capabilities. Choose a GPU that aligns with your specific workload requirements.
  • GPU memory: GPU memory bandwidth is an important criterion influencing overall performance, and larger GPU memory (VRAM) is crucial for memory-intensive tasks like training large deep learning models, especially when using larger batch sizes. Modern GPUs offer specialized data formats designed to optimize deep learning performance. These formats, including Bfloat16, FP8, INT8, and INT4, allow more data to be stored in memory and can enhance performance (for example, moving from FP16 to FP8 can double the number of TFLOPS). It is therefore crucial to select the appropriate architecture; options range from Pascal and Ampere to Ada Lovelace and Hopper. Ensure that the GPU has enough memory capacity for your specific workload, to prevent memory-related bottlenecks. Equally important is matching the GPU’s memory type to the nature of your workload.
  • CPU and RAM: A powerful CPU can be beneficial for tasks that involve preprocessing or post-processing. Sufficient system memory is also crucial to prevent memory-related bottlenecks or to cache your data in RAM.
  • GPU driver and software compatibility: Ensure that the GPU Instance type you choose supports the GPU drivers and software frameworks you need for your workload. This includes CUDA libraries, machine learning frameworks (TensorFlow, PyTorch, etc.), and other specific software tools. All Scaleway GPU OS images ship with a driver version that supports all GPUs, from the oldest to the latest models. Like the NGC CLI, nvidia-docker is preinstalled, enabling containers to be used with CUDA, cuDNN, and the main deep learning frameworks.
  • Scaling: Consider the scalability requirements of your workload. The most efficient ways to scale up your workload are:
    • A bigger GPU
    • Up to 2 PCIe GPUs with H100 Instances, or up to 8 PCIe GPUs with L4 or L40S Instances
    • An HGX-based server setup with 8 NVLink-connected GPUs
    • A supercomputer architecture for larger, workload-intensive setups
    • Another way to scale your workload is to use Kubernetes and MIG: a single H100 GPU can be divided into as many as 7 MIG partitions. This means that instead of employing seven P100 GPUs to set up seven K8S pods, you could use a single H100 GPU with MIG to deploy all seven pods.
  • Online resources: Check for online resources, forums, and community discussions related to the specific GPU type you are considering. This can provide insights into common issues, best practices, and optimizations.
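To illustrate the MIG point above, here is a minimal sketch (the function name is made up for illustration, not part of any Scaleway or NVIDIA API) of how partitioning changes the number of physical GPUs needed to give each Kubernetes pod its own GPU slice:

```python
# Hypothetical sketch: packing K8S pods onto GPUs with and without MIG.
# An H100 supports up to 7 MIG partitions, so 7 single-GPU pods can
# share one physical card instead of requiring 7 separate GPUs.
import math

def gpus_needed(num_pods: int, mig_partitions_per_gpu: int = 1) -> int:
    """Physical GPUs needed so each pod gets its own (full or MIG) slice."""
    return math.ceil(num_pods / mig_partitions_per_gpu)

# Without MIG: one pod per physical GPU.
print(gpus_needed(7))                            # 7
# With H100 MIG (up to 7 partitions per GPU): all pods share one card.
print(gpus_needed(7, mig_partitions_per_gpu=7))  # 1
```

The same arithmetic shows the trade-off at other scales: eight pods on MIG-partitioned H100s would still need two physical GPUs.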

Remember that there is no one-size-fits-all answer, and the right GPU Instance type will depend on your workload’s unique requirements and budget. It is important that you regularly reassess your choice as your workload evolves. Depending on which type best fits your evolving tasks, you can easily migrate from one GPU Instance type to another.
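When reassessing your choice, a quick back-of-the-envelope check of VRAM needs is often enough to shortlist Instance types. The sketch below (an informal estimate, not an official sizing tool) computes the memory needed just to hold a model's weights at a given precision, using the per-format byte sizes discussed above:

```python
# Rough VRAM estimate for model weights only; activations, KV cache,
# and optimizer states add more on top of this.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1, "int8": 1}

def weights_gb(num_params: float, dtype: str) -> float:
    """GB of VRAM needed to store the raw weights at the given precision."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

# A 7B-parameter model in FP16 needs ~14 GB for weights, so it fits on
# a single 80 GB GPU; a 70B model in FP16 (~140 GB) needs two of them.
print(weights_gb(7e9, "fp16"))   # 14.0
print(weights_gb(70e9, "fp16"))  # 140.0
```

Dropping from FP16 to FP8 or INT8 halves these figures, which is one reason the quantized formats mentioned earlier matter when sizing an Instance.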

GPU Instances and AI Supercomputer comparison table

Scaleway GPU Instances types overview

|  | RENDER-S | H100-1-80G | H100-2-80G |
|---|---|---|---|
| GPU Type | 1x P100 (PCIe Gen3) | 1x H100 (PCIe Gen5) | 2x H100 (PCIe Gen5) |
| NVIDIA architecture | Pascal (2016) | Hopper (2022) | Hopper (2022) |
| Tensor Cores | N/A | Yes | Yes |
| Performance (training in FP16 Tensor Cores) | No Tensor Cores (9.3 TFLOPS FP32) | 1513 TFLOPS | 2x 1513 TFLOPS |
| VRAM | 16 GB CoWoS HBM2 (memory bandwidth: 732 GB/s) | 80 GB HBM2e (memory bandwidth: 2 TB/s) | 2x 80 GB HBM2e (memory bandwidth: 2 TB/s) |
| CPU Type | Intel Xeon Gold 6148 (2.4 GHz) | AMD EPYC™ 9334 (2.7 GHz) | AMD EPYC™ 9334 (2.7 GHz) |
| vCPUs | 10 | 24 | 48 |
| RAM | 42 GB DDR3 | 240 GB DDR5 | 480 GB DDR5 |
| Storage | Block/Local | Block | Block |
| Scratch storage | No | Yes (3 TB NVMe) | Yes (6 TB NVMe) |
| MIG compatibility | No | Yes | Yes |
| Bandwidth | 1 Gbps | 10 Gbps | 20 Gbps |
| Better used for | Image/video encoding (4K) | 7B LLM fine-tuning/inference | 70B LLM fine-tuning/inference |
| What they are not made for | Large models (especially LLMs) | Graphics or video encoding use cases | Graphics or video encoding use cases |
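As an illustration of reading the table above, this sketch (an informal helper, not an official Scaleway tool; the VRAM figures are taken from the table) picks the smallest Instance type whose total VRAM fits a workload:

```python
# Illustrative Instance shortlist by total VRAM (GB), from the table above.
INSTANCE_VRAM_GB = {
    "RENDER-S": 16,
    "H100-1-80G": 80,
    "H100-2-80G": 160,
}

def smallest_fit(required_vram_gb):
    """Return the smallest Instance (by VRAM) that fits, or None."""
    for name, vram in sorted(INSTANCE_VRAM_GB.items(), key=lambda kv: kv[1]):
        if vram >= required_vram_gb:
            return name
    return None  # needs a bigger setup, e.g. an HGX server or a cluster

# A 70B-parameter model in FP16 needs roughly 140 GB for its weights:
print(smallest_fit(140))  # H100-2-80G
```

VRAM is only a first filter: as noted above, a RENDER-S would "fit" a small encoding job but lacks Tensor Cores, so the workload rows of the table still apply.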
|  | L4-1-24G | L4-2-24G | L4-4-24G | L4-8-24G |
|---|---|---|---|---|
| GPU Type | 1x L4 (PCIe Gen4) | 2x L4 (PCIe Gen4) | 4x L4 (PCIe Gen4) | 8x L4 (PCIe Gen4) |
| NVIDIA architecture | Lovelace (2022) | Lovelace (2022) | Lovelace (2022) | Lovelace (2022) |
| Tensor Cores | Yes | Yes | Yes | Yes |
| Performance (training in FP16 Tensor Cores) | 242 TFLOPS | 2x 242 TFLOPS | 4x 242 TFLOPS | 8x 242 TFLOPS |
| VRAM | 24 GB GDDR6 (memory bandwidth: 300 GB/s) | 2x 24 GB GDDR6 (memory bandwidth: 300 GB/s) | 4x 24 GB GDDR6 (memory bandwidth: 300 GB/s) | 8x 24 GB GDDR6 (memory bandwidth: 300 GB/s) |
| CPU Type | AMD EPYC™ 7413 (2.65 GHz) | AMD EPYC™ 7413 (2.65 GHz) | AMD EPYC™ 7413 (2.65 GHz) | AMD EPYC™ 7413 (2.65 GHz) |
| vCPUs | 8 | 16 | 32 | 64 |
| RAM | 48 GB DDR4 | 96 GB DDR4 | 192 GB DDR4 | 384 GB DDR4 |
| Storage | Block | Block | Block | Block |
| Scratch storage | No | No | No | No |
| MIG compatibility | No | No | No | No |
| Bandwidth | 2.5 Gbps | 5 Gbps | 10 Gbps | 20 Gbps |
| Better used for | Image encoding (8K) | Video encoding (8K) | 7B LLM inference | 70B LLM inference |
| What they are not made for | Training of LLMs | Training of LLMs | Training of LLMs | Training of LLMs |
|  | L40S-1-48G | L40S-2-48G | L40S-4-48G | L40S-8-48G |
|---|---|---|---|---|
| GPU Type | 1x L40S 48GB PCIe | 2x L40S 48GB PCIe | 4x L40S 48GB PCIe | 8x L40S 48GB PCIe |
| NVIDIA architecture | Lovelace (2022) | Lovelace (2022) | Lovelace (2022) | Lovelace (2022) |
| Tensor Cores | Yes | Yes | Yes | Yes |
| Performance (training in FP16 Tensor Cores) | 362 TFLOPS | 724 TFLOPS | 1448 TFLOPS | 2896 TFLOPS |
| VRAM | 48 GB GDDR6 (memory bandwidth: 864 GB/s) | 2x 48 GB = 96 GB GDDR6 (memory bandwidth: 864 GB/s) | 4x 48 GB = 192 GB GDDR6 (memory bandwidth: 864 GB/s) | 8x 48 GB = 384 GB GDDR6 (memory bandwidth: 864 GB/s) |
| CPU Type | AMD EPYC™ 7413 (2.65 GHz) | AMD EPYC™ 7413 (2.65 GHz) | AMD EPYC™ 7413 (2.65 GHz) | AMD EPYC™ 7413 (2.65 GHz) |
| vCPUs | 8 | 16 | 32 | 64 |
| RAM | 96 GB DDR4 | 192 GB DDR4 | 384 GB DDR4 | 768 GB DDR4 |
| Storage | Block | Block | Block | Block |
| Scratch storage | Yes (~1.6 TB NVMe) | Yes (~3.2 TB NVMe) | Yes (~6.4 TB NVMe) | Yes (~12.8 TB NVMe) |
| MIG compatibility | No | No | No | No |
| Bandwidth | 2.5 Gbps | 5 Gbps | 10 Gbps | 20 Gbps |
| Use cases | GenAI (image/video) | GenAI (image/video) | 7B text-to-image model fine-tuning/inference | 70B text-to-image model fine-tuning/inference |

Scaleway AI Supercomputer

|  | Custom-built cluster (2 DGX H100, 16 H100 GPUs) | Custom-built cluster (127 DGX H100, 1,016 H100 GPUs) |
|---|---|---|
| GPU Type | 16x H100 (SXM5) | 1,016x H100 (SXM5) |
| NVIDIA architecture | Hopper (2022) | Hopper (2022) |
| Tensor Cores | Yes | Yes |
| Performance in PFLOPS (FP8 Tensor Core) | Up to 63.2 PFLOPS | Up to 4,021.3 PFLOPS |
| VRAM | 1,280 GB (total cluster) | 81,280 GB (total cluster) |
| CPU Type | Dual Intel® Xeon® Platinum 8480C processors (3.8 GHz) | Dual Intel® Xeon® Platinum 8480C processors (3.8 GHz) |
| Total CPU cores | 224 cores (total cluster) | 14,224 cores (total cluster) |
| RAM | 4 TB (total cluster) | 254 TB (total cluster) |
| Storage | 64 TB of A3I DDN low-latency storage | 1.8 PB of A3I DDN low-latency storage |
| MIG compatibility | Yes | Yes |
| Inter-GPU bandwidth | InfiniBand 400 Gb/s | InfiniBand 400 Gb/s |

NVIDIA GH200 Superchip

|  | GH200 Grace Hopper™ |
|---|---|
| GPU Type | NVIDIA GH200 Grace Hopper™ Superchip |
| NVIDIA architecture | GH200 Grace Hopper™ architecture |
| Performance | 990 TFLOPS (FP16 Tensor Core) |
| Specifications | GH200 Superchip with 72 Arm Neoverse V2 cores; 480 GB of LPDDR5X DRAM; 96 GB of HBM3 GPU memory (memory is fully merged, for up to 576 GB of globally usable memory) |
| MIG compatibility | Yes |
| Inter-GPU bandwidth (for clusters of up to 256 GH200) | NVLink Switch System, 900 GB/s |
| Format & features | Single chip up to GH200 clusters (for larger setup needs, contact us) |
| Use cases | Extra-large LLM and DL model inference; HPC |
| What they are not made for | Graphics rendering; training |