Migrating from H100-2-80G to H100-SXM-2-80G
Scaleway is optimizing its H100 GPU Instance portfolio to improve long-term availability and provide better performance for all users.
For optimal availability and performance, we recommend switching from H100-2-80G to the improved H100-SXM-2-80G GPU Instance.
Benefits of the migration
- More stock, for better long-term availability
- Improved NVLink for faster GPU-to-GPU communication
- Better and faster VRAM
There are two primary migration scenarios: Kubernetes (Kapsule) workloads and standalone Instances.
Migrating Kubernetes workloads (Kubernetes Kapsule)
If you are using Kapsule, follow these steps to move existing workloads to nodes powered by H100-SXM-2-80G GPUs.
Step-by-step
1. Create a new node pool using `H100-SXM-2-80G` GPU Instances.
2. Run `kubectl get nodes` to check that the new nodes are in a `Ready` state.
3. Cordon the nodes in the old node pool to prevent new Pods from being scheduled there. For each node, run:
   ```
   kubectl cordon <node-name>
   ```
4. Drain the nodes to evict the Pods gracefully. For each node, run:
   ```
   kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
   ```
   - The `--ignore-daemonsets` flag is used because DaemonSets manage Pods across all nodes and will automatically reschedule them.
   - The `--delete-emptydir-data` flag is necessary if your Pods use `emptyDir` volumes. Use this option carefully, as it deletes the data stored in these volumes.

   Refer to the official Kubernetes documentation for further information. A batch version of the cordon-and-drain steps is sketched after this list.
5. Run `kubectl get pods -o wide` after draining to verify that the Pods have been rescheduled to the new node pool.
6. Delete the old node pool.
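If the old pool contains many nodes, you can cordon and drain them in one pass. The following sketch assumes the old pool is named `old-h100-pool` (a hypothetical name) and that Kapsule nodes carry a pool-name label such as `k8s.scaleway.com/pool-name`; verify the exact label key on your cluster with `kubectl get nodes --show-labels` before running it.

```
# Sketch: cordon and drain every node of the old pool in one pass.
# "old-h100-pool" and the label key are assumptions - check them first.
POOL_LABEL="k8s.scaleway.com/pool-name=old-h100-pool"

for node in $(kubectl get nodes -l "$POOL_LABEL" -o name); do
  kubectl cordon "$node"
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
done

# Confirm that the Pods have been rescheduled onto the new pool.
kubectl get pods -o wide
```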
Migrating a standalone Instance
For standalone GPU Instances, you can recreate your environment on an H100-SXM-2-80G GPU Instance using the CLI, the API, or the Scaleway console.
Quick Start (CLI example):
1. Stop the Instance:
   ```
   scw instance server stop <instance_id> zone=<zone>
   ```
   Replace `<instance_id>` with the ID of your Instance and `<zone>` with its Availability Zone. For example, if your Instance is located in Paris-1, the zone is `fr-par-1`.
2. Update the commercial type of the Instance:
   ```
   scw instance server update <instance_id> commercial-type=H100-SXM-2-80G zone=<zone>
   ```
3. Power on the Instance:
   ```
   scw instance server start <instance_id> zone=<zone>
   ```
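After the Instance is back up, it is worth confirming that the new commercial type took effect and that the GPUs are visible. Below is a minimal sketch, assuming the `scw` CLI is configured, `jq` is installed, and the JSON field is named `commercial_type` (verify the field name against your CLI version); `<instance_ip>` is a placeholder for your Instance's public IP.

```
# Sketch: verify the Instance now runs as H100-SXM-2-80G.
# The JSON field name is an assumption - check it for your CLI version.
scw instance server get <instance_id> zone=<zone> -o json | jq '.commercial_type'

# From inside the Instance, confirm the driver sees the GPUs.
ssh root@<instance_ip> nvidia-smi
```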
For further information, refer to the Instance CLI documentation.
FAQ
Are PCIe-based H100s being discontinued?
H100 PCIe-based GPU Instances are not End-of-Life (EOL), but due to limited availability, we recommend migrating to H100-SXM-2-80G to avoid future disruptions.
Is H100-SXM-2-80G compatible with my current setup?
Yes. It runs the same CUDA toolchain and supports standard frameworks (PyTorch, TensorFlow, etc.), so no changes to your code base are required when upgrading to an SXM-based GPU Instance.
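As a quick sanity check after moving a workload, you can confirm from inside the Instance that your framework sees the GPUs. The one-liner below assumes a PyTorch environment is already installed; adapt it to your framework:

```
# Check that the driver and PyTorch both see the GPUs.
nvidia-smi
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```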
Why is the H100-SXM better for multi-GPU workloads?
The NVIDIA H100-SXM outperforms the H100-PCIe in multi-GPU configurations primarily due to its higher interconnect bandwidth and greater power capacity. It uses fourth-generation NVLink and NVSwitch, delivering up to 900 GB/s of bidirectional bandwidth for fast GPU-to-GPU communication. In contrast, the H100-PCIe is limited to a theoretical maximum of 128 GB/s via PCIe Gen 5, which becomes a bottleneck in communication-heavy workloads such as large-scale AI training and HPC.

The H100-SXM also provides HBM3 memory with up to 3.35 TB/s of bandwidth, compared to 2 TB/s with the H100-PCIe's HBM2e, improving performance in memory-bound tasks. Additionally, the H100-SXM's 700 W TDP allows higher sustained clock speeds and throughput, while the H100-PCIe's 300-350 W TDP imposes stricter performance limits.

Overall, the H100-SXM is the optimal choice for high-communication, multi-GPU workloads, whereas the H100-PCIe offers more flexibility for less communication-intensive applications.
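You can inspect which interconnect your Instance actually uses from inside the Instance with standard `nvidia-smi` subcommands:

```
# Show the GPU-to-GPU interconnect matrix: NVLink connections appear as
# NV#, while PCIe-only paths show entries such as PIX, PHB, or SYS.
nvidia-smi topo -m

# Show per-link NVLink status and speed (empty if NVLink is not present).
nvidia-smi nvlink --status
```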