How to use scratch storage with H100 GPUs and Kubernetes Kapsule

Reviewed on 27 December 2023
Published on 27 December 2023

Scratch storage is ephemeral storage (up to 6 TB) allocated to H100 GPU Instances. It is designed to hold temporary data in high-performance computing environments, especially GPU workloads. Kubernetes, a container orchestration platform, can be configured to expose scratch storage to your workloads for improved performance.

Its utility spans multiple use cases, including but not limited to:

  • Large dataset storage for training: H100 scratch storage provides a local, high-performance storage option for storing large datasets required for training machine learning models. This can significantly enhance the efficiency of training workflows by reducing data transfer times.
  • Checkpointing during training: The scratch volume allows for the seamless saving of checkpoints during model training. This is important for resuming training processes in case of interruptions or for exploring different model iterations without starting from the beginning.
  • Temporary storage for preprocessing: For data preprocessing tasks that demand high I/O throughput and low latency, H100 scratch storage offers a temporary space on the host node’s filesystem. This can be advantageous for quick and efficient preprocessing steps in complex data pipelines.
  • Experimentation and output evaluation: Experimenting with different configurations and hyperparameters often involves iterative processes. H100 scratch storage makes quick storage and retrieval of intermediary results easy, enabling efficient experimentation and evaluation of model outputs.

Design your workloads or applications to take advantage of the fast and temporary nature of scratch storage. Consider configuring your applications to store intermediate or temporary data in the scratch storage path.

Before you start

To complete the actions presented below, you must have:

  • A Scaleway account logged into the console
  • Owner status or IAM permissions allowing you to perform actions in the intended Organization
  • Created a Kubernetes Kapsule or Kosmos cluster that uses H100 GPU Instances

In Kubernetes, H100 scratch storage is implemented as a directory on the host node’s filesystem. This scratch volume is made available to containers within a Pod using the hostPath volume type, allowing direct integration with Kubernetes workloads.

Configuring H100 scratch storage in Kubernetes involves specifying the hostPath volume type in the Pod’s volume definition. The following is an example configuration:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-awesome-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-awesome-deployment
  template:
    metadata:
      labels:
        app: my-awesome-deployment
    spec:
      containers:
        - name: log-generator
          image: busybox
          args: ["/bin/sh", "-c", "while true; do date >> /scratch/logs/test.log; sleep 10; done"]
          volumeMounts:
            - mountPath: /scratch/logs
              name: scratch-volume
      volumes:
        - name: scratch-volume
          hostPath:
            path: /scratch

In this example, the node's /scratch directory is mounted at /scratch/logs within the container, so the log file written by the container is stored on the scratch volume.
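
By default, a hostPath volume has an empty type, which means Kubernetes performs no check on the path before mounting it. The following is a minimal sketch of a variant that sets the hostPath type to Directory, so the mount fails visibly if /scratch does not exist on the node, and that points the standard TMPDIR environment variable at the scratch mount so that tools honoring TMPDIR write their temporary files there. The Pod name, the container name, and the probe file are illustrative assumptions, not part of the original guide.

```yaml
# Sketch only: names and the probe command are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: scratch-check              # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: worker                 # hypothetical name
      image: busybox
      env:
        - name: TMPDIR             # many tools use this variable to locate temporary files
          value: /scratch
      command: ["/bin/sh", "-c"]
      args:
        - |
          # Write a probe file to the scratch mount and list the directory.
          date > "$TMPDIR/probe.txt"
          ls -l "$TMPDIR"
      volumeMounts:
        - name: scratch-volume
          mountPath: /scratch
  volumes:
    - name: scratch-volume
      hostPath:
        path: /scratch
        type: Directory            # the mount fails if /scratch is missing on the node
```

With this setting, a Pod scheduled on a node without a /scratch directory should remain pending with a mount error instead of running without scratch storage.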


Note that H100 scratch storage is ephemeral, meaning that the data is not permanently stored. If a node crashes or is replaced, the data stored in the scratch volume will be wiped. Therefore, only use H100 scratch storage for data that can be regenerated, or copy anything you need to keep to persistent storage.
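
One way to limit the impact of a wiped scratch volume is to copy important files, such as checkpoints, to persistent storage as they are produced. The following is a minimal sketch, assuming a PersistentVolumeClaim named checkpoints-pvc already exists in the same namespace. That claim name, the Pod name, the /persistent mount path, and the checkpoint file are all hypothetical, and the busybox commands stand in for a real training workload.

```yaml
# Sketch only: all names and paths except /scratch are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: train-with-checkpoints     # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: trainer                # hypothetical name
      image: busybox               # stand-in for a real training image
      command: ["/bin/sh", "-c"]
      args:
        - |
          mkdir -p /scratch/checkpoints
          # Stand-in for a training step that writes a checkpoint to scratch storage.
          date > /scratch/checkpoints/step-1.ckpt
          # Copy the checkpoint to persistent storage so it survives node replacement.
          cp /scratch/checkpoints/step-1.ckpt /persistent/
      volumeMounts:
        - name: scratch-volume
          mountPath: /scratch
        - name: persistent-volume
          mountPath: /persistent
  volumes:
    - name: scratch-volume
      hostPath:
        path: /scratch
    - name: persistent-volume
      persistentVolumeClaim:
        claimName: checkpoints-pvc # hypothetical; must already exist in the namespace
```

The workload still reads and writes at scratch speed during training, and only the files worth keeping are copied to the slower persistent volume.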

H100 scratch storage in Kubernetes provides a valuable tool for handling diverse data-intensive tasks efficiently. By understanding its features and limitations, developers and data scientists can use the capabilities of H100 scratch storage to enhance the performance and scalability of their Kubernetes workloads.

See also
  • How to use the NVIDIA GPU operator on Kapsule and Kosmos with GPU Instances
  • How to deploy x86 and ARM images in Kubernetes
© 2023-2024 – Scaleway