How to use scratch storage with H100 and L40S GPUs and Kubernetes Kapsule

Reviewed on 08 July 2024 • Published on 27 December 2023

Scratch storage is ephemeral storage allocated to H100 and L40S GPU Instances (up to 6 TB) and crucial for handling temporary data in high-performance computing environments, especially when working with GPUs like the H100. Kubernetes, a powerful container orchestration platform, can be configured to make optimal use of scratch storage for improved performance.

Its utility spans multiple use cases, including but not limited to:

Large dataset storage for training: H100 and L40S scratch storage provides a local, high-performance storage option for storing large datasets required for training machine learning models. This can significantly enhance the efficiency of training workflows by reducing data transfer times.
Checkpointing during training: The scratch volume allows for the seamless saving of checkpoints during model training. This is important for resuming training processes in case of interruptions or for exploring different model iterations without starting from the beginning.
Temporary storage for preprocessing: For data preprocessing tasks that demand high I/O throughput and low latency, H100 and L40S scratch storage offers a temporary space on the host node’s file system. This can be advantageous for quick and efficient preprocessing steps in complex data pipelines.
Experimentation and output evaluation: Experimenting with different configurations and hyperparameters often involves iterative processes. H100 and L40S scratch storage make quick storage and retrieval of intermediary results easy, enabling efficient experimentation and evaluation of model outputs.

Design your workloads or applications to take advantage of the fast and temporary nature of Scratch storage. Consider configuring your applications to store intermediate or temporary data in the Scratch storage path.

Before you start

To complete the actions presented below, you must have:

A Scaleway account logged into the console
Owner status or IAM permissions allowing you to perform actions in the intended Organization
Created a Kubernetes Kapsule or Kosmos cluster that uses H100 and L40S GPU Instances

In Kubernetes, H100 and L40S scratch storage is implemented as a directory on the host node’s file system. This scratch volume is made available to containers within a pod using the hostPath volume type, allowing direct integration with Kubernetes workloads.

Configuring H100 and L40S scratch storage in Kubernetes involves specifying the hostPath volume type in the pod’s volume definition. The following is an example configuration:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-awesome-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-awesome-deployment
  template:
    metadata:
      labels:
        app: my-awesome-deployment
    spec:
      containers:
        - name: log-generator
          image: busybox
          args: ["/bin/sh", "-c", "while true; do date >> /scratch/logs/test.log; sleep 10; done"]
          volumeMounts:
            - mountPath: /scratch/logs
              name: scratch-volume
      volumes:
        - name: scratch-volume
          hostPath:
            path: /scratch

In this example, the scratch volume is mounted at /scratch/logs within the container.

Important

Note that H100 and L40S scratch storage is ephemeral, meaning that the data is not permanently stored. If a node crashes or is replaced, the data stored in the scratch volume will be wiped. Therefore, it is crucial to use H100 and L40S scratch storage carefully based on the specific requirements of the workload.

H100 and L40S scratch storage in Kubernetes provides a valuable tool for handling diverse data-intensive tasks efficiently. By understanding its features and limitations, developers and data scientists can use the capabilities of H100 and L40S scratch storage to enhance the performance and scalability of their Kubernetes workloads.