NavigationContentFooter
Suggest an edit

How to use scratch storage with H100 GPUs and Kubernetes Kapsule

Reviewed on 27 December 2023Published on 27 December 2023
Security & Identity (IAM)

To perform certain actions described below, you must either be the Owner of the Organization in which the actions will be performed or an IAM user with the necessary permissions.

Requirements
  • You have an account and are logged into the Scaleway console
  • You have created a Kubernetes Kapsule or Kosmos cluster
  • Your cluster is using H100 GPU Instances

Scratch storage is ephemeral storage allocated to H100 GPU Instances (up to 6 TB) and crucial for handling temporary data in high-performance computing environments, especially when working with GPUs like the H100. Kubernetes, a powerful container orchestration platform, can be configured to make optimal use of scratch storage for improved performance.

Its utility spans multiple use cases, including but not limited to:

  • Large dataset storage for training: H100 scratch storage provides a local, high-performance storage option for storing large datasets required for training machine learning models. This can significantly enhance the efficiency of training workflows by reducing data transfer times.
  • Checkpointing during training: The scratch volume allows for the seamless saving of checkpoints during model training. This is important for resuming training processes in case of interruptions or for exploring different model iterations without starting from the beginning.
  • Temporary storage for preprocessing: For data preprocessing tasks that demand high I/O throughput and low latency, H100 scratch storage offers a temporary space on the host node’s filesystem. This can be advantageous for quick and efficient preprocessing steps in complex data pipelines.
  • Experimentation and output evaluation: Experimenting with different configurations and hyperparameters often involves iterative processes. H100 scratch storage makes quick storage and retrieval of intermediary results easy, enabling efficient experimentation and evaluation of model outputs.

Design your workloads or applications to take advantage of the fast and temporary nature of Scratch storage. Consider configuring your applications to store intermediate or temporary data in the Scratch storage path. In Kubernetes, H100 scratch storage is implemented as a directory on the host node’s filesystem. This scratch volume is made available to containers within a Pod using the hostPath volume type, allowing direct integration with Kubernetes workloads.

Configuring H100 scratch storage in Kubernetes involves specifying the hostPath volume type in the Pod’s volume definition. The following is an example configuration:

apiVersion: apps/v1
kind: Deployment
metadata:
name: my-awesome-deployment
spec:
replicas: 1
selector:
matchLabels:
app: my-awesome-deployment
template:
metadata:
labels:
app: my-awesome-deployment
spec:
containers:
- name: log-generator
image: busybox
args: ["/bin/sh", "-c", "while true; do date >> /scratch/logs/test.log; sleep 10; done"]
volumeMounts:
- mountPath: /scratch/logs
name: scratch-volume
volumes:
- name: scratch-volume
hostPath:
path: /scratch

In this example, the scratch volume is mounted at /scratch/logs within the container.

Important

Note that H100 scratch storage is ephemeral, meaning that the data is not permanently stored. If a node crashes or is replaced, the data stored in the scratch volume will be wiped. Therefore, it is crucial to use H100 scratch storage carefully based on the specific requirements of the workload.

H100 scratch storage in Kubernetes provides a valuable tool for handling diverse data-intensive tasks efficiently. By understanding its features and limitations, developers and data scientists can use the capabilities of H100 scratch storage to enhance the performance and scalability of their Kubernetes workloads.

See also
How to use the NVIDIA GPU operator on Kapsule and Kosmos with GPU InstancesHow to delete a cluster
Cloud Products & Resources
  • Scaleway Console
  • Compute
  • Storage
  • Network
  • IoT
  • AI
Dedicated Products & Resources
  • Dedibox Console
  • Dedibox Servers
  • Network
  • Web Hosting
Scaleway
  • Scaleway.com
  • Blog
  • Careers
  • Scaleway Learning
Follow us
FacebookTwitterSlackInstagramLinkedin
ContractsLegal NoticePrivacy PolicyCookie PolicyDocumentation license
© 1999-2024 – Scaleway SAS