
NVIDIA GPU time-slicing with Kubernetes

Reviewed on 25 April 2023 | Published on 25 March 2022

NVIDIA GPUs are powerful hardware commonly used for model training, deep learning, scientific simulations, and data processing tasks. Kubernetes (K8s) is a container orchestration platform that helps deploy and manage containerized applications.

Time-slicing in the context of NVIDIA GPUs and Kubernetes refers to sharing a physical GPU among multiple containers or pods in a Kubernetes cluster.

Time-slicing partitions the GPU’s processing time into small intervals and allocates those intervals to different containers or pods. This technique allows multiple workloads to run on the same physical GPU and share its resources. Unlike MIG, described further down this page, it does not provide memory or fault isolation between the workloads.

Operational procedures of GPU time-slicing in Kubernetes

Time-slicing NVIDIA GPUs with Kubernetes involves:

  • Dynamically allocating and sharing GPU resources among multiple containers or pods in a cluster.
  • Allowing each pod or container to use the GPU for a specific time interval before switching to another.
  • Efficiently using the available GPU capacity.

This allows multiple workloads to utilize the GPU by taking turns in rapid succession.

  • GPU sharing: Time-slicing involves sharing a single GPU among containers or pods by allocating small time intervals. Sharing is achieved by rapidly switching between different containers or pods, allowing them to use the GPU for a short duration before moving on to the next workload.
  • GPU context switching: Between two time intervals, the GPU saves one workload’s state, loads another’s, and resumes processing. Modern GPUs are designed to handle context switching efficiently, although each switch still adds some overhead.
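
The turn-taking described above can be illustrated with a small simulation. This is a toy model written for this page, not NVIDIA's actual scheduler: the function name, the pod names, and the 2 ms slice length are all hypothetical, and a real GPU decides slice lengths and ordering in its driver and hardware.

```python
from collections import deque


def simulate_time_slicing(workloads, slice_ms):
    """Round-robin one GPU among several workloads (toy model).

    workloads: dict mapping a pod name to the GPU time (ms) it still needs.
    slice_ms:  length of one time slice in ms (hypothetical value).
    Returns (timeline, context_switches), where timeline is a list of
    (pod, start_ms, end_ms) tuples.
    """
    queue = deque(workloads.items())
    timeline = []
    clock = 0
    switches = 0
    previous = None
    while queue:
        pod, remaining = queue.popleft()
        if previous is not None and previous != pod:
            # The state of `previous` is saved and the state of `pod` is loaded.
            switches += 1
        run = min(slice_ms, remaining)
        timeline.append((pod, clock, clock + run))
        clock += run
        remaining -= run
        if remaining > 0:
            # Not finished yet: go to the back of the queue and wait for a turn.
            queue.append((pod, remaining))
        previous = pod
    return timeline, switches


timeline, switches = simulate_time_slicing({"pod-a": 5, "pod-b": 3}, slice_ms=2)
for pod, start, end in timeline:
    print(f"{start}-{end} ms: {pod}")
print("context switches:", switches)
```

In this run, the two pods alternate on the GPU until both have received the GPU time they need, and every change of pod counts as one context switch.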

Management of GPU time-slicing within the Kubernetes cluster

Several elements within the Kubernetes cluster oversee the time-slicing of GPUs:

  • GPU scheduling: The Kubernetes scheduler determines which pods are placed on nodes with available GPUs. This placement is based on the GPU resources each pod asks for and on the GPUs advertised by the nodes in the cluster. The scheduler does not perform the time-slicing itself: once several pods share a GPU, the GPU and its driver switch between their workloads.
  • GPU device plugin: Kubernetes uses the NVIDIA GPU device plugin to expose the GPUs available on each node to the cluster’s scheduler. With time-slicing enabled, the plugin advertises each physical GPU as several replicas, so that several pods can be scheduled onto the same GPU.
  • Container GPU requests and limits: When defining a container or pod in Kubernetes, you specify the number of GPUs it needs in the container’s resource limits, under the nvidia.com/gpu resource name. GPUs are allocated in whole units and cannot be overcommitted: if you also set a request, it must be equal to the limit. This value guides the Kubernetes scheduler in making placement decisions.
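
The elements above can be sketched as follows. This is a minimal illustration, not an official manifest: it builds, as Python dictionaries, a time-slicing configuration in the format read by the NVIDIA device plugin and a pod that asks for one shared GPU. The replica count, the helper names, the pod name, and the container image are hypothetical placeholders.

```python
import json

# Assumption: 4 replicas per physical GPU. Pick a value that fits your workloads.
REPLICAS_PER_GPU = 4

# Time-slicing configuration for the NVIDIA device plugin.
# It is usually supplied to the plugin through a ConfigMap.
time_slicing_config = {
    "version": "v1",
    "sharing": {
        "timeSlicing": {
            "resources": [
                {"name": "nvidia.com/gpu", "replicas": REPLICAS_PER_GPU}
            ]
        }
    },
}


def gpu_pod(name, image, gpus=1):
    """Return a pod manifest whose single container asks for `gpus` GPUs.

    The GPU count goes under `limits`; Kubernetes then uses the same
    value as the request.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


def advertised_gpus(physical_gpus, replicas=REPLICAS_PER_GPU):
    """Number of nvidia.com/gpu units a node advertises with time-slicing."""
    return physical_gpus * replicas


# Hypothetical pod name and image, for illustration only.
pod = gpu_pod("timeslice-demo", "example.com/cuda-app:latest")
print(json.dumps(time_slicing_config, indent=2))
print(json.dumps(pod, indent=2))
print("GPU units advertised by a node with 2 GPUs:", advertised_gpus(2))
```

Because Kubernetes accepts manifests in JSON as well as YAML, the printed pod definition could be saved to a file and applied with kubectl. With this configuration, a pod that asks for one nvidia.com/gpu receives a time-sliced share of a GPU rather than exclusive access to it.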

Time-slicing compared to MIG

Recent generations of NVIDIA GPUs, starting with the Ampere architecture, support Multi-Instance GPU (MIG) mode. Fully integrated into Kubernetes in 2020, MIG allows a single GPU to be partitioned into smaller, predefined instances that behave like miniature GPUs. These instances provide memory and fault isolation directly at the hardware level. Instead of using the entire native GPU, you can run workloads on one of these predefined instances, enabling shared GPU access.

Kubernetes GPU time-slicing divides GPU resources at the container level: multiple containers or pods share a single GPU. MIG divides GPU resources at the hardware level, and each MIG instance behaves like a separate GPU. Time-slicing can open a GPU to more users than MIG, but it sacrifices the memory and fault isolation that MIG provides. Time-slicing also enables shared GPU access on earlier GPU generations that lack MIG support. The two techniques can be combined: time-slicing a MIG instance extends shared access to that instance.
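
The trade-off between the two approaches can be made concrete with some simple arithmetic. The sketch below is an assumption-laden illustration rather than a sizing tool: the class name, its fields, and the example figures (seven MIG instances, two replicas) are hypothetical, and the instance counts actually available depend on the GPU model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharingPlan:
    """Describe how the GPUs of one node are shared (illustrative only)."""

    physical_gpus: int
    mig_instances_per_gpu: int = 1  # 1 means MIG is not used
    time_slice_replicas: int = 1    # 1 means time-slicing is not used

    @property
    def hardware_isolated_units(self):
        # Whole GPUs and MIG instances are isolated from each other in hardware.
        return self.physical_gpus * self.mig_instances_per_gpu

    @property
    def schedulable_units(self):
        # Time-slicing multiplies the units that pods can be scheduled onto.
        return self.hardware_isolated_units * self.time_slice_replicas

    @property
    def pods_sharing_each_unit(self):
        # Pods on the same unit share it without memory or fault isolation.
        return self.time_slice_replicas


mig_only = SharingPlan(physical_gpus=1, mig_instances_per_gpu=7)
slicing_only = SharingPlan(physical_gpus=1, time_slice_replicas=4)
combined = SharingPlan(physical_gpus=1, mig_instances_per_gpu=7, time_slice_replicas=2)

for label, plan in [("MIG only", mig_only),
                    ("time-slicing only", slicing_only),
                    ("MIG + time-slicing", combined)]:
    print(f"{label}: {plan.schedulable_units} schedulable units, "
          f"{plan.hardware_isolated_units} isolated in hardware, "
          f"{plan.pods_sharing_each_unit} pod(s) per unit")
```

Under these assumptions, combining both techniques yields the most schedulable units, while isolation still stops at the boundary of each MIG instance.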

For more information and examples about NVIDIA GPU time-slicing with Kubernetes, refer to the official documentation.

Using time-slicing for GPUs with Kubernetes can introduce overhead due to context switching, which may affect the performance of GPU-intensive operations. This strategy is therefore best suited to cases where strict isolation is not necessary and workloads do not strongly depend on extended GPU usage.
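
To get a rough feel for this overhead, the back-of-the-envelope model below may help. It assumes fixed-length slices, a fixed cost per context switch, and workloads that are always ready to run. The function names and the numbers in the example are hypothetical and are not measurements of any real GPU.

```python
def useful_gpu_fraction(slice_ms, switch_ms):
    """Fraction of GPU time spent on real work rather than on switching.

    Assumes one context switch of `switch_ms` after every slice of `slice_ms`.
    """
    if slice_ms <= 0 or switch_ms < 0:
        raise ValueError("slice_ms must be positive and switch_ms non-negative")
    return slice_ms / (slice_ms + switch_ms)


def per_workload_share(n_workloads, slice_ms, switch_ms):
    """Approximate share of one GPU that each of `n_workloads` receives."""
    if n_workloads < 1:
        raise ValueError("n_workloads must be at least 1")
    return useful_gpu_fraction(slice_ms, switch_ms) / n_workloads


# Hypothetical figures: 2 ms slices, 0.5 ms per context switch, 4 pods.
fraction = useful_gpu_fraction(slice_ms=2.0, switch_ms=0.5)
share = per_workload_share(n_workloads=4, slice_ms=2.0, switch_ms=0.5)
print(f"useful GPU time: {fraction:.0%}")
print(f"share per workload: {share:.0%}")
```

In this model, the shorter the slices are relative to the switch cost, the more GPU time is lost to switching, which is why long-running, GPU-bound jobs tend to benefit least from sharing.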