HomeComputeGPU InstancesTroubleshooting
Fixing GPU issues after upgrading GPU Instances with cloud-init

Update content

Fixing GPU issues after upgrading GPU Instances with cloud-init

Published on 19 September 2022

When running the upgrade_package command with cloud-init the Nvidia drivers may break causing the GPU to become unavailable. This problem persists even after a manual reboot of the Instance and the following error displays:

The GPU is not usable. The driver not loaded, with the following errors in the system journal

[FAILED] Failed to start NVIDIA Persistence Daemon.

This error is caused by the way the upgrade_package command handles the upgrade of the packages installed on the system. It runs apt-get dist-upgrade instead of apt upgrade.

To avoid this issue, use the following cloud-init script with your GPU Instances:

#cloud-config

system_info:
apt_get_upgrade_subcommand: "upgrade"

# Upgrade the instance on first boot and reboot if needed
package_upgrade: true
package_reboot_if_required: true