Monitor GPU Instances using Cockpit and the NVIDIA Data Center GPU Manager (DCGM) Exporter
This tutorial guides you through the process of monitoring your GPU Instances using Cockpit and the NVIDIA Data Center GPU Manager (DCGM) Exporter. Visualize your GPU Instances' metrics and ensure optimal performance and usage of your resources.
Before you start
To complete the actions presented below, you must have:
- A Scaleway account logged into the console
- Owner status or IAM permissions allowing you to perform actions in the intended Organization
- Created a GPU Instance
- Connected to your Instance via SSH
- Installed Docker Engine and Docker Compose on your GPU Instance
Create a Cockpit data source and credentials
Create a Cockpit data source
Your GPU Instance's metrics will be stored in a Cockpit data source, and the exporter agent needs that data source's configuration information (such as its URL) to export your Instance's metrics to it.
- Create a metrics custom data source in Cockpit. For the sake of this tutorial, we will name it gpu-instance-metrics.
- Click your metrics data source to view information such as its URL and push path.
Create a token
- Create a Cockpit token from the Scaleway console.
- Select a region for the data source.
- Tick the Push Metrics box and click Create token to confirm.
Collect metrics from your GPU Instance
Install the NVIDIA DCGM Exporter, node exporter and Grafana Alloy agent on your GPU Instance
- Copy and paste the following command to create a configuration file named config.alloy in your Instance:

    touch config.alloy
- Copy and paste the following template inside config.alloy:

    prometheus.remote_write "cockpit" {
      endpoint {
        url = "https://example-afc6-4d02-a2fd-bc020bbaa7d0.metrics.cockpit.fr-par.scw.cloud/api/v1/push"
        headers = {
          "X-TOKEN" = "example_bKNpXZZP6BSKiYzV8fiQL1yR_kP_VLB-h0tpYAkaNoVTHVm8q",
        }
      }
    }

    prometheus.scrape "dcgm_exporter" {
      scrape_interval = "60s"
      targets = [{__address__ = "dcgm_exporter:9400"}]
      forward_to = [prometheus.remote_write.cockpit.receiver]
    }

    prometheus.exporter.unix "node_exporter" {
      set_collectors = [
        "uname",
        "cpu",
        "cpufreq",
        "loadavg",
        "meminfo",
        "filesystem",
        "netdev",
      ]
    }

    prometheus.scrape "node_exporter" {
      scrape_interval = "60s"
      targets = prometheus.exporter.unix.node_exporter.targets
      forward_to = [prometheus.remote_write.cockpit.receiver]
    }
- Replace the values of cockpit.endpoint.url (https://example-afc6-4d02-a2fd-bc020bbaa7d0.metrics.cockpit.fr-par.scw.cloud/api/v1/push) and cockpit.endpoint.headers.X-TOKEN (example_bKNpXZZP6BSKiYzV8fiQL1yR_kP_VLB-h0tpYAkaNoVTHVm8q) with the push URL of your gpu-instance-metrics Cockpit data source and the token you created for it.

  This configuration allows you to:
  - collect performance data (using dcgm_exporter) from your GPU Instance. This includes information like GPU load (how much of the GPU's processing power is being used), temperature, and other relevant metrics.
  - collect standard Instance metrics with node_exporter (CPU load, disk size, etc.).
  - push the collected data to your Cockpit data source (using cockpit).
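Forgetting to replace one of the two example values is an easy mistake at this step. The following is a minimal sketch of a check you could run before starting the services. The helper name `find_placeholders` is hypothetical, and the check assumes the placeholders still begin with `example`, as they do in the template above.

```python
import re

# Patterns for the two example values shipped in the config.alloy template.
# Assumption: untouched placeholders start with "example-" (URL host) or
# "example_" (token), as in the template shown in this tutorial.
PLACEHOLDER_PATTERNS = {
    "url": re.compile(r'https://example-[^\s"]+'),
    "X-TOKEN": re.compile(r'"example_[^"]*"'),
}


def find_placeholders(config_text):
    """Return the names of the template values that were not replaced."""
    leftovers = []
    for name, pattern in PLACEHOLDER_PATTERNS.items():
        if pattern.search(config_text):
            leftovers.append(name)
    return leftovers


# Example usage on the Instance (reads the file created earlier):
#
#     with open("config.alloy", encoding="utf-8") as handle:
#         print(find_placeholders(handle.read()) or "no placeholders left")
```

An empty result only means the example values are gone. It does not confirm that the URL or the token you entered is valid.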
- Copy and paste the following command to create a docker-compose.yaml file in your Instance:

    touch docker-compose.yaml
- Copy and paste the following configuration inside docker-compose.yaml, save it and exit the file.

    services:
      dcgm_exporter:
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04
        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  count: all
                  capabilities: [ gpu ]
        cap_add:
          - SYS_ADMIN
        ports:
          - "9400:9400"
      agent:
        image: grafana/alloy:latest
        ports:
          - "12345:12345"
        volumes:
          - "./config.alloy:/etc/alloy/config.alloy"
        command: [
          "run",
          "--server.http.listen-addr=0.0.0.0:12345",
          "/etc/alloy/config.alloy",
        ]

This configuration will:
- deploy the DCGM exporter
- deploy the Grafana Alloy agent
- Run the Docker services using the following command:

    docker compose up
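Once the services are running, you may want to confirm that the DCGM exporter is serving metrics before you look for them in Cockpit. The sketch below assumes the exporter is reachable at `http://localhost:9400/metrics` from the Instance, which follows from the `9400:9400` port mapping in the compose file. The function names `fetch_metrics` and `extract_samples` are hypothetical, and the parser is deliberately simplified: it assumes label values do not contain a `}` character.

```python
from urllib.request import urlopen


def fetch_metrics(url="http://localhost:9400/metrics", timeout=5.0):
    """Download the exporter's metrics page as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def extract_samples(exposition_text, metric_name):
    """Return (labels, value) pairs for one metric from Prometheus text output.

    Simplified parser: comment lines are skipped, and the first field after
    the metric name (and its optional label set) is read as the value.
    """
    samples = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith(metric_name + "{"):
            # Split "labels} value" at the last closing brace.
            labels, _, rest = line[len(metric_name) + 1:].rpartition("}")
        elif line.startswith(metric_name + " "):
            labels, rest = "", line[len(metric_name):]
        else:
            continue
        fields = rest.split()
        if fields:
            samples.append((labels, float(fields[0])))
    return samples


# Example usage on the Instance, while the services are running:
#
#     for labels, value in extract_samples(fetch_metrics(), "DCGM_FI_DEV_GPU_TEMP"):
#         print(labels, value)
```

If the list comes back empty, check the `dcgm_exporter` container logs before investigating the Alloy agent or the Cockpit data source.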
Create Cockpit dashboards in Grafana
Create a GPU metrics dashboard
- Access the Overview tab of your Cockpit and click Open dashboards to open your Cockpit dashboards in Grafana.
- Click the + icon in the top-right-hand corner, then click Import dashboard.
- Copy the ID (12219) of the Grafana NVIDIA DCGM Exporter dashboard and paste it in the Import via grafana.com field.
- Click Load.
- Select your Prometheus data source named gpu-instance-metrics, then click Import.
You should see your dashboard with data such as GPU Temperature or GPU Power Usage.
Create a CPU and disk metrics Cockpit dashboard in Grafana
- Access the Overview tab of your Cockpit and click Open dashboards to open your Cockpit dashboards in Grafana.
- Click the + icon in the top-right-hand corner, then click Import dashboard.
- Copy the ID (1860) of the Node Exporter Full dashboard and paste it in the Import via grafana.com field.
- Click Load.
- Select your Prometheus data source named gpu-instance-metrics, then click Import.
You should now see your dashboard with data such as CPU usage and memory usage.
You can now find your newly created dashboards in your list of Cockpit dashboards in Grafana. This allows you to access your GPU Instances data to monitor and optimize your resources.
Going further
- Add more metrics to your dashboards
  - Connect to your GPU Instance via SSH
  - Edit the config.alloy file and restart the agents using the docker compose up command
  - Update your Cockpit dashboards in Grafana
- Create custom dashboards
  - In Grafana, explore the metrics you have sent by clicking the Menu icon on the left, then Explore.
  - Select your custom data source named gpu-instance-metrics in the Datasource dropdown list located in the top-left-hand corner.
  - Click Metrics browser. You should see a list of metrics appear (for example, DCGM_FI_DEV_GPU_TEMP or node_cpu_seconds_total).
  - Write the desired query, click Run query to visualize data, and then Add to dashboard to add it to a new or existing dashboard.