Saturday, March 6, 2021

How to monitor NVIDIA GPU performance metrics

Goal:

This article shares how to monitor NVIDIA GPU performance metrics when running a job. 

The most important metrics include GPU utilization (%), memory utilization (%), and inbound/outbound PCIe throughput.

Env:

Ubuntu 18.04

Quadro RTX 6000

Solution:

If we are running a Spark on GPU job, how do we monitor the NVIDIA GPU performance?

nvidia-smi has several options that can achieve that goal.

I ran the 2 commands below at the same time while a test job was running.

Both of them capture metrics every second for the GPU with index=0.

1. Option 1

nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 1 -i 0

Sample output:

utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
0 %, 0 %, 24220 MiB, 24209 MiB, 11 MiB
27 %, 0 %, 24220 MiB, 23953 MiB, 267 MiB
57 %, 0 %, 24220 MiB, 23989 MiB, 231 MiB
29 %, 1 %, 24220 MiB, 23941 MiB, 279 MiB
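
To post-process the CSV output from option 1 (for example after redirecting it to a file), something like the sketch below could work. The function name parse_query_gpu_csv and the SAMPLE string are made up for illustration, and the sketch assumes every value is numeric; cells such as "[N/A]" would need extra handling.

```python
import csv
import io


def parse_query_gpu_csv(text):
    """Parse `nvidia-smi --query-gpu=... --format=csv` output into dicts of ints."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = [row for row in reader if row]
    # Header cells look like "utilization.gpu [%]"; keep only the field name.
    fields = [cell.split(" [")[0].strip() for cell in rows[0]]
    samples = []
    for row in rows[1:]:
        # Value cells look like "27 %" or "24220 MiB"; keep only the number.
        # Assumption: every cell starts with an integer.
        values = [int(cell.split()[0]) for cell in row]
        samples.append(dict(zip(fields, values)))
    return samples


# Two rows copied from the sample output above.
SAMPLE = """utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
0 %, 0 %, 24220 MiB, 24209 MiB, 11 MiB
27 %, 0 %, 24220 MiB, 23953 MiB, 267 MiB
"""

if __name__ == "__main__":
    for s in parse_query_gpu_csv(SAMPLE):
        used_pct = 100.0 * s["memory.used"] / s["memory.total"]
        print(f'gpu={s["utilization.gpu"]}% used={s["memory.used"]} MiB ({used_pct:.1f}% of total)')
```

Computing memory.used / memory.total this way gives the share of device memory that is allocated, which is a different number from utilization.memory.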

2. Option 2

nvidia-smi dmon -i 0 -s mutc -d 1 -o TD

Sample output:

#Date       Time        gpu    fb  bar1    sm   mem   enc   dec rxpci txpci  mclk  pclk
#YYYYMMDD   HH:MM:SS    Idx    MB    MB     %     %     %     %  MB/s  MB/s   MHz   MHz
 20210306   22:58:19      0    11     4     0     0     0     0     0     0   405   300
 20210306   22:58:20      0   271     9    30     0     0     0   632  1506  6500  1440
 20210306   22:58:21      0   231     9    63     1     0     0 11184  1489  6500  2010
 20210306   22:58:22      0   279     9    32     1     0     0  2721  2768  6500  2010

"fb" is the on-board frame buffer memory in use, in MB, which is the so-called device memory. It matches "memory.used" in option 1, not "utilization.memory".

"mem" is the memory utilization, which matches "utilization.memory" in option 1. It is the percentage of time during which device memory was being read or written, not the share of memory that is allocated.

"sm" stands for Streaming Multiprocessor utilization, which matches "utilization.gpu" in option 1 (with a small time gap between the two samples).

"rxpci" and "txpci" are the PCIe Rx and Tx throughput in MB/s.

Please refer to "man nvidia-smi" for more options.
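
As a rough illustration of how the dmon output from option 2 could be consumed by a script, the sketch below turns it into a list of records keyed by the column names ("fb", "sm", "rxpci" and so on). The function name parse_dmon and the DMON_SAMPLE string are hypothetical, and the sketch assumes the "-o TD" layout shown above, with any non-numeric cell treated as "not available".

```python
def parse_dmon(text):
    """Parse `nvidia-smi dmon ... -o TD` output into a list of dicts."""
    names = None
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # The first '#' line carries the column names; the units line
            # and any later '#' lines are skipped.
            if names is None:
                names = line.lstrip("#").split()
            continue
        if names is None:
            continue  # data before any header: column names unknown
        sample = {}
        for name, value in zip(names, line.split()):
            if name in ("Date", "Time"):
                sample[name] = value
            else:
                # Assumption: a non-numeric cell means "not available".
                sample[name] = int(value) if value.isdigit() else None
        samples.append(sample)
    return samples


# Rows copied from the sample output of option 2.
DMON_SAMPLE = """#Date       Time        gpu    fb  bar1    sm   mem   enc   dec rxpci txpci  mclk  pclk
#YYYYMMDD HH:MM:SS Idx MB MB % % % % MB/s MB/s MHz MHz
20210306 22:58:20 0 271 9 30 0 0 0 632 1506 6500 1440
20210306 22:58:21 0 231 9 63 1 0 0 11184 1489 6500 2010
"""

if __name__ == "__main__":
    samples = parse_dmon(DMON_SAMPLE)
    # Find the sample with the highest inbound PCIe throughput.
    peak = max(samples, key=lambda s: s["rxpci"] or 0)
    print(f'peak rxpci: {peak["rxpci"]} MB/s at {peak["Time"]} (sm={peak["sm"]}%)')
```

Splitting on whitespace keeps the sketch independent of the exact column widths, so it works on both the aligned and the collapsed form of the output.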






