Goal:

This article explains how to use NVIDIA Nsight Systems to profile a Spark job on Rapids Accelerator.

Env:

Spark 3.1.1 (Standalone Cluster)

RAPIDS Accelerator for Apache Spark 0.5 snapshot

cuDF jar 0.19 snapshot

Solution:

1. Build the cuDF JARs with USE_NVTX option on.

Follow Doc: https://nvidia.github.io/spark-rapids/docs/dev/nvtx_profiling.html

Note: Starting from cuDF 0.19, the USE_NVTX(NVIDIA Tools Extension) is on by default as per this PR so we do not need to build jar any more. It means in the future cuDF release(>=0.19) we can skip this step.

So here in this test, I just used the latest cuDF 0.19 snapshot jar and Rapids Accelerator 0..5 snapshot jar(built from source code manually) together. Note: these 2 jars are not stable releases.

2. Download nsight systems on your client machine

https://developer.nvidia.com/nsight-systems

Here I downloaded and installed on Mac where I will view the metrics later.

3. Make sure target machine has nsys installed and meet requirements.

Please refer to https://docs.nvidia.com/nsight-systems/InstallationGuide/index.html for details.

Especially make sure the "Requirement" is met. Such as:

Use of Linux Perf: To collect thread scheduling data and IP (instruction pointer) samples, the Perf paranoid level on the target system must be 2 or less.

You can use "nsys status -e" to check the current status:

$  nsys status -e

Sampling Environment Check
Linux Kernel Paranoid Level = 3: Fail
Linux Distribution = Ubuntu
Linux Kernel Version = 5.4.0-70: OK
Linux perf_event_open syscall available: Fail
Sampling trigger event available: Fail
Intel(c) Last Branch Record support: Not Available
Sampling Environment: Fail

See the product documentation for more information.

If the Kernel Paranoid Level check failed, then we can use below commands to check and enable it:

$ cat /proc/sys/kernel/perf_event_paranoid
3
$ sudo sh -c 'echo 2 >/proc/sys/kernel/perf_event_paranoid'
$ cat /proc/sys/kernel/perf_event_paranoid
2
$ sudo sh -c 'echo kernel.perf_event_paranoid=2 > /etc/sysctl.d/local.conf'
$ nsys status -e

Sampling Environment Check
Linux Kernel Paranoid Level = 2: OK
Linux Distribution = Ubuntu
Linux Kernel Version = 5.4.0-70: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Available
Sampling Environment: OK

Note: there are other requirements like kernel version, glibc version, supported CUDA version. Please refer to above documentation.

4. Add extra java options in both driver and executor.

--conf spark.driver.extraJavaOptions=-Dai.rapids.cudf.nvtx.enabled=true
--conf spark.executor.extraJavaOptions=-Dai.rapids.cudf.nvtx.enabled=true

You can consider putting those into spark-defaults.conf or specifying them each time for spark-shell/spark-sql/etc.

If you have other extraJavaOption(s), do not forget to append them.

5. Start spark-shell using "nsys profile"

nsys profile bash -c " \
CUDA_VISIBLE_DEVICES=0 ${SPARK_HOME}/sbin/start-slave.sh $master_url & \
$SPARK_HOME/bin/spark-shell; \
${SPARK_HOME}/sbin/stop-slave.sh"

6. Run some query

When quitting spark-shell, it will generate a *.qdrep file in current directory.

For example:

scala> :quit
:quit
stopping org.apache.spark.deploy.history.HistoryServer
stopping org.apache.spark.deploy.worker.Worker
stopping org.apache.spark.deploy.master.Master
Processing events...
Capturing symbol files...
Saving temporary "/tmp/nsys-report-58cb-6240-1a5f-e6f7.qdstrm" file to disk...
Creating final output files...

Processing [==============================================================100%]
Saved report file to "/tmp/nsys-report-58cb-6240-1a5f-e6f7.qdrep"
Report file moved to "/home/xxx/report1.qdrep"

7. Use "nsys stat" command on the target machine to check the report

You can choose to use "nsys stat" command on the target machine to check the report or use following GUI option.

"nsys stat" can show the CUDA API summary, GPU Kernel summary, GPU Memory time summary, NVTX push-pop range summary, etc:

$  nsys stats report8.qdrep
Using report8.sqlite file for stats and reports.
Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/cudaapisum report8.sqlite] to console...

 Time(%)  Total Time (ns)  Num Calls      Average         Minimum        Maximum                Name
 -------  ---------------  ---------  ---------------  -------------  -------------  --------------------------
    66.8  152,391,401,099    192,250        792,673.1            679     18,448,141  cudaStreamSynchronize_ptsz
    31.2   71,169,590,822    114,830        619,782.2            195      9,667,534  cudaMemcpyAsync_ptsz
     0.7    1,565,365,626          7    223,623,660.9          3,454  1,565,334,856  cudaFree
     0.5    1,117,531,408     65,671         17,017.1          3,496        131,888  cudaLaunchKernel_ptsz
...
Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/gpukernsum report8.sqlite] to console...

 Time(%)  Total Time (ns)  Instances    Average      Minimum     Maximum                                                    Name
 -------  ---------------  ---------  ------------  ----------  ----------  ----------------------------------------------------------------------------------------------------
    37.5   83,645,234,788     14,576   5,738,558.9   5,554,755   6,897,949  void (anonymous namespace)::scatter_kernel<int, (anonymous namespace)::boolean_mask_filter<false>, …
    28.2   62,805,133,776      7,288   8,617,608.9   8,459,988   8,955,404  void cudf::binops::jit::kernel_v_v<bool, int, int, cudf::binops::jit::Greater>(int, bool*, int*, in…
    18.8   41,854,794,778      7,288   5,742,974.0   5,634,787   5,984,609  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
     8.7   19,342,375,816      7,289   2,653,639.2   2,575,613   2,869,850  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
...
Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/gpumemtimesum report8.sqlite] to console...

 Time(%)  Total Time (ns)  Operations  Average   Minimum  Maximum      Operation
 -------  ---------------  ----------  --------  -------  -------  ------------------
    47.8       78,733,508      82,908     949.6      608  610,013  [CUDA memcpy DtoH]
    35.7       58,761,119      80,174     732.9      640   13,792  [CUDA memset]
    16.4       26,979,351      31,900     845.7      671  662,844  [CUDA memcpy HtoD]
     0.1          136,064           8  17,008.0    1,632   32,640  [CUDA memcpy DtoD]
Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/gpumemsizesum report8.sqlite] to console...

   Total     Operations   Average   Minimum   Maximum       Operation
 ----------  ----------  ---------  -------  ---------  ------------------
 37,577.836      31,900      1.178    0.004  7,813.324  [CUDA memcpy HtoD]
 32,226.750           8  4,028.344  244.187  7,812.500  [CUDA memcpy DtoD]
 24,145.266      82,908      0.291    0.001  7,812.500  [CUDA memcpy DtoH]
 16,326.898      80,174      0.204    0.001  7,812.500  [CUDA memset]
 ...
 Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/nvtxppsum report8.sqlite] to console...

 Time(%)  Total Time (ns)  Instances      Average         Minimum        Maximum                  Range
 -------  ---------------  ---------  ---------------  -------------  -------------  -------------------------------
    41.0  209,116,856,965     10,002     20,907,504.2        117,938     23,476,086  libcudf:apply_boolean_mask
    41.0  209,039,719,367     10,002     20,899,792.0        116,416     23,467,375  libcudf:copy_if
    16.7   85,273,533,436     10,000      8,527,353.3      8,431,597     13,684,934  libcudf:cross_join
...

8. Copy the *.qdrep to the client machine where nsight systems is installed.

Open the *.qdrep using nsight systems.

My query in above #5 is a cross-join which takes around 6mins.

Normally I will firstly "Analysis Summary" tab to get the PID of Spark Executor(24897) which would be my focus.

Then move to "Timeline view" tab and identify Spark Executor process:

As we can see the CUDA HW(GPU) is showing busy(blue) for most of the time.

If we hover mouse on it, it can show you the CUDA Kernel running% at that time:

We can dig further into all threads of Spark Executor process, and we can identify the Executor Task 1 thread keeps calling CUDA API during that time.

And most importantly, here the "libcudf" and "NVTX(libcudf)" rows will show up.

Note:They will NOT show up if "NVTX" is not switched on when building cuDF jar.

Here "libcudf" row shows "cross_join" which match our query type.

"NVTX(libcudf)" row shows similar things under "CUDA HW" section:

Tips:

1. One useful tip is to pin the related rows and compare:

After those rows got pinned, if you scroll down/up, they will always be on top or at bottom, such as:

2. Change the time from "session time" to "global time"

After that, it will show machine time which can help you match the real world time.

3. How to start/stop collection manually

We can firstly "nsys launch" the Spark worker/slave, and then use "nsys start" and "nsys stop" to control the collection window manually.

a. Stop spark slaves manually

${SPARK_HOME}/sbin/stop-slave.sh

b. Start spark slaves using "nsys launch"

nsys launch bash -c "CUDA_VISIBLE_DEVICES=0 $SPARK_HOME/sbin/start-slave.sh spark://$HOSTNAME:7077 &"

c. Open another terminal session, run "nsys start"

$  nsys start
$  nsys sessions list
              ID         TIME                STATE LAUNCH NAME
         1028142        00:51           Collecting      1 [default]

d. Run a Spark job using either spark-shell or spark-submit or something else.

e. Run "nsys stop" after the Spark job completes

$  nsys stop
Processing events...
Capturing symbol files...
Saving temporary "/tmp/nsys-report-4026-c2c5-8a18-5372.qdstrm" file to disk...
Creating final output files...

Processing [==============================================================100%]
Saved report file to "/tmp/nsys-report-4026-c2c5-8a18-5372.qdrep"
Report file moved to "/home/xxx/report10.qdrep"
stop executed

f. You can start&stop more collection windows.

g. Stop Spark-worker in the end.

Thursday, April 8, 2021

How to use NVIDIA Nsight Systems to profile a Spark job on Rapids Accelerator