Thursday, April 8, 2021

How to use NVIDIA Nsight Systems to profile a Spark job on Rapids Accelerator

Goal:

This article explains how to use NVIDIA Nsight Systems to profile a Spark job on Rapids Accelerator.

Env:

Spark 3.1.1 (Standalone Cluster)

RAPIDS Accelerator for Apache Spark 0.5 snapshot

cuDF jar 0.19 snapshot

Solution:

1. Build the cuDF JARs with USE_NVTX option on.

Follow Doc: https://nvidia.github.io/spark-rapids/docs/dev/nvtx_profiling.html

Note: Starting from cuDF 0.19, the USE_NVTX(NVIDIA Tools Extension) is on by default as per this PR so we do not need to build jar any more.  It means in the future cuDF release(>=0.19) we can skip this step.

So here in this test, I just used the latest cuDF 0.19 snapshot jar and Rapids Accelerator 0..5 snapshot jar(built from source code manually) together. Note: these 2 jars are not stable releases.

2.  Download nsight systems on your client machine

https://developer.nvidia.com/nsight-systems

Here I downloaded and installed on Mac where I will view the metrics later.

3. Make sure target machine has nsys installed and meet requirements.

Especially make sure the "Requirement" is met. Such as:
Use of Linux Perf: To collect thread scheduling data and IP (instruction pointer) samples, the Perf paranoid level on the target system must be 2 or less. 
You can use "nsys status -e" to check the current status:
$  nsys status -e

Sampling Environment Check
Linux Kernel Paranoid Level = 3: Fail
Linux Distribution = Ubuntu
Linux Kernel Version = 5.4.0-70: OK
Linux perf_event_open syscall available: Fail
Sampling trigger event available: Fail
Intel(c) Last Branch Record support: Not Available
Sampling Environment: Fail

See the product documentation for more information.
If the Kernel Paranoid Level check failed, then we can use below commands to check and enable it:
$ cat /proc/sys/kernel/perf_event_paranoid
3
$ sudo sh -c 'echo 2 >/proc/sys/kernel/perf_event_paranoid'
$ cat /proc/sys/kernel/perf_event_paranoid
2
$ sudo sh -c 'echo kernel.perf_event_paranoid=2 > /etc/sysctl.d/local.conf'
$ nsys status -e

Sampling Environment Check
Linux Kernel Paranoid Level = 2: OK
Linux Distribution = Ubuntu
Linux Kernel Version = 5.4.0-70: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Available
Sampling Environment: OK
Note: there are other requirements like kernel version, glibc version, supported CUDA version. Please refer to above documentation.

4. Add extra java options in both driver and executor.

--conf spark.driver.extraJavaOptions=-Dai.rapids.cudf.nvtx.enabled=true
--conf spark.executor.extraJavaOptions=-Dai.rapids.cudf.nvtx.enabled=true

You can consider putting those into spark-defaults.conf or specifying them each time for spark-shell/spark-sql/etc.

If you have other extraJavaOption(s), do not forget to append them.

5. Start spark-shell using "nsys profile"

nsys profile bash -c " \
CUDA_VISIBLE_DEVICES=0 ${SPARK_HOME}/sbin/start-slave.sh $master_url & \
$SPARK_HOME/bin/spark-shell; \
${SPARK_HOME}/sbin/stop-slave.sh"

6. Run some query

When quitting spark-shell, it will generate a *.qdrep file in current directory.

For example:

scala> :quit
:quit
stopping org.apache.spark.deploy.history.HistoryServer
stopping org.apache.spark.deploy.worker.Worker
stopping org.apache.spark.deploy.master.Master
Processing events...
Capturing symbol files...
Saving temporary "/tmp/nsys-report-58cb-6240-1a5f-e6f7.qdstrm" file to disk...
Creating final output files...

Processing [==============================================================100%]
Saved report file to "/tmp/nsys-report-58cb-6240-1a5f-e6f7.qdrep"
Report file moved to "/home/xxx/report1.qdrep"

7. Use "nsys stat" command on the target machine to check the report

You can choose to use "nsys stat" command on the target machine to check the report or use following GUI option.
"nsys stat" can show the CUDA API summary, GPU Kernel summary, GPU Memory time summary, NVTX push-pop range summary, etc:
$  nsys stats report8.qdrep
Using report8.sqlite file for stats and reports.
Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/cudaapisum report8.sqlite] to console...

Time(%) Total Time (ns) Num Calls Average Minimum Maximum Name
------- --------------- --------- --------------- ------------- ------------- --------------------------
66.8 152,391,401,099 192,250 792,673.1 679 18,448,141 cudaStreamSynchronize_ptsz
31.2 71,169,590,822 114,830 619,782.2 195 9,667,534 cudaMemcpyAsync_ptsz
0.7 1,565,365,626 7 223,623,660.9 3,454 1,565,334,856 cudaFree
0.5 1,117,531,408 65,671 17,017.1 3,496 131,888 cudaLaunchKernel_ptsz
...
Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/gpukernsum report8.sqlite] to console...

Time(%) Total Time (ns) Instances Average Minimum Maximum Name
------- --------------- --------- ------------ ---------- ---------- ----------------------------------------------------------------------------------------------------
37.5 83,645,234,788 14,576 5,738,558.9 5,554,755 6,897,949 void (anonymous namespace)::scatter_kernel<int, (anonymous namespace)::boolean_mask_filter<false>, …
28.2 62,805,133,776 7,288 8,617,608.9 8,459,988 8,955,404 void cudf::binops::jit::kernel_v_v<bool, int, int, cudf::binops::jit::Greater>(int, bool*, int*, in…
18.8 41,854,794,778 7,288 5,742,974.0 5,634,787 5,984,609 void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
8.7 19,342,375,816 7,289 2,653,639.2 2,575,613 2,869,850 void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
...
Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/gpumemtimesum report8.sqlite] to console...

Time(%) Total Time (ns) Operations Average Minimum Maximum Operation
------- --------------- ---------- -------- ------- ------- ------------------
47.8 78,733,508 82,908 949.6 608 610,013 [CUDA memcpy DtoH]
35.7 58,761,119 80,174 732.9 640 13,792 [CUDA memset]
16.4 26,979,351 31,900 845.7 671 662,844 [CUDA memcpy HtoD]
0.1 136,064 8 17,008.0 1,632 32,640 [CUDA memcpy DtoD]
Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/gpumemsizesum report8.sqlite] to console...

Total Operations Average Minimum Maximum Operation
---------- ---------- --------- ------- --------- ------------------
37,577.836 31,900 1.178 0.004 7,813.324 [CUDA memcpy HtoD]
32,226.750 8 4,028.344 244.187 7,812.500 [CUDA memcpy DtoD]
24,145.266 82,908 0.291 0.001 7,812.500 [CUDA memcpy DtoH]
16,326.898 80,174 0.204 0.001 7,812.500 [CUDA memset]
...
Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/nvtxppsum report8.sqlite] to console...

Time(%) Total Time (ns) Instances Average Minimum Maximum Range
------- --------------- --------- --------------- ------------- ------------- -------------------------------
41.0 209,116,856,965 10,002 20,907,504.2 117,938 23,476,086 libcudf:apply_boolean_mask
41.0 209,039,719,367 10,002 20,899,792.0 116,416 23,467,375 libcudf:copy_if
16.7 85,273,533,436 10,000 8,527,353.3 8,431,597 13,684,934 libcudf:cross_join
...

8. Copy the *.qdrep to the client machine where nsight systems is installed.

Open the *.qdrep using nsight systems.

My query in above #5 is a cross-join which takes around 6mins.

Normally I will firstly "Analysis Summary" tab to get the PID of Spark Executor(24897) which would be my focus.



Then move to "Timeline view" tab and identify Spark Executor process:

As we can see the CUDA HW(GPU) is showing busy(blue) for most of the time.

If we hover mouse on it, it can show you the CUDA Kernel running% at that time:

We can dig further into all threads of Spark Executor process, and we can identify the Executor Task 1 thread keeps calling CUDA API during that time. 

And most importantly, here the "libcudf" and "NVTX(libcudf)" rows will show up.  

Note:They will NOT show up if "NVTX" is not switched on when building cuDF jar.

Here "libcudf" row shows "cross_join" which match our query type.

 "NVTX(libcudf)" row shows similar things under "CUDA HW" section:


Tips:

1. One useful tip is to pin the related rows and compare:

After those rows got pinned, if you scroll down/up, they will always be on top or at bottom, such as:

2. Change the time from "session time" to "global time"

After that, it will show machine time which can help you match the real world time.

3.  How to start/stop collection manually

We can firstly "nsys launch" the Spark worker/slave, and then use "nsys start" and "nsys stop" to control the collection window manually.
a.  Stop spark slaves manually
${SPARK_HOME}/sbin/stop-slave.sh
b. Start spark slaves using "nsys launch"
nsys launch bash -c "CUDA_VISIBLE_DEVICES=0 $SPARK_HOME/sbin/start-slave.sh spark://$HOSTNAME:7077 &"
c. Open another terminal session, run "nsys start"
$  nsys start
$ nsys sessions list
ID TIME STATE LAUNCH NAME
1028142 00:51 Collecting 1 [default]
d. Run a Spark job using either spark-shell or spark-submit  or something else.
e. Run "nsys stop" after the Spark job completes
$  nsys stop
Processing events...
Capturing symbol files...
Saving temporary "/tmp/nsys-report-4026-c2c5-8a18-5372.qdstrm" file to disk...
Creating final output files...

Processing [==============================================================100%]
Saved report file to "/tmp/nsys-report-4026-c2c5-8a18-5372.qdrep"
Report file moved to "/home/xxx/report10.qdrep"
stop executed
f. You can start&stop more collection windows.
g. Stop Spark-worker in the end.

References:

 

 

 

 

 

 

No comments:

Post a Comment

Popular Posts