Monday, April 12, 2021

How to use NVIDIA Nsight Systems to profile a Spark on K8s job with Rapids Accelerator

Goal:

This article explains how to use NVIDIA Nsight Systems to profile a Spark on K8s job with Rapids Accelerator.

This is a follow-up to the earlier blog How to use NVIDIA Nsight Systems to profile a Spark job on Rapids Accelerator.

Env:

Spark 3.1.1 (on Kubernetes)

RAPIDS Accelerator for Apache Spark 0.5 snapshot

cuDF jar 0.19 snapshot

Solution:

Please read the How to use NVIDIA Nsight Systems to profile a Spark job on Rapids Accelerator blog and the Getting Started with RAPIDS and Kubernetes doc first.

This blog will mainly focus on the differences for a Spark on Kubernetes job.

1. Spark side

As covered in the previous blog, "nsys profile" should wrap the Spark executor process. So the key is to find out how Spark starts an executor in a Kubernetes cluster.

The relevant script is resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh:

  executor)
    shift 1
    CMD=(
      ${JAVA_HOME}/bin/java
      "${SPARK_EXECUTOR_JAVA_OPTS[@]}"
      -Xms$SPARK_EXECUTOR_MEMORY
      -Xmx$SPARK_EXECUTOR_MEMORY
      -cp "$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH"
      org.apache.spark.executor.CoarseGrainedExecutorBackend
      --driver-url $SPARK_DRIVER_URL
      --executor-id $SPARK_EXECUTOR_ID
      --cores $SPARK_EXECUTOR_CORES
      --app-id $SPARK_APPLICATION_ID
      --hostname $SPARK_EXECUTOR_POD_IP
      --resourceProfileId $SPARK_RESOURCE_PROFILE_ID
    )
...

# Execute the container CMD under tini for better hygiene
exec /usr/bin/tini -s -- "${CMD[@]}"

So we just need to prepend "nsys profile" to the CMD array. For example:

  executor)
    shift 1
    CMD=(
      nsys profile -o /some_persistent_storage/test_%h_%p.qdrep
      ${JAVA_HOME}/bin/java
      "${SPARK_EXECUTOR_JAVA_OPTS[@]}"
      -Xms$SPARK_EXECUTOR_MEMORY
      -Xmx$SPARK_EXECUTOR_MEMORY
      -cp "$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH"
      org.apache.spark.executor.CoarseGrainedExecutorBackend
      --driver-url $SPARK_DRIVER_URL
      --executor-id $SPARK_EXECUTOR_ID
      --cores $SPARK_EXECUTOR_CORES
      --app-id $SPARK_APPLICATION_ID
      --hostname $SPARK_EXECUTOR_POD_IP
      --resourceProfileId $SPARK_RESOURCE_PROFILE_ID
    )
    ;;

Here we point the output file to a persistent storage path that is mounted in the Docker container, so the report survives executor pod termination.

"%h" means hostname and "%p" means PID. For more details please refer to the Nsight Systems User Guide.

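One way to make such a persistent storage path available inside every executor pod is Spark's built-in persistentVolumeClaim volume support. Below is a minimal sketch, assuming a pre-created PVC; the volume name "nsys" and claim name "nsys-reports" are illustrative placeholders:

```shell
# Mount an existing PVC (hypothetical claim: nsys-reports) into each executor
# at /some_persistent_storage so the .qdrep files outlive the pods.
$SPARK_HOME/bin/spark-submit \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nsys.options.claimName=nsys-reports \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nsys.mount.path=/some_persistent_storage \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nsys.mount.readOnly=false \
  ...
```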
2. Docker image side

If you are using the Dockerfile.cuda, it actually uses nvidia/cuda:10.1-devel-ubuntu18.04 as the base image. However, this base image does not have Nsight Systems installed.

You need to either use your own base image which has Nsight Systems installed, or add the installation steps into Dockerfile.cuda.

Below is one example that installs Nsight Systems from the CUDA 11.0.3 local repo:

# Install Nsight-systems
RUN apt-get update && apt-get install -y wget && wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
RUN mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
RUN wget https://developer.download.nvidia.com/compute/cuda/11.0.3/local_installers/cuda-repo-ubuntu1804-11-0-local_11.0.3-450.51.06-1_amd64.deb
RUN dpkg --install cuda-repo-ubuntu1804-11-0-local_11.0.3-450.51.06-1_amd64.deb
RUN apt-key add /var/cuda-repo-ubuntu1804-11-0-local/7fa2af80.pub
RUN apt-get update && apt-get install -y nsight-systems-2020.4.3
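Before pushing the image, it is worth confirming that nsys actually ends up on the PATH in the final image. A quick sanity check (the image name and tag below are just examples):

```shell
# Build from the modified Dockerfile.cuda, then run nsys inside the image
docker build -t YOUR_REPO/spark-rapids-nsys:0.5 -f Dockerfile.cuda .
docker run --rm YOUR_REPO/spark-rapids-nsys:0.5 nsys --version
```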

3. Build & Upload the Docker Image and Run the Spark on K8s Job

The remaining steps are the same as in the Getting Started with RAPIDS and Kubernetes doc.
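For reference, here is a sketch of the final push and submit. All names, the API server address, and resource amounts are placeholders to adapt from the Getting Started doc and your own cluster:

```shell
# Push the image built from the modified Dockerfile.cuda (example tag)
docker push YOUR_REPO/spark-rapids-nsys:0.5

# Submit the job; every executor started via entrypoint.sh is now
# wrapped by "nsys profile" and writes a .qdrep report per executor.
$SPARK_HOME/bin/spark-submit \
  --master k8s://https://YOUR_K8S_APISERVER:6443 \
  --deploy-mode cluster \
  --name spark-rapids-nsys-test \
  --conf spark.kubernetes.container.image=YOUR_REPO/spark-rapids-nsys:0.5 \
  --conf spark.executor.instances=1 \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --class YOUR_MAIN_CLASS \
  local:///opt/spark/examples/jars/YOUR_APP.jar
```

Once the job finishes, the test_%h_%p.qdrep files on the persistent storage can be opened in the Nsight Systems GUI, as described in the previous blog.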
