Tuesday, March 23, 2021

How to run the pandas cudf_udf test for RAPIDS Accelerator for Apache Spark

Goal:

How to run the pandas cudf_udf test for RAPIDS Accelerator for Apache Spark.

Env:

RAPIDS Accelerator for Apache Spark 0.4

Spark 3.1.1

Solution:

1. Compile RAPIDS Accelerator for Apache Spark

1.a Create a conda env for compiling

conda create -n cudftest -c conda-forge python=3.8 pytest pandas pyarrow sre_yield pytest-xdist findspark

Here I use one conda env, "cudftest", for compiling and a separate conda env named "rapids-0.18" to run the cudf_udf tests in Spark.

Of course you can use a single conda env for both, but it may end up containing too many Python packages.

I want to keep the conda env "rapids-0.18" as small as possible, because eventually I need to distribute it to all executors in the Spark cluster.

1.b Compile from source code

cd ~/github/spark-rapids
# git checkout v0.4.0
mvn clean install -DskipTests

You can decide which version to compile. Here I am compiling the 0.5.0-SNAPSHOT, which is the current main branch; the current GA release is 0.4.

2. Run pandas cudf_udf Tests

Please follow this doc on how to enable the pandas cudf_udf tests.

Basically, the pandas cudf_udf tests are launched through "./integration_tests/runtests.py" with the option "--cudf_udf".
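For context on what these tests exercise: a pandas UDF operates on whole batches of rows as pandas Series rather than row by row. Below is a minimal sketch of a pandas UDF body, shown with plain pandas and no Spark session; the function name is hypothetical, not one from the test suite.

```python
import pandas as pd

# A pandas UDF body receives a batch of rows as a pandas Series
# and returns a Series of the same length (vectorized execution).
def plus_one(batch: pd.Series) -> pd.Series:
    return batch + 1

# In PySpark this would be registered via
# pyspark.sql.functions.pandas_udf("long")(plus_one); here we call
# it directly on a Series just to show the contract.
print(plus_one(pd.Series([1, 2, 3])).tolist())  # → [2, 3, 4]
```

With the RAPIDS Accelerator, such UDFs can exchange data with the JVM in a columnar format, which is what the cudf_udf tests verify.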

The key is to make sure all the Python env and needed jar file paths are correct.
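Since a wrong jar path is the most common cause of failure, a small helper (hypothetical, not part of the test suite) can verify each path exists before you build the --jars string:

```python
import os

def missing_jars(paths):
    """Return the subset of jar paths that do not exist on disk."""
    return [p for p in paths if not os.path.isfile(p)]

# Example with jars like those used below (adjust to your checkout):
jars = [
    "/home/xxx/github/spark-rapids/dist/target/rapids-4-spark_2.12-0.5.0-SNAPSHOT.jar",
    "/home/xxx/spark/rapids/cudf.jar",
]
for p in missing_jars(jars):
    print("missing:", p)
```

Running this before spark-submit saves a failed round trip through the Spark cluster.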

2.a Create a conda env for running cudf_udf tests

Please follow the steps mentioned in rapids.ai to create the conda env with cudf installed.

For example:

conda create -n rapids-0.18 -c rapidsai -c nvidia -c conda-forge \
-c defaults cudf=0.18 python=3.7 cudatoolkit=11.0

2.b Install Python packages needed by the cudf_udf tests

conda activate rapids-0.18
conda install pandas
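After installing, you can sanity-check that the env has the modules the tests rely on. This is a quick sketch (assuming you run it inside the activated conda env); cudf itself will only import successfully in the GPU-enabled env.

```python
import importlib.util

def missing_modules(names):
    """Return the module names that cannot be found in this env."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Modules the cudf_udf tests rely on:
print(missing_modules(["pandas", "pyarrow", "cudf"]))
```

An empty list means the env is ready; anything printed still needs to be installed.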

2.c Package your conda env

You can refer to this blog on how to package your conda env for a Spark job.

cd /home/xxx/miniconda3/envs
zip -r rapids-0.18.zip rapids-0.18/
mv rapids-0.18.zip ~/
cd ~/ && mkdir MYGLOBALENV
cd MYGLOBALENV/ && ln -s /home/xxx/miniconda3/envs/rapids-0.18/ rapids-0.18
cd ..
export PYSPARK_PYTHON=./MYGLOBALENV/rapids-0.18/bin/python

2.d Run the pandas cudf_udf tests

cd /home/xxx/github/spark-rapids/integration_tests 
PYSPARK_PYTHON=/home/xxx/MYGLOBALENV/rapids-0.18/bin/python $SPARK_HOME/bin/spark-submit --jars "/home/xxx/github/spark-rapids/dist/target/rapids-4-spark_2.12-0.5.0-SNAPSHOT.jar,/home/xxx/github/spark-rapids/udf-examples/target/rapids-4-spark-udf-examples_2.12-0.5.0-SNAPSHOT.jar,/home/xxx/spark/rapids/cudf.jar,/home/xxx/github/spark-rapids/tests/target/rapids-4-spark-tests_2.12-0.5.0-SNAPSHOT.jar" \
--conf spark.rapids.memory.gpu.allocFraction=0.3 \
--conf spark.rapids.python.memory.gpu.allocFraction=0.3 \
--conf spark.rapids.python.concurrentPythonWorkers=2 \
--py-files "/home/xxx/github/spark-rapids/dist/target/rapids-4-spark_2.12-0.5.0-SNAPSHOT.jar" \
--conf spark.executorEnv.PYTHONPATH="/home/xxx/github/spark-rapids/dist/target/rapids-4-spark_2.12-0.5.0-SNAPSHOT.jar" \
--conf spark.executorEnv.PYSPARK_PYTHON=/home/xxx/rapids-0.18/bin/python \
--archives /home/xxx/rapids-0.18.zip#MYGLOBALENV \
./runtests.py -m "cudf_udf" -v -rfExXs --cudf_udf

Note1:  Make sure all jar paths are correct.

Note2:  Here I am using a Spark standalone cluster, which is why I set spark.executorEnv.PYSPARK_PYTHON. For Spark on YARN, you need to use the corresponding parameter, spark.yarn.appMasterEnv.PYSPARK_PYTHON.
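For reference, on YARN the Python env settings from the command above would look roughly like this. This is a sketch under the assumption of running on YARN with the same archive; the rest of the flags (jars, rapids confs, the runtests.py invocation) stay as in the standalone example and are elided here.

```shell
# Sketch: equivalent Python env settings for Spark on YARN (not verified here).
# spark.yarn.appMasterEnv.* sets env vars for the YARN application master;
# spark.executorEnv.* still covers the executors.
$SPARK_HOME/bin/spark-submit \
  --master yarn \
  --archives /home/xxx/rapids-0.18.zip#MYGLOBALENV \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./MYGLOBALENV/rapids-0.18/bin/python \
  --conf spark.executorEnv.PYSPARK_PYTHON=./MYGLOBALENV/rapids-0.18/bin/python \
  ...
```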

Note3: Make sure $SPARK_HOME is set and that the Spark cluster is working fine with the RAPIDS Accelerator for Spark enabled.

The expected result is: PASSED [100%]. 

Reference:

http://alkaline-ml.com/2018-07-02-conda-spark/

