Goal:
How to run the pandas cudf_udf test for RAPIDS Accelerator for Apache Spark.
Env:
RAPIDS Accelerator for Apache Spark 0.4
Spark 3.1.1
Solution:
1. Compile RAPIDS Accelerator for Apache Spark
1.a Create a conda env for compiling
conda create -n cudftest -c conda-forge python=3.8 pytest pandas pyarrow sre_yield pytest-xdist findspark
Here I use one conda env, "cudftest", for compiling, and another env named "rapids-0.18" for running the cudf_udf tests in Spark.
Of course you can use a single conda env for both if you want, but it may end up containing too many python packages.
I want to keep the "rapids-0.18" env as small as possible because eventually I need to distribute it to all Executors in the Spark cluster.
1.b Compile from source code
cd ~/github/spark-rapids
# git checkout v0.4.0
mvn clean install -DskipTests
You can decide which version to compile. Here I am compiling the 0.5.0-SNAPSHOT from the current main branch; the current GA release is 0.4.
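After the build finishes, the jars should appear under the target directories referenced later in the spark-submit command. A quick hedged check (`list_build_jars` is just an illustrative helper name, and the version string in the filename depends on which branch or tag you compiled):

```shell
# list_build_jars: hypothetical helper that prints any RAPIDS Accelerator
# jars found under the given target directory.
list_build_jars() {
  find "$1" -name 'rapids-4-spark*_2.12-*.jar' 2>/dev/null
}

# If nothing prints, the build did not produce the expected artifacts.
list_build_jars ~/github/spark-rapids/dist/target || echo "no jars found; did the build succeed?"
```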
2. Run pandas cudf_udf Tests
Please follow this Doc on how to enable the pandas cudf_udf tests.
Basically, the pandas cudf_udf tests live in "./integration_tests/runtests.py" and are enabled with the "--cudf_udf" option.
The key is to make sure all the python envs and the needed jar file paths are correct.
2.a Create a conda env for running cudf_udf tests
Please follow the steps on rapids.ai to create a conda env with cudf installed.
For example:
conda create -n rapids-0.18 -c rapidsai -c nvidia -c conda-forge \
-c defaults cudf=0.18 python=3.7 cudatoolkit=11.0
2.b Install the python packages needed by the cudf_udf tests
conda activate rapids-0.18
conda install pandas
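Before packaging the env, it is worth confirming that the key modules import cleanly inside it. A hedged sketch (`check_import` is a made-up helper; `PY` simply falls back to whatever interpreter is on the PATH when PYSPARK_PYTHON is unset):

```shell
# check_import: hypothetical helper that reports whether a python module
# imports cleanly in the chosen interpreter.
PY="${PYSPARK_PYTHON:-python3}"
check_import() {
  if "$PY" -c "import $1" 2>/dev/null; then
    echo "$1: OK"
  else
    echo "$1: MISSING"
  fi
}

# With the rapids-0.18 env activated, both should print OK:
check_import cudf
check_import pandas
```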
2.c Package your conda env
You can refer to this blog on how to package your conda env for a Spark job.
cd /home/xxx/miniconda3/envs
zip -r rapids-0.18.zip rapids-0.18/
mv rapids-0.18.zip ~/
cd ~/ && mkdir MYGLOBALENV
cd MYGLOBALENV/ && ln -s /home/xxx/miniconda3/envs/rapids-0.18/ rapids-0.18
cd ..
export PYSPARK_PYTHON=./MYGLOBALENV/rapids-0.18/bin/python
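Note that ./MYGLOBALENV/rapids-0.18/bin/python is a relative path, so it only resolves from the directory you launch the driver in (here, $HOME). A quick hedged check before submitting (`check_pyspark_python` is an illustrative helper name):

```shell
# check_pyspark_python: hypothetical helper that verifies the interpreter
# path points at an executable file before you hand it to Spark.
check_pyspark_python() {
  if [ -x "$1" ]; then
    echo "usable: $1"
  else
    echo "not executable: $1"
  fi
}

# Must be run from the same directory spark-submit is launched in:
cd ~ && check_pyspark_python ./MYGLOBALENV/rapids-0.18/bin/python
```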
2.d Run the pandas cudf_udf tests
cd /home/xxx/github/spark-rapids/integration_tests
PYSPARK_PYTHON=/home/xxx/MYGLOBALENV/rapids-0.18/bin/python $SPARK_HOME/bin/spark-submit --jars "/home/xxx/github/spark-rapids/dist/target/rapids-4-spark_2.12-0.5.0-SNAPSHOT.jar,/home/xxx/github/spark-rapids/udf-examples/target/rapids-4-spark-udf-examples_2.12-0.5.0-SNAPSHOT.jar,/home/xxx/spark/rapids/cudf.jar,/home/xxx/github/spark-rapids/tests/target/rapids-4-spark-tests_2.12-0.5.0-SNAPSHOT.jar" \
--conf spark.rapids.memory.gpu.allocFraction=0.3 \
--conf spark.rapids.python.memory.gpu.allocFraction=0.3 \
--conf spark.rapids.python.concurrentPythonWorkers=2 \
--py-files "/home/xxx/github/spark-rapids/dist/target/rapids-4-spark_2.12-0.5.0-SNAPSHOT.jar" \
--conf spark.executorEnv.PYTHONPATH="/home/xxx/github/spark-rapids/dist/target/rapids-4-spark_2.12-0.5.0-SNAPSHOT.jar" \
--conf spark.executorEnv.PYSPARK_PYTHON=/home/xxx/rapids-0.18/bin/python \
--archives /home/xxx/rapids-0.18.zip#MYGLOBALENV \
./runtests.py -m "cudf_udf" -v -rfExXs --cudf_udf
Note1: Make sure all jar paths are correct.
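One way to verify this before launching (a hedged sketch; `check_jars` is a helper name I made up, and the paths below are the ones from the spark-submit command above):

```shell
# check_jars: hypothetical helper that reports any jar path that does not
# exist on disk; returns non-zero if anything is missing.
check_jars() {
  missing=0
  for jar in "$@"; do
    if [ -f "$jar" ]; then
      echo "found:   $jar"
    else
      echo "MISSING: $jar"
      missing=1
    fi
  done
  return "$missing"
}

check_jars \
  /home/xxx/github/spark-rapids/dist/target/rapids-4-spark_2.12-0.5.0-SNAPSHOT.jar \
  /home/xxx/github/spark-rapids/udf-examples/target/rapids-4-spark-udf-examples_2.12-0.5.0-SNAPSHOT.jar \
  /home/xxx/spark/rapids/cudf.jar \
  /home/xxx/github/spark-rapids/tests/target/rapids-4-spark-tests_2.12-0.5.0-SNAPSHOT.jar \
  || echo "fix the jar paths before running spark-submit"
```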
Note2: Here I am using a Spark standalone cluster, which is why I used spark.executorEnv.PYSPARK_PYTHON. For Spark on YARN, use the corresponding parameters such as spark.yarn.appMasterEnv.PYSPARK_PYTHON.
Note3: Make sure $SPARK_HOME is set and that the Spark cluster is working correctly with the RAPIDS Accelerator enabled.
The expected result is: PASSED [100%].
Reference:
http://alkaline-ml.com/2018-07-02-conda-spark/