tag:blogger.com,1999:blog-9292704105155687022024-03-13T23:15:20.423-07:00Open Knowledge BaseOpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.comBlogger318125tag:blogger.com,1999:blog-929270410515568702.post-44817172426702988342022-07-19T10:25:00.004-07:002022-07-19T10:25:35.988-07:00Spark writing to S3 failed: java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument<h2 style="text-align: left;">Symptom:</h2><p>When using Spark to write to S3, the insert query failed:</p>
<pre class="brush:text; toolbar: false; auto-links: false">Caused by: java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;Ljava/lang/Object;)V<br /> at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:893)<br /> at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:869)<br /> at org.apache.hadoop.fs.s3a.S3AUtils.getEncryptionAlgorithm(S3AUtils.java:1580)<br /> at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:341)<br /> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)<br /> at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)<br /> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)<br /> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)<br /> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)<br /> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)<br /> at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:53)<br /> at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)<br /> at org.apache.spark.sql.execution.datasources.FindDataSourceTable.$anonfun$readDataSourceTable$1(DataSourceStrategy.scala:252)<br /> at org.sparkproject.guava.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4792)<br /> at org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)<br /> at org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)<br /> at org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)<br /> at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)</pre>
<h2 style="text-align: left;">Env:</h2><p>spark-3.2.1-bin-hadoop3.2<br />hadoop-aws-3.2.3.jar<br />aws-java-sdk-bundle-1.11.375.jar<br />guava-14.0.1.jar<span></span></p><a name='more'></a><p></p><h2 style="text-align: left;">Solution:</h2><p>Remove guava-14.0.1.jar from Spark and use the Hive's newer guava-27.0-jre.jar.</p>
<pre class="brush:bash; toolbar: false; auto-links: false">$ ls -altr $SPARK_HOME/jars/guava*.jar<br />lrwxrwxrwx 1 xxx xxx 46 Jul 19 09:39 /home/xxx/spark/myspark/jars/guava-27.0-jre.jar -> /home/xxx/hive/myhive/lib/guava-27.0-jre.jar</pre>
<p> <br /></p><p> <br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com2tag:blogger.com,1999:blog-929270410515568702.post-41992559953258545392022-07-19T10:14:00.003-07:002022-07-19T10:14:56.519-07:00Spark writing to S3 failed: java.lang.NoSuchMethodError: org.apache.hadoop.util.SemaphoredDelegatingExecutor.<h2 style="text-align: left;">Symptom:</h2><p>When using Spark to write to S3, the insert query failed:</p>
<pre class="brush:text; toolbar: false; auto-links: false">java.lang.NoSuchMethodError: org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V<br /> at org.apache.hadoop.fs.s3a.impl.StoreContext.createThrottledExecutor(StoreContext.java:292)<br /> at org.apache.hadoop.fs.s3a.impl.DeleteOperation.<init>(DeleteOperation.java:206)<br /> at org.apache.hadoop.fs.s3a.S3AFileSystem.delete(S3AFileSystem.java:2468)<br /> at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.cleanupJob(FileOutputCommitter.java:532)<br /> at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.abortJob(FileOutputCommitter.java:551)<br /> at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:242)<br /> at org.apache.spark.sql.rapids.GpuFileFormatWriter$.write(GpuFileFormatWriter.scala:262)</pre><h2 style="text-align: left;">Env:</h2><p>spark-3.2.1-bin-hadoop3.2</p><p>hadoop-aws-3.2.1.jar<br /></p><p>aws-java-sdk-bundle-1.11.375.jar<span></span></p><a name='more'></a><p></p><h2 style="text-align: left;">Solution:</h2><p>After upgrading hadoop-aws-3.2.1.jar to hadoop-aws-3.2.3.jar, it works fine.</p>
<pre class="brush:bash; toolbar: false; auto-links: false">$ ls -altr $SPARK_HOME/jars|grep -i aws<br />-rw-rw-r-- 1 xxx xxx 98732349 Jul 26 2018 aws-java-sdk-bundle-1.11.375.jar<br />-rw-rw-r-- 1 xxx xxx 506819 Jul 19 10:03 hadoop-aws-3.2.3.jar</pre>
<p><br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-87431426142094331232021-09-23T17:02:00.000-07:002021-09-23T17:02:17.679-07:00How to access Azure Open Dataset from Spark<h1 style="text-align: left;">Goal:</h1><p>This article explains how to access Azure Open Dataset from Spark. <br /></p><h1 style="text-align: left;">Env:</h1><p>spark-3.1.1-bin-hadoop2.7</p><span><a name='more'></a></span><h1 style="text-align: left;">Solution:</h1><p>Microsoft <a href="https://docs.microsoft.com/en-us/azure/open-datasets/" rel="nofollow" target="_blank">Azure Open Dataset</a> is curated and cleansed data - including weather, census, and holidays -
that you can use with minimal preparation to enrich ML models.</p><p>If we want to access it from a local Spark environment, we need 2 jars:</p><ul style="text-align: left;"><li>azure-storage-<version>.jar</li><li>hadoop-azure-<version>.jar <br /></li></ul><p>My Spark is built on Hadoop 2.7, so I have to use a relatively older hadoop-azure jar. </p><p>In this example, I downloaded the two jars below:</p><ul style="text-align: left;"><li><a href="https://repo1.maven.org/maven2/com/microsoft/azure/azure-storage/8.6.6/azure-storage-8.6.6.jar" rel="nofollow" target="_blank">azure-storage-8.6.6.jar</a></li><li><a href="https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure/2.8.5/hadoop-azure-2.8.5.jar" rel="nofollow" target="_blank">hadoop-azure-2.8.5.jar</a><br /></li></ul><h2 style="text-align: left;">1. Add the above 2 jars into the Spark classpath.</h2>
<pre class="brush:bash; toolbar: false; auto-links: false">spark.executor.extraClassPath<br />spark.driver.extraClassPath</pre>
<h2 style="text-align: left;">2. Add Azure Blob Storage related Hadoop configs</h2><p>For example, I choose to add them directly into Jupyter notebook(or you can add them into core-site.xml):<br /></p>
<pre class="brush:python; toolbar: false; auto-links: false">sc._jsc.hadoopConfiguration().set("fs.azure","org.apache.hadoop.fs.azure.NativeAzureFileSystem")<br />sc._jsc.hadoopConfiguration().set("spark.hadoop.fs.wasbs.impl","org.apache.hadoop.fs.azure.NativeAzureFileSystem")<br />sc._jsc.hadoopConfiguration().set("fs.wasbs.impl","org.apache.hadoop.fs.azure.NativeAzureFileSystem")<br />sc._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.wasbs.impl", "org.apache.hadoop.fs.azure.Wasbs")<br />sc._jsc.hadoopConfiguration().set("spark.hadoop.fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")</pre>
<h2 style="text-align: left;">3. Follow PySpark commands to access Azure Open Dataset</h2><p>For example, the <a href="https://docs.microsoft.com/en-us/azure/open-datasets/dataset-taxi-yellow?tabs=pyspark" rel="nofollow" target="_blank">PySpark commands</a> are here for accessing "NYC Taxi - Yellow" Azure Open Dataset.</p><p><br /></p><p><br /></p><p><br /></p><p><br /></p><p><br /></p><p><br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com2tag:blogger.com,1999:blog-929270410515568702.post-35953563454828033782021-05-03T15:20:00.007-07:002021-05-07T09:54:22.806-07:00Understand Decimal precision and scale calculation in Spark using GPU or CPU mode<h1 style="text-align: left;">Goal:</h1><p>This article research on how Spark calculates the Decimal precision and scale using GPU or CPU mode. </p><p>Basically we will test Addition/Subtraction/Multiplication/Division/Modulo/Union in this post.<span></span></p><a name='more'></a><p></p><h1 style="text-align: left;">Env:</h1><p>Spark 3.1.1</p><p>Rapids accelerator 0.5 snapshot with cuDF 0.19 snapshot jar<br /></p><h1 style="text-align: left;">Concept:</h1><p>Spark's logic to calculates the Decimal precision and scale is inside <a href="https://github.com/apache/spark/blob/v3.1.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala">DecimalPrecision.scala</a>.<br /></p>
<pre class="brush:text; toolbar: false; auto-links: false"> * In particular, if we have expressions e1 and e2 with precision/scale p1/s1 and p2/s2<br /> * respectively, then the following operations have the following precision / scale:<br /> *<br /> * Operation Result Precision Result Scale<br /> * ------------------------------------------------------------------------<br /> * e1 + e2 max(s1, s2) + max(p1-s1, p2-s2) + 1 max(s1, s2)<br /> * e1 - e2 max(s1, s2) + max(p1-s1, p2-s2) + 1 max(s1, s2)<br /> * e1 * e2 p1 + p2 + 1 s1 + s2<br /> * e1 / e2 p1 - s1 + s2 + max(6, s1 + p2 + 1) max(6, s1 + p2 + 1)<br /> * e1 % e2 min(p1-s1, p2-s2) + max(s1, s2) max(s1, s2)<br /> * e1 union e2 max(s1, s2) + max(p1-s1, p2-s2) max(s1, s2)</pre>
<p>This matches Hive's rule in the <a href="https://cwiki.apache.org/confluence/download/attachments/27362075/Hive_Decimal_Precision_Scale_Support.pdf" rel="nofollow" target="_blank">Hive Decimal Precision/Scale Support</a> document.<br /></p><p>In addition, Spark has a parameter <b><i>spark.sql.decimalOperations.allowPrecisionLoss</i></b> (default true): if the needed precision/scale is out of the range of representable values, the scale is reduced (down to a minimum of 6) in order to prevent truncation of the integer part of the decimal.<br /></p><p> </p><p>Now let's look at the limit in GPU mode (with the Rapids accelerator): </p><p>Currently in the Rapids accelerator 0.4.1/0.5 snapshot releases, decimals are limited to 18 digits (64-bit) as per <a href="https://nvidia.github.io/spark-rapids/docs/supported_ops.html" rel="nofollow" target="_blank">this doc</a>.<br /></p><p>So if the precision is > 18, it will fall back to CPU mode.</p><p>Below let's run some tests to confirm that the theory matches practice.<br /></p><h1 style="text-align: left;">Solution:</h1><h2 style="text-align: left;">1. Prepare an example Dataframe with different types of decimal <br /></h2>
<pre class="brush:java; toolbar: false; auto-links: false">import org.apache.spark.sql.functions._<br />import spark.implicits._<br />import org.apache.spark.sql.types._<br />spark.conf.set("spark.rapids.sql.enabled", true)<br />spark.conf.set("spark.rapids.sql.decimalType.enabled", true)<br /><br />val df = spark.sparkContext.parallelize(Seq(1)).toDF()<br />val df2=df.withColumn("value82", (lit("123456.78").cast(DecimalType(8,2)))).<br /> withColumn("value63", (lit("123.456").cast(DecimalType(6,3)))).<br /> withColumn("value1510", (lit("12345.0123456789").cast(DecimalType(15,10)))).<br /> withColumn("value2510", (lit("123456789012345.0123456789").cast(DecimalType(25,10))))<br /><br />df2.write.parquet("/tmp/df2.parquet")<br />val newdf2=spark.read.parquet("/tmp/df2.parquet")<br />newdf2.createOrReplaceTempView("df2")</pre>
newdf2's schema: <br /> <pre class="brush:java; toolbar: false; auto-links: false">scala> newdf2.printSchema<br />root<br /> |-- value: integer (nullable = false)<br /> |-- value82: decimal(8,2) (nullable = true)<br /> |-- value63: decimal(6,3) (nullable = true)<br /> |-- value1510: decimal(15,10) (nullable = true)<br /> |-- value2510: decimal(25,10) (nullable = true)</pre><h2 style="text-align: left;">2. GPU Mode (Result Decimal within GPU's limit : <=18 digits)</h2><p>The tests below make sure every result decimal's precision is within the GPU's limit, which is 18 digits in this Rapids accelerator version.</p><p>So we only use 2 fields of df2 -- value82: decimal(8,2) and value63: decimal(6,3). <br /></p><p>This is to confirm whether the theory holds in GPU mode.<br /></p><p>To calculate the expected result precision and scale from the rules above, let's define the inputs once so the math is easy to follow:</p>
<pre class="brush:java; toolbar: false; auto-links: false">import scala.math.{max, min}<br />val (p1,s1)=(8,2)<br />val (p2,s2)=(6,3)</pre>
<h3 style="text-align: left;">2.1 Addition</h3>
<pre class="brush:java; toolbar: false; auto-links: false">val df_plus=spark.sql("SELECT value82+value63 FROM df2")<br />df_plus.printSchema<br />df_plus.explain<br />df_plus.collect</pre>
<p>Output:</p>
<pre class="brush:java; toolbar: false; auto-links: false">scala> val df_plus=spark.sql("SELECT value82+value63 FROM df2")<br />df_plus: org.apache.spark.sql.DataFrame = [(CAST(value82 AS DECIMAL(10,3)) + CAST(value63 AS DECIMAL(10,3))): decimal(10,3)]<br /><br />scala> df_plus.printSchema<br />root<br /> |-- (CAST(value82 AS DECIMAL(10,3)) + CAST(value63 AS DECIMAL(10,3))): decimal(10,3) (nullable = true)<br /><br /><br />scala> df_plus.explain<br />== Physical Plan ==<br />GpuColumnarToRow false<br />+- GpuProject [gpucheckoverflow((gpupromoteprecision(cast(value82#58 as decimal(10,3))) + gpupromoteprecision(cast(value63#59 as decimal(10,3)))), DecimalType(10,3), true) AS (CAST(value82 AS DECIMAL(10,3)) + CAST(value63 AS DECIMAL(10,3)))#88]<br /> +- GpuFileGpuScan parquet [value82#58,value63#59] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/df2.parquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value82:decimal(8,2),value63:decimal(6,3)><br /><br /><br /><br />scala> df_plus.collect<br />res21: Array[org.apache.spark.sql.Row] = Array([123580.236])</pre>
<p>The result Decimal is (10,3), which matches the theory, and it also runs on GPU as shown in the explain output.<br /></p><pre class="brush:java; toolbar: false; auto-links: false">scala> max(s1, s2) + max(p1-s1, p2-s2) + 1<br />res7: Int = 10<br /><br />scala> max(s1, s2)<br />res8: Int = 3</pre>
<p>Note: In the following tests, I will just show the result instead of printing all the output, to keep this post short. But feel free to do the math yourself.</p><h3 style="text-align: left;">2.2 Subtraction</h3>
<pre class="brush:java; toolbar: false; auto-links: false"># Result Decimal (10,3)<br />val df_minus=spark.sql("SELECT value82-value63 FROM df2")<br />df_minus.printSchema<br />df_minus.explain<br />df_minus.collect</pre>
<h3 style="text-align: left;">2.3 Multiplication</h3><pre class="brush:java; toolbar: false; auto-links: false"># Result Decimal (15,5) <br />val df_multi=spark.sql("SELECT value82*value63 FROM df2")<br />df_multi.printSchema<br />df_multi.explain<br />df_multi.collect</pre>
<div style="text-align: left;">Output:</div>
<pre class="brush:java; toolbar: false; auto-links: false;highlight: [14,23]">scala> val df_multi=spark.sql("SELECT value82*value63 FROM df2")<br />df_multi: org.apache.spark.sql.DataFrame = [(CAST(value82 AS DECIMAL(9,3)) * CAST(value63 AS DECIMAL(9,3))): decimal(15,5)]<br /><br />scala> df_multi.printSchema<br />root<br /> |-- (CAST(value82 AS DECIMAL(9,3)) * CAST(value63 AS DECIMAL(9,3))): decimal(15,5) (nullable = true)<br /><br /><br />scala> df_multi.explain<br />21/05/04 18:02:21 WARN GpuOverrides:<br />!Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced<br /> @Expression <Alias> CheckOverflow((promote_precision(cast(value82#58 as decimal(9,3))) * promote_precision(cast(value63#59 as decimal(9,3)))), DecimalType(15,5), true) AS (CAST(value82 AS DECIMAL(9,3)) * CAST(value63 AS DECIMAL(9,3)))#96 could run on GPU<br /> @Expression <CheckOverflow> CheckOverflow((promote_precision(cast(value82#58 as decimal(9,3))) * promote_precision(cast(value63#59 as decimal(9,3)))), DecimalType(15,5), true) could run on GPU<br /> !Expression <Multiply> (promote_precision(cast(value82#58 as decimal(9,3))) * promote_precision(cast(value63#59 as decimal(9,3)))) cannot run on GPU because The actual output precision of the multiply is too large to fit on the GPU DecimalType(19,6)<br /> @Expression <PromotePrecision> promote_precision(cast(value82#58 as decimal(9,3))) could run on GPU<br /> @Expression <Cast> cast(value82#58 as decimal(9,3)) could run on GPU<br /> @Expression <AttributeReference> value82#58 could run on GPU<br /> @Expression <PromotePrecision> promote_precision(cast(value63#59 as decimal(9,3))) could run on GPU<br /> @Expression <Cast> cast(value63#59 as decimal(9,3)) could run on GPU<br /> @Expression <AttributeReference> value63#59 could run on GPU<br /><br />== Physical Plan ==<br />*(1) Project [CheckOverflow((promote_precision(cast(value82#58 as decimal(9,3))) * promote_precision(cast(value63#59 as decimal(9,3)))), DecimalType(15,5), true) AS (CAST(value82 AS DECIMAL(9,3)) * CAST(value63 AS DECIMAL(9,3)))#96]<br />+- GpuColumnarToRow false<br /> +- GpuFileGpuScan parquet [value82#58,value63#59] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/df2.parquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value82:decimal(8,2),value63:decimal(6,3)><br /><br /><br /><br />scala> df_multi.collect<br />res27: Array[org.apache.spark.sql.Row] = Array([15241480.23168])</pre>
<div style="text-align: left;">Here even though the result Decimal is just (15,5) but it still falls back on CPU.</div><div style="text-align: left;">This is because Spark inserts "PromotePrecision" to CAST both sides to the same type -- Decimal(9,3).</div><div style="text-align: left;">Currently GPU has to be very cautious on this PromotePrecision, so it thought the result is Decimal (19,6) instead of (15,5).<br /></div><h3 style="text-align: left;">2.4 Division</h3><pre class="brush:java; toolbar: false; auto-links: false"># Result Decimal (18,9) -- Fallback on CPU<br />val df_div=spark.sql("SELECT value82/value63 FROM df2")<br />df_div.printSchema<br />df_div.explain<br />df_div.collect</pre>
<h3 style="text-align: left;">2.5 Modulo</h3><pre class="brush:java; toolbar: false; auto-links: false"># Result Decimal (6,3) -- Fallback on CPU<br />val df_mod=spark.sql("SELECT value82 % value63 FROM df2")<br />df_mod.printSchema<br />df_mod.explain<br />df_mod.collect</pre>
<div style="text-align: left;"><b>Note: this is because Modulo is not supported for Decimal on GPU as per this <a href="https://github.com/NVIDIA/spark-rapids/blob/branch-0.5/docs/supported_ops.md" rel="nofollow" target="_blank">supported_ops.md</a>. </b><br /></div><h3 style="text-align: left;">2.6 Union</h3><pre class="brush:java; toolbar: false; auto-links: false"># Result Decimal (9,3) <br />val df_union=spark.sql("SELECT value82 from df2 union SELECT value63 from df2")<br />df_union.printSchema<br />df_union.explain<br />df_union.collect<br /></pre>
<h2 style="text-align: left;">3. GPU Mode fallback to CPU (19 ~ 38 digits)</h2><p>Below tests may fall back to CPU if result decimal's precision is above GPU's
limit. </p><p>So we only use 2 fields -- value82: decimal(8,2) and value1510: decimal(15,10) of df2. <br /></p><h3 style="text-align: left;">3.1 Addition</h3>
<pre class="brush:java; toolbar: false; auto-links: false"># Result Decimal (17,10) -- within GPU limit<br />val df_plus=spark.sql("SELECT value82+value1510 FROM df2")<br />df_plus.printSchema<br />df_plus.explain<br />df_plus.collect</pre>
<h3 style="text-align: left;">3.2 Subtraction <br /></h3>
<pre class="brush:java; toolbar: false; auto-links: false"># Result Decimal (17,10) -- within GPU limit<br />val df_minus=spark.sql("SELECT value82-value1510 FROM df2")<br />df_minus.printSchema<br />df_minus.explain<br />df_minus.collect</pre>
<div style="text-align: left;"><h3 style="text-align: left;">3.3 Multiplication</h3></div>
<pre class="brush:java; toolbar: false; auto-links: false"># Result Decimal (24,12) -- outside of GPU limit<br />val df_multi=spark.sql("SELECT value82*value1510 FROM df2")<br />df_multi.printSchema<br />df_multi.explain<br /></pre>
<div style="text-align: left;">Output:</div>
<pre class="brush:java; toolbar: false; auto-links: false;highlight: [12,23]">scala> val df_multi=spark.sql("SELECT value82*value1510 FROM df2")<br />df_multi: org.apache.spark.sql.DataFrame = [(CAST(value82 AS DECIMAL(16,10)) * CAST(value1510 AS DECIMAL(16,10))): decimal(24,12)]<br /><br />scala> df_multi.printSchema<br />root<br /> |-- (CAST(value82 AS DECIMAL(16,10)) * CAST(value1510 AS DECIMAL(16,10))): decimal(24,12) (nullable = true)<br /><br /><br />scala> df_multi.explain<br />21/05/04 18:44:46 WARN GpuOverrides:<br />!Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced; unsupported data types in output: DecimalType(24,12)<br /> !Expression <Alias> CheckOverflow((promote_precision(cast(value82#58 as decimal(16,10))) * promote_precision(cast(value1510#60 as decimal(16,10)))), DecimalType(24,12), true) AS (CAST(value82 AS DECIMAL(16,10)) * CAST(value1510 AS DECIMAL(16,10)))#132 cannot run on GPU because expression Alias CheckOverflow((promote_precision(cast(value82#58 as decimal(16,10))) * promote_precision(cast(value1510#60 as decimal(16,10)))), DecimalType(24,12), true) AS (CAST(value82 AS DECIMAL(16,10)) * CAST(value1510 AS DECIMAL(16,10)))#132 produces an unsupported type DecimalType(24,12); expression CheckOverflow CheckOverflow((promote_precision(cast(value82#58 as decimal(16,10))) * promote_precision(cast(value1510#60 as decimal(16,10)))), DecimalType(24,12), true) produces an unsupported type DecimalType(24,12)<br /> !Expression <CheckOverflow> CheckOverflow((promote_precision(cast(value82#58 as decimal(16,10))) * promote_precision(cast(value1510#60 as decimal(16,10)))), DecimalType(24,12), true) cannot run on GPU because expression CheckOverflow CheckOverflow((promote_precision(cast(value82#58 as decimal(16,10))) * promote_precision(cast(value1510#60 as decimal(16,10)))), DecimalType(24,12), true) produces an unsupported type DecimalType(24,12)<br /> !Expression <Multiply> (promote_precision(cast(value82#58 as decimal(16,10))) * promote_precision(cast(value1510#60 as decimal(16,10)))) cannot run on GPU because The actual output precision of the multiply is too large to fit on the GPU DecimalType(33,20)<br /> @Expression <PromotePrecision> promote_precision(cast(value82#58 as decimal(16,10))) could run on GPU<br /> @Expression <Cast> cast(value82#58 as decimal(16,10)) could run on GPU<br /> @Expression <AttributeReference> value82#58 could run on GPU<br /> @Expression <PromotePrecision> promote_precision(cast(value1510#60 as decimal(16,10))) could run on GPU<br /> @Expression <Cast> cast(value1510#60 as decimal(16,10)) could run on GPU<br /> @Expression <AttributeReference> value1510#60 could run on GPU<br /><br />== Physical Plan ==<br />*(1) Project [CheckOverflow((promote_precision(cast(value82#58 as decimal(16,10))) * promote_precision(cast(value1510#60 as decimal(16,10)))), DecimalType(24,12), true) AS (CAST(value82 AS DECIMAL(16,10)) * CAST(value1510 AS DECIMAL(16,10)))#132]<br />+- GpuColumnarToRow false<br /> +- GpuFileGpuScan parquet [value82#58,value1510#60] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/df2.parquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value82:decimal(8,2),value1510:decimal(15,10)><br /><br /><br /><br />scala> df_multi.collect<br />res51: Array[org.apache.spark.sql.Row] = Array([1524075473.257763907942])</pre>
<div style="text-align: left;"><h3 style="text-align: left;">3.4 Division</h3></div><pre class="brush:java; toolbar: false; auto-links: false"># Result Decimal (34,18) -- outside of GPU limit<br />val df_div=spark.sql("SELECT value82/value1510 FROM df2")<br />df_div.printSchema<br />df_div.explain<br />df_div.collect</pre>
<div style="text-align: left;"><h3 style="text-align: left;">3.5 Modulo</h3></div><pre class="brush:java; toolbar: false; auto-links: false"># Result Decimal(15,10) -- within GPU limit, but fallback on CPU<br />val df_mod=spark.sql("SELECT value82 % value1510 FROM df2")<br />df_mod.printSchema<br />df_mod.explain<br />df_mod.collect</pre>
<div style="text-align: left;"><div style="text-align: left;"><b>Note: this is because Modulo is not supported for Decimal on GPU as per this <a href="https://github.com/NVIDIA/spark-rapids/blob/branch-0.5/docs/supported_ops.md" rel="nofollow" target="_blank">supported_ops.md</a>. </b><br /></div><h3 style="text-align: left;">3.6 Union</h3></div>
<pre class="brush:java; toolbar: false; auto-links: false"># Result Decimal (16,10) -- within GPU limit<br />val df_union=spark.sql("SELECT value82 from df2 union SELECT value1510 from df2")<br />df_union.printSchema<br />df_union.explain<br />df_union.collect</pre>
<div style="text-align: left;"><h2 style="text-align: left;">4. Above decimal max range (> 38 digits) </h2></div><div style="text-align: left;">If the result decimal is above 38 digits, <b><i>spark.sql.decimalOperations.allowPrecisionLoss</i></b> can be used to control the behavior.<br /></div><div style="text-align: left;">So we only use 2 fields -- value1510: decimal(15,10) and value2510: decimal(25,10) of df2. <br /></div>
<pre class="brush:java; toolbar: false; auto-links: false"># Result Decimal (38,17)<br />val df_multi=spark.sql("SELECT value1510*value2510 FROM df2")<br />df_multi.printSchema<br />df_multi.explain<br />df_multi.collect</pre>
<div style="text-align: left;">As per the theory, the result decimal should be (41,20): <br /></div>
<pre class="brush:java; toolbar: false; auto-links: false">scala> val (p1,s1)=(15,10)<br />p1: Int = 15<br />s1: Int = 10<br /><br />scala> val (p2,s2)=(25,10)<br />p2: Int = 25<br />s2: Int = 10<br /><br />scala> p1 + p2 + 1<br />res31: Int = 41<br /><br />scala> s1 + s2<br />res32: Int = 20</pre>
<div style="text-align: left;">However since 41>38, so another function <b><i>adjustPrecisionScale</i></b> inside <a href="https://github.com/apache/spark/blob/v3.1.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala" rel="nofollow" target="_blank">DecimalType.scala</a> is called to adjust the precision and scale. </div><div style="text-align: left;">For this specific example, below code logic is applied:</div>
<pre class="brush:java; toolbar: false; auto-links: false"> } else {<br /> // Precision/scale exceed maximum precision. Result must be adjusted to MAX_PRECISION.<br /> val intDigits = precision - scale<br /> // If original scale is less than MINIMUM_ADJUSTED_SCALE, use original scale value; otherwise<br /> // preserve at least MINIMUM_ADJUSTED_SCALE fractional digits<br /> val minScaleValue = Math.min(scale, MINIMUM_ADJUSTED_SCALE)<br /> // The resulting scale is the maximum between what is available without causing a loss of<br /> // digits for the integer part of the decimal and the minimum guaranteed scale, which is<br /> // computed above<br /> val adjustedScale = Math.max(MAX_PRECISION - intDigits, minScaleValue)<br /><br /> DecimalType(MAX_PRECISION, adjustedScale)<br /> }</pre>
<div style="text-align: left;">So intDigits=41-20=21, minScaleValue=6, adjustedScale=max(38-21,6)=17.</div><div style="text-align: left;">That is why the result decimal is (38,17).<br /></div><div style="text-align: left;"> </div><div style="text-align: left;">Since above function is only called when <b><i>spark.sql.decimalOperations.allowPrecisionLoss</i></b>=true, so if we set it false, it will return null:</div>
<pre class="brush:java; toolbar: false; auto-links: false">scala> df_multi.collect<br />res67: Array[org.apache.spark.sql.Row] = Array([null])</pre>
<div style="text-align: left;"><h1 style="text-align: left;">References:</h1></div><p><a href="https://cwiki.apache.org/confluence/download/attachments/27362075/Hive_Decimal_Precision_Scale_Support.pdf" rel="nofollow" target="_blank"><span class="pl-c">https://cwiki.apache.org/confluence/download/attachments/27362075/Hive_Decimal_Precision_Scale_Support.pdf</span></a></p><p><span class="pl-c"> </span> <br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com10tag:blogger.com,1999:blog-929270410515568702.post-10121474420173567682021-04-30T11:19:00.005-07:002021-04-30T11:19:57.551-07:00kubelet failed to start after rebooting<h1 style="text-align: left;">Symptom:</h1><p>kubelet failed to start after rebooting. </p><span><a name='more'></a></span><h1 style="text-align: left;">Env:</h1><p>Ubuntu 18.04</p><p>Kubernetes 1.19 <br /></p><h1 style="text-align: left;">Root Cause:</h1><p>From "<b>journalctl -xefu kubelet</b>", we can find out the root cause:<br /></p>
<pre class="brush:text; toolbar: false; auto-links: false;highlight: 1">kubelet[11111]: F0430 xx:xx:xx.123456 11111 server.go:265] failed to run Kubelet: running with swap on is not supported, please disable swap! or set --fail-swap-on flag to false. /proc/swaps contained: </pre>
<p>Basically it means that after rebooting, swap was turned back on somehow.<br /></p><h1 style="text-align: left;">Solution: <br /></h1><p>As mentioned in another blog "How to install a Kubernetes Cluster on CentOS 7", follow step 1.2 Disable Swap.</p><pre class="brush:bash; toolbar: false; auto-links: false">swapoff -a</pre>
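<p>For the fstab change described next, a possible one-liner (this assumes standard swap entries in /etc/fstab, so review the file before running it):</p>
<pre class="brush:bash; toolbar: false; auto-links: false"># Back up fstab, then comment out any line that mounts swap<br />cp /etc/fstab /etc/fstab.bak<br />sed -i '/\sswap\s/s/^/#/' /etc/fstab</pre>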
<p>And then comment out the swap entries in <b>/etc/fstab</b>.<br /></p><p>After that, "<b>systemctl status kubelet</b>" should show kubelet is active (running). <br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com1tag:blogger.com,1999:blog-929270410515568702.post-64328195661542683972021-04-29T17:08:00.010-07:002021-06-14T14:52:39.995-07:00How to use Spark Operator to run Spark job with Rapids Accelerator<h1 style="text-align: left;">Goal:</h1><p>This article shares the steps on how to run Spark job with Rapids Accelerator using <a href="https://github.com/GoogleCloudPlatform/spark-on-k8s-operator" rel="nofollow" target="_blank">Spark Operator</a> in a Kubernetes Cluster.<span></span></p><a name='more'></a><p></p><h1 style="text-align: left;">Env:</h1><p>Spark 3.1.1</p><p>Rapids Accelerator 0.4.1 with cuDF 0.18.1</p><p>Kubernetes Cluster 1.19</p><p>Spark Operator<br /></p><h1 style="text-align: left;">Solution: <br /></h1><p>As per <a href="https://issues.apache.org/jira/browse/SPARK-33005" rel="nofollow" target="_blank">SPARK-33005</a>, Spark on Kubernetes is GA in Spark 3.1.1. <br /></p><p>In the Rapids Accelerator official Doc: <a href="https://github.com/NVIDIA/spark-rapids/blob/branch-21.06/docs/get-started/getting-started-kubernetes.md" rel="nofollow" target="_blank">Getting Started with RAPIDS and Kubernetes</a>, it shares the steps on how to use spark-submit/spark-shell to directly submit Spark jobs into a Kubernetes Cluster.</p><p>This article will mainly focus on how to use <a href="https://github.com/GoogleCloudPlatform/spark-on-k8s-operator" rel="nofollow" target="_blank">Spark Operator</a> to do the same thing.</p><p>Here we assume you already have a working Kubernetes Cluster with NVIDIA GPU support, and also built your own Spark docker image by following the above <a href="https://github.com/NVIDIA/spark-rapids/blob/branch-21.06/docs/get-started/getting-started-kubernetes.md" rel="nofollow" target="_blank">Getting Started with RAPIDS and Kubernetes</a>. </p><h2 style="text-align: left;">1. Copy your application into the docker image</h2><p>When following above <a href="https://github.com/NVIDIA/spark-rapids/blob/branch-21.06/docs/get-started/getting-started-kubernetes.md" rel="nofollow" target="_blank">Getting Started with RAPIDS and Kubernetes</a>, make sure you modify the Dockerfile to copy your application(such as jars, python files) into the docker image. </p><p>This is because, as of today, as per the <a href="https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md" rel="nofollow" target="_blank">Spark Operator user guide</a> : "A SparkApplication should set .spec.deployMode to <b><span style="color: red;">cluster</span></b>, as <b>client is not currently implemented</b>. The driver pod will then run spark-submit in client mode internally to run the driver program. "</p><p>Here we created a below test.py and copy it into docker image under directory "/opt/sparkRapidsPlugin": <br /></p>
<pre class="brush:python; toolbar: false; auto-links: false">from pyspark.sql import SQLContext<br />from pyspark import SparkConf<br />from pyspark import SparkContext<br />conf = SparkConf()<br />sc = SparkContext.getOrCreate()<br />sqlContext = SQLContext(sc)<br />df=sqlContext.createDataFrame([1,2,3], "int").toDF("value")<br />df.createOrReplaceTempView("df")<br />sqlContext.sql("SELECT * FROM df WHERE value<>1").explain()<br />sqlContext.sql("SELECT * FROM df WHERE value<>1").show()<br />sc.stop()</pre>
<p>Modify Dockerfile to add below:<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">COPY test.py /opt/sparkRapidsPlugin</pre>
<h2 style="text-align: left;">2. Create spark-operator in a namespace named "spark-operator" using helm chart.<br /></h2><p>Here we just follow the <a href="https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/quick-start-guide.md" rel="nofollow" target="_blank">Spark Operator quick start guide</a>. <br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator<br />helm install my-release spark-operator/spark-operator --namespace spark-operator --create-namespace</pre>
<p>In the end, if you want to delete this chart, use below command:</p>
<pre class="brush:bash; toolbar: false; auto-links: false">helm uninstall my-release --namespace spark-operator</pre>
<h2 style="text-align: left;">3. Check what objects are created in Kubernetes Cluster <br /></h2>
<pre class="brush:bash; toolbar: false; auto-links: false">$ kubectl get pods -n spark-operator<br />NAME READY STATUS RESTARTS AGE<br />my-release-spark-operator-599f575d4-cjlmz 1/1 Running 0 62s<br /><br />$ kubectl get deployment -n spark-operator<br />NAME READY UP-TO-DATE AVAILABLE AGE<br />my-release-spark-operator 1/1 1 1 101s<br /><br />$ kubectl get clusterrolebinding |grep spark-operator<br />my-release-spark-operator ClusterRole/my-release-spark-operator 5m28s<br /><br />$ kubectl describe clusterrolebinding my-release-spark-operator<br />Name: my-release-spark-operator<br />Labels: app.kubernetes.io/instance=my-release<br /> app.kubernetes.io/managed-by=Helm<br /> app.kubernetes.io/name=spark-operator<br /> app.kubernetes.io/version=v1beta2-1.2.3-3.1.1<br /> helm.sh/chart=spark-operator-1.1.0<br />Annotations: meta.helm.sh/release-name: my-release<br /> meta.helm.sh/release-namespace: spark-operator<br />Role:<br /> Kind: ClusterRole<br /> Name: my-release-spark-operator<br />Subjects:<br /> Kind Name Namespace<br /> ---- ---- ---------<br /> ServiceAccount my-release-spark-operator spark-operator<br /><br /><br />$ kubectl get role -n spark-operator<br />NAME CREATED AT<br />spark-role 2021-04-29T16:16:32Z</pre>
<h2 style="text-align: left;">4. Check the status of spark-operator <br /></h2>
<pre class="brush:bash; toolbar: false; auto-links: false">$ helm status --namespace spark-operator my-release<br />NAME: my-release<br />LAST DEPLOYED: Thu Apr 29 09:20:14 2021<br />NAMESPACE: spark-operator<br />STATUS: deployed<br />REVISION: 1<br />TEST SUITE: None</pre>
<h2 style="text-align: left;">5. Run a Spark Pi job without using Rapids Accelerator <br /></h2><p>This is just to make sure Spark Operator itself is working fine without adding complexity of troubleshooting. <br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">git clone https://github.com/GoogleCloudPlatform/spark-on-k8s-operator.git<br />cd spark-on-k8s-operator<br />kubectl apply -f examples/spark-pi.yaml</pre>
<p>Note: the Driver Pod will use the "spark" service account by default, so make sure you have either granted enough privileges to "spark" or modified the yaml file as needed (see the sketch after the output below for one way to create and grant the account). <br /></p><p>It should complete successfully: <br /></p><pre class="brush:bash; toolbar: false; auto-links: false">$ kubectl get pods<br />NAME READY STATUS RESTARTS AGE<br />spark-pi-driver 0/1 Completed 0 48s</pre>
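<p>If the "spark" service account does not exist yet, one way to create it and grant it permissions (shown here with the broad edit ClusterRole for simplicity; tighten this in a real cluster) is:</p>
<pre class="brush:bash; toolbar: false; auto-links: false"># Create the service account used by the driver pod and bind it to the edit role<br />kubectl create serviceaccount spark -n default<br />kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark</pre>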
<p>You can also check the status of sparkapplications (custom resource definition aka CRD) using kubectl:<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">$ kubectl get sparkapplications spark-pi -o=yaml<br />...<br />status:<br /> applicationState:<br /> state: COMPLETED<br />...</pre>
<p>Or describe it to get the events: <br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">$ kubectl describe sparkapplication spark-pi<br />...<br />Events:<br /> Type Reason Age From Message<br /> ---- ------ ---- ---- -------<br /> Normal SparkApplicationAdded 7m22s spark-operator SparkApplication spark-pi was added, enqueuing it for submission<br /> Normal SparkApplicationSubmitted 7m20s spark-operator SparkApplication spark-pi was submitted successfully<br /> Normal SparkDriverRunning 7m9s spark-operator Driver spark-pi-driver is running<br /> Normal SparkExecutorPending 7m4s spark-operator Executor spark-pi-d25689791e785e41-exec-1 is pending<br /> Normal SparkExecutorRunning 7m1s spark-operator Executor spark-pi-d25689791e785e41-exec-1 is running<br /> Normal SparkExecutorCompleted 6m58s (x2 over 6m58s) spark-operator Executor spark-pi-d25689791e785e41-exec-1 completed<br /> Normal SparkDriverCompleted 6m58s (x2 over 6m58s) spark-operator Driver spark-pi-driver completed<br /> Normal SparkApplicationCompleted 6m58s spark-operator SparkApplication spark-pi completed<br />...</pre>
<h2 style="text-align: left;">6. Build sparkctl</h2><p>sparkctl has more functionality to support Spark on K8s. It is shipped inside the downloaded Spark Operator repo.<br /></p><p>Let's build it and use it instead of kubectl.</p><h3 style="text-align: left;">6.1 Install Golang<br /></h3><p>Follow <a href="https://golang.org/doc/install" rel="nofollow" target="_blank">https://golang.org/doc/install</a> to install Golang on Mac.</p><p>After that, set the PATH in .bash_profile:<br /></p><pre class="brush:bash; toolbar: false; auto-links: false">export PATH=$PATH:/usr/local/go/bin</pre><h3 style="text-align: left;">6.2 Build sparkctl<br /></h3>
<pre class="brush:bash; toolbar: false; auto-links: false">cd sparkctl<br />go build -o sparkctl<br /></pre>
<p>After that, set PATH for this sparkctl as well.</p><h2 style="text-align: left;">7. Run a Spark job with Rapids Accelerator</h2><h3 style="text-align: left;">7.1 Create a yaml file named testpython-rapids.yaml<br /></h3>
<pre class="brush:text; toolbar: false; auto-links: false">apiVersion: "sparkoperator.k8s.io/v1beta2"<br />kind: SparkApplication<br />metadata:<br /> name: testpython-rapids<br /> namespace: default<br />spec:<br /> sparkConf:<br /> "spark.ui.port": "4045"<br /> "spark.rapids.sql.concurrentGpuTasks": "1"<br /> "spark.executor.resource.gpu.amount": "1"<br /> "spark.task.resource.gpu.amount": "1"<br /> "spark.executor.memory": "1g"<br /> "spark.rapids.memory.pinnedPool.size": "2g"<br /> "spark.executor.memoryOverhead": "3g"<br /> "spark.locality.wait": "0s"<br /> "spark.sql.files.maxPartitionBytes": "512m"<br /> "spark.sql.shuffle.partitions": "10"<br /> "spark.plugins": "com.nvidia.spark.SQLPlugin"<br /> "spark.executor.resource.gpu.discoveryScript": "/opt/sparkRapidsPlugin/getGpusResources.sh"<br /> "spark.executor.resource.gpu.vendor": "nvidia.com"<br /> "spark.executor.extraClassPath": "/opt/sparkRapidsPlugin/rapids-4-spark.jar:/opt/sparkRapidsPlugin/cudf.jar"<br /> "spark.driver.extraClassPath": "/opt/sparkRapidsPlugin/rapids-4-spark.jar:/opt/sparkRapidsPlugin/cudf.jar"<br /> type: Python<br /> pythonVersion: 3<br /> mode: cluster<br /> image: "<image>"<br /> imagePullPolicy: Always<br /> mainApplicationFile: "local:///opt/sparkRapidsPlugin/test.py"<br /> sparkVersion: "3.1.1"<br /> restartPolicy:<br /> type: Never<br /> volumes:<br /> - name: "test-volume"<br /> hostPath:<br /> path: "/tmp"<br /> type: Directory<br /> driver:<br /> cores: 1<br /> coreLimit: "1200m"<br /> memory: "1024m"<br /> labels:<br /> version: 3.1.1<br /> serviceAccount: spark<br /> volumeMounts:<br /> - name: "test-volume"<br /> mountPath: "/tmp"<br /> executor:<br /> cores: 1<br /> instances: 1<br /> memory: "5000m"<br /> gpu:<br /> name: "nvidia.com/gpu"<br /> quantity: 1<br /> labels:<br /> version: 3.1.1<br /> volumeMounts:<br /> - name: "test-volume"<br /> mountPath: "/tmp"</pre>
<h3 style="text-align: left;">7.2 Submit testpython-rapids<br /></h3>
<pre class="brush:text; toolbar: false; auto-links: false">sparkctl create testpython-rapids.yaml</pre>
<h3 style="text-align: left;">7.3 Check status of testpython-rapids<br /></h3>
<pre class="brush:text; toolbar: false; auto-links: false">sparkctl status testpython-rapids</pre>
<h3 style="text-align: left;">7.4 Check driver log<br /></h3><pre class="brush:text; toolbar: false; auto-links: false">sparkctl log testpython-rapids</pre><p>It should show GPU related query plan and the job results.</p>
<pre class="brush:text; toolbar: false; auto-links: false">== Physical Plan ==<br />GpuColumnarToRow false<br />+- GpuFilter (gpuisnotnull(value#0) AND NOT (value#0 = 1))<br /> +- GpuRowToColumnar TargetSize(2147483647)<br /> +- *(1) Scan ExistingRDD[value#0]</pre>
<h3 style="text-align: left;">7.5 Check executor log (when it is running)<br /></h3>
<pre class="brush:bash; toolbar: false; auto-links: false">sparkctl log testpython-rapids -e 1</pre>
<h3 style="text-align: left;">7.6 Check the events<br /></h3><pre class="brush:bash; toolbar: false; auto-links: false">sparkctl event testpython-rapids</pre>
<h3 style="text-align: left;">7.7 port forwarding (when driver is running)<br /></h3><pre class="brush:bash; toolbar: false; auto-links: false">sparkctl forward testpython-rapids --local-port 1234 --remote-port 4045</pre><p>Then open localhost:1234 in browser. </p><p>Note: here the remote port 4045 is what we set for "spark.ui.port" in the testpython-rapids.yaml.<br /></p><h3 style="text-align: left;">7.8 Delete the spark job<br /></h3>
<pre class="brush:bash; toolbar: false; auto-links: false">sparkctl delete testpython-rapids</pre>
<p><br /></p><h1 style="text-align: left;">Reference:</h1><ul style="text-align: left;"><li><a href="https://github.com/NVIDIA/spark-rapids/blob/branch-21.06/docs/get-started/getting-started-kubernetes.md">https://github.com/NVIDIA/spark-rapids/blob/branch-21.06/docs/get-started/getting-started-kubernetes.md</a></li><li><a href="https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/quick-start-guide.md" rel="nofollow" target="_blank">https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/quick-start-guide.md</a><br /></li><li><a href="https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md" rel="nofollow" target="_blank">https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md</a><br /></li></ul><p><br /></p><p><br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-56078363623768855412021-04-27T20:18:00.001-07:002021-04-28T11:00:54.577-07:00Rapids Accelerator compatibility related to spark.sql.legacy.parquet.datetimeRebaseModeInWrite<h1 style="text-align: left;">Goal:</h1><p>This article talked about the compatibility of <a href="https://nvidia.github.io/spark-rapids/" rel="nofollow" target="_blank">Rapids Accelerator for Spark </a>regarding parquet writing related to parameters <b><i>spark.sql.legacy.parquet.datetimeRebaseModeInWrite</i></b> etc.<br /></p><span><a name='more'></a></span><h1 style="text-align: left;">Env:</h1><p>Spark 3.1.1</p><p>Rapids Accelerator for Spark 0.5 snapshot <br /></p><h1 style="text-align: left;">Solution:</h1><p>Spark 3.0 made the change to use Proleptic Gregorian calendar instead of hybrid Gregorian+Julian calendar. So it caused some trouble when reading/writing to/from old "legacy" format from Spark 2.x.</p><p>Here is a <a href="https://www.waitingforcode.com/apache-spark-sql/whats-new-apache-spark-3-proleptic-calendar-date-time-management/read#backward_compatibility" rel="nofollow" target="_blank">nice blog</a> to explain the change, and I would strongly recommend read it firstly.<br /></p><ul style="text-align: left;"><li><a href="https://issues.apache.org/jira/browse/SPARK-31405" rel="nofollow" target="_blank">SPARK-31405</a> (starting from 3.0) introduced parameter <b><i>spark.sql.legacy.parquet.datetimeRebaseModeInWrite</i></b> which influences on writes of the following parquet logical types:DATE, TIMESTAMP_MILLIS, TIMESTAMP_MICROS. </li><li><a href="https://issues.apache.org/jira/browse/SPARK-33210" rel="nofollow" target="_blank">SPARK-33210</a> (starting from 3.1) introduced another parameter <b><i>spark.sql.legacy.parquet.int96RebaseModeInWrite</i></b> for INT96 type(timestamp).<br /></li></ul><p>Here are 3 values:</p><ul style="text-align: left;"><li><b>EXCEPTION</b> (Default): Spark will fail the writing if it sees ancient dates/timestamps that are ambiguous between the two calendars.</li><li><b>LEGACY</b>: Spark will rebase dates/timestamps from Proleptic Gregorian calendar to the legacy hybrid (Julian + Gregorian) calendar when writing Parquet files</li><li><b>CORRECTED</b>: Spark will not do rebase and write the dates/timestamps as it is.</li></ul><p>In CPU mode, let's firstly look at the behaviors.</p><h2 style="text-align: left;">1. CPU Mode</h2><h3 style="text-align: left;">1.1 <b>EXCEPTION (Default)</b></h3>
<pre class="brush:bash; toolbar: false; auto-links: false">import java.sql.Date<br />spark.conf.set("spark.rapids.sql.enabled", false)<br />spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "EXCEPTION")<br />Seq(Date.valueOf("1500-12-25")).toDF("dt").write.format("parquet").mode("overwrite").save("/tmp/testparquet_exception")</pre>
<p>It will fail with: <br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">Caused by: org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: <br />writing dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z into Parquet files can be dangerous, <br />as the files may be read by Spark 2.x or legacy versions of Hive later, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. <br />See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during writing, to get maximum interoperability. <br />Or set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'CORRECTED' to write the datetime values as it is, <br />if you are 100% sure that the written files will only be read by Spark 3.0+ or other systems that use Proleptic Gregorian calendar.</pre>
<h3 style="text-align: left;">1.2 LEGACY<br /></h3>
<pre class="brush:bash; toolbar: false; auto-links: false">spark.conf.set("spark.rapids.sql.enabled", false)<br />spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")<br />Seq(Date.valueOf("1500-12-25")).toDF("dt").write.format("parquet").mode("overwrite").save("/tmp/testparquet_legacy")<br />spark.read.parquet("/tmp/testparquet_legacy").createOrReplaceTempView("date_legacy")<br />spark.sql("SELECT * FROM date_legacy").explain<br />spark.sql("SELECT * FROM date_legacy").show</pre>
<p>Output:<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">scala> spark.sql("SELECT * FROM date_legacy").explain<br />== Physical Plan ==<br />*(1) ColumnarToRow<br />+- FileScan parquet [dt#30] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/testparquet_legacy], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<dt:date><br /><br /><br />scala> spark.sql("SELECT * FROM date_legacy").show<br />+----------+<br />| dt|<br />+----------+<br />|1500-12-25|<br />+----------+</pre>
<h3 style="text-align: left;">1.3 CORRECTED</h3>
<pre class="brush:bash; toolbar: false; auto-links: false">spark.conf.set("spark.rapids.sql.enabled", false)<br />spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")<br />Seq(Date.valueOf("1500-12-25")).toDF("dt").write.format("parquet").mode("overwrite").save("/tmp/testparquet_corrected")<br />spark.read.parquet("/tmp/testparquet_corrected").createOrReplaceTempView("date_corrected")<br />spark.sql("SELECT * FROM date_corrected").explain<br />spark.sql("SELECT * FROM date_corrected").show<br /></pre>
<p>Output:</p>
<pre class="brush:bash; toolbar: false; auto-links: false">scala> spark.sql("SELECT * FROM date_corrected").explain<br />== Physical Plan ==<br />*(1) ColumnarToRow<br />+- FileScan parquet [dt#46] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/testparquet_corrected], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<dt:date><br /><br /><br />scala> spark.sql("SELECT * FROM date_corrected").show<br />+----------+<br />| dt|<br />+----------+<br />|1500-12-25|<br />+----------+</pre>
<h2 style="text-align: left;">2. GPU Mode</h2><h3 style="text-align: left;">2.1 <b>EXCEPTION (Default)</b></h3>
<pre class="brush:bash; toolbar: false; auto-links: false">import java.sql.Date<br />spark.conf.set("spark.rapids.sql.enabled", true)<br />spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "EXCEPTION")<br />Seq(Date.valueOf("1500-12-25")).toDF("dt").write.format("parquet").mode("overwrite").save("/tmp/testparquet_exception")</pre>
<p>It will fail with: <br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">Caused by: org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: <br />writing dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z into Parquet files can be dangerous, <br />as the files may be read by Spark 2.x or legacy versions of Hive later, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. <br />See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during writing, to get maximum interoperability. <br />Or set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'CORRECTED' to write the datetime values as it is, <br />if you are 100% sure that the written files will only be read by Spark 3.0+ or other systems that use Proleptic Gregorian calendar.</pre>
<h3 style="text-align: left;">2.2 LEGACY</h3>
<pre class="brush:bash; toolbar: false; auto-links: false">spark.conf.set("spark.rapids.sql.enabled", true)<br />spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")<br />Seq(Date.valueOf("1500-12-25")).toDF("dt").write.format("parquet").mode("overwrite").save("/tmp/testparquet_legacy")<br />spark.read.parquet("/tmp/testparquet_legacy").createOrReplaceTempView("date_legacy")<br />spark.sql("SELECT * FROM date_legacy").explain<br />spark.sql("SELECT * FROM date_legacy").show</pre>
<p>The data writing can finish successfully since we use the LEGACY value, but it is done by CPU instead of GPU (see the warning message "Output <InsertIntoHadoopFsRelationCommand> cannot run on GPU because LEGACY rebase mode for dates and timestamps is not supported"):<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">scala> Seq(Date.valueOf("1500-12-25")).toDF("dt").write.format("parquet").mode("overwrite").save("/tmp/testparquet_legacy")<br />21/04/28 01:29:27 WARN GpuOverrides:<br />!Exec <DataWritingCommandExec> cannot run on GPU because not all data writing commands can be replaced<br /> !Output <InsertIntoHadoopFsRelationCommand> cannot run on GPU because LEGACY rebase mode for dates and timestamps is not supported<br /> !NOT_FOUND <LocalTableScanExec> cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.LocalTableScanExec could be found<br /> @Expression <AttributeReference> dt#66 could run on GPU</pre>
<p>Spark UI can show the query plan which is on CPU as well:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhwENHmBnHitG-gnevWdrCJEPhe4Hcs7Smq7C-wxcFQ-e2t4J99ULsXKg2fbcSDZtSUudLMxNqFotKFK5c6oaAchLfa986JVQ0dSF8jTGzWBqxAwplzsB3o1qUdQHpvJ5-4K3h8AFizR1c/s692/Screen+Shot+2021-04-27+at+6.36.52+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="564" data-original-width="692" height="522" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhwENHmBnHitG-gnevWdrCJEPhe4Hcs7Smq7C-wxcFQ-e2t4J99ULsXKg2fbcSDZtSUudLMxNqFotKFK5c6oaAchLfa986JVQ0dSF8jTGzWBqxAwplzsB3o1qUdQHpvJ5-4K3h8AFizR1c/w640-h522/Screen+Shot+2021-04-27+at+6.36.52+PM.png" width="640" /></a></div>The data reading fails with below error message and suggest us to set <b><i>spark.sql.legacy.parquet.datetimeRebaseModeInRead</i></b> to CORRECTED.<br />
<pre class="brush:bash; toolbar: false; auto-links: false">scala> spark.sql("SELECT * FROM date_legacy").explain<br />== Physical Plan ==<br />GpuColumnarToRow false<br />+- GpuFileGpuScan parquet [dt#69] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/testparquet_legacy], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<dt:date><br /><br /><br />scala> spark.sql("SELECT * FROM date_legacy").show<br />21/04/28 01:29:28 WARN TaskSetManager: Lost task 0.0 in stage 13.0 (TID 19) (111.111.111.111 executor 0): org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. The RAPIDS Accelerator does not support reading these 'LEGACY' files. To do so you should disable Parquet support in the RAPIDS Accelerator or set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the datetime values as it is.</pre>
<p>Even after setting <b><i>spark.sql.legacy.parquet.datetimeRebaseModeInRead</i></b> to CORRECTED or LEGACY, it still fails with the same error.<br /></p><h3 style="text-align: left;">2.3 CORRECTED</h3>
<pre class="brush:bash; toolbar: false; auto-links: false">spark.conf.set("spark.rapids.sql.enabled", true)<br />spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")<br />Seq(Date.valueOf("1500-12-25")).toDF("dt").write.format("parquet").mode("overwrite").save("/tmp/testparquet_corrected")<br />spark.read.parquet("/tmp/testparquet_corrected").createOrReplaceTempView("date_corrected")<br />spark.sql("SELECT * FROM date_corrected").explain<br />spark.sql("SELECT * FROM date_corrected").show</pre>
<p>The data writing can finish successfully on GPU since we use the CORRECTED value:<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">scala> Seq(Date.valueOf("1500-12-25")).toDF("dt").write.format("parquet").mode("overwrite").save("/tmp/testparquet_corrected")<br />21/04/28 01:58:23 WARN GpuOverrides:<br /> !NOT_FOUND <LocalTableScanExec> cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.LocalTableScanExec could be found<br /> @Expression <AttributeReference> dt#140 could run on GPU</pre>
<p>Spark UI can show the query plan which is on <span style="color: red;"><b>GPU</b></span> as well:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjwG3unWXDBD_6nTLkr2DwkLIHPClr1LjLdO9KNRfkNuEhrIV02v1Pajn8MO2Wgdgrbvv01D9CCflMHICR2OG3hWUaJZBvQclrULmoGELcP2qasLQPwccA4gvwZm8GktMSigoKmGxs5__M/s1168/Screen+Shot+2021-04-27+at+7.01.05+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1168" data-original-width="814" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjwG3unWXDBD_6nTLkr2DwkLIHPClr1LjLdO9KNRfkNuEhrIV02v1Pajn8MO2Wgdgrbvv01D9CCflMHICR2OG3hWUaJZBvQclrULmoGELcP2qasLQPwccA4gvwZm8GktMSigoKmGxs5__M/w446-h640/Screen+Shot+2021-04-27+at+7.01.05+PM.png" width="446" /></a></div>The data reading also works fine on GPU:<br />
<pre class="brush:bash; toolbar: false; auto-links: false">scala> spark.sql("SELECT * FROM date_corrected").explain<br />== Physical Plan ==<br />GpuColumnarToRow false<br />+- GpuFileGpuScan parquet [dt#143] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/testparquet_corrected], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<dt:date><br /><br /><br />scala> spark.sql("SELECT * FROM date_corrected").show<br />+----------+<br />| dt|<br />+----------+<br />|1500-12-25|<br />+----------+</pre>
<h2 style="text-align: left;">3. Int96 timestamp tests<br /></h2><p>Of course, we can do similar tests for int96 timestamp type using below scripts. </p><p>Here I will let you try it out. <br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">spark.conf.set("spark.rapids.sql.enabled", true)<br /><br />spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "EXCEPTION")<br />Seq(java.sql.Timestamp.valueOf("1500-01-01 00:00:00")).toDF("ts").write.format("parquet").mode("overwrite").save("/tmp/testparquet_exception")<br />spark.read.parquet("/tmp/testparquet_exception").createOrReplaceTempView("ts_exception")<br />spark.sql("SELECT * FROM ts_exception").explain<br />spark.sql("SELECT * FROM ts_exception").show<br /><br />spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "LEGACY")<br />Seq(java.sql.Timestamp.valueOf("1500-01-01 00:00:00")).toDF("ts").write.format("parquet").mode("overwrite").save("/tmp/testparquet_legacy")<br />spark.read.parquet("/tmp/testparquet_legacy").createOrReplaceTempView("ts_legacy")<br />spark.sql("SELECT * FROM ts_legacy").explain<br />spark.sql("SELECT * FROM ts_legacy").show<br /><br />spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")<br />Seq(java.sql.Timestamp.valueOf("1500-01-01 00:00:00")).toDF("ts").write.format("parquet").mode("overwrite").save("/tmp/testparquet_corrected")<br />spark.read.parquet("/tmp/testparquet_corrected").createOrReplaceTempView("ts_corrected")<br />spark.sql("SELECT * FROM ts_corrected").explain<br />spark.sql("SELECT * FROM ts_corrected").show</pre>
<h2 style="text-align: left;">4. 1582-10-15 behaviors<br /></h2><p>As you remember, the error message shows that "reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous".</p><p>Here we focus on date which is 1582-10-15.</p><p>Let's use below sample test program on both CPU mode and GPU mode, and change the date "1582-10-15" to older dates in the following tests.<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")<br />Seq(Date.valueOf("1582-10-15")).toDF("dt").write.format("parquet").mode("overwrite").save("/tmp/testparquet_legacy")<br /><br />spark.read.parquet("/tmp/testparquet_legacy").createOrReplaceTempView("date_legacy")<br />spark.sql("SELECT * FROM date_legacy").explain<br />spark.sql("SELECT * FROM date_legacy").show</pre>
<h3 style="text-align: left;">4.1 1582-10-15</h3><p>Both CPU and GPU Modes can successfully read it as 1582-10-15: </p>
<pre class="brush:bash; toolbar: false; auto-links: false">scala> spark.sql("SELECT * FROM date_legacy").show<br />+----------+<br />| dt|<br />+----------+<br />|1582-10-15|<br />+----------+</pre>
<h3 style="text-align: left;">4.2 1582-10-14<br /></h3><p>Both CPU and GPU Modes started to show ambiguous result: 1582-10-<span style="color: red;">24</span> which is "original date"+10:<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">scala> spark.sql("SELECT * FROM date_legacy").show<br />+----------+<br />| dt|<br />+----------+<br />|1582-10-24|<br />+----------+</pre>
<div style="text-align: left;">This "original date"+10 behavior lasts until 1582-10-05.</div><div style="text-align: left;"><h3 style="text-align: left;">4.3 1582-10-04</h3></div><p>CPU Mode can successfully read it as 1582-10-04 going forward: </p>
<pre class="brush:bash; toolbar: false; auto-links: false">scala> spark.sql("SELECT * FROM date_legacy").show<br />+----------+<br />| dt|<br />+----------+<br />|1582-10-04|<br />+----------+</pre><p><b>However GPU Mode will fail since 1582-10-04:</b></p>
<pre class="brush:bash; toolbar: false; auto-links: false">Caused by: org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. The RAPIDS Accelerator does not support reading these 'LEGACY' files. To do so you should disable Parquet support in the RAPIDS Accelerator or set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the datetime values as it is.<br /> at org.apache.spark.sql.rapids.execution.TrampolineUtil$.makeSparkUpgradeException(TrampolineUtil.scala:78)<br /> at com.nvidia.spark.RebaseHelper$.newRebaseExceptionInRead(RebaseHelper.scala:83)<br /> at com.nvidia.spark.rapids.MultiFileParquetPartitionReader.$anonfun$readToTable$3(GpuParquetScan.scala:1162)<br /> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)<br /> at com.nvidia.spark.rapids.MultiFileParquetPartitionReader.$anonfun$readToTable$2(GpuParquetScan.scala:1160)<br /> at com.nvidia.spark.rapids.MultiFileParquetPartitionReader.$anonfun$readToTable$2$adapted(GpuParquetScan.scala:1158)<br /> at com.nvidia.spark.rapids.Arm.closeOnExcept(Arm.scala:76)<br /> at com.nvidia.spark.rapids.Arm.closeOnExcept$(Arm.scala:74)<br /> at com.nvidia.spark.rapids.FileParquetPartitionReaderBase.closeOnExcept(GpuParquetScan.scala:504)<br /> at com.nvidia.spark.rapids.MultiFileParquetPartitionReader.readToTable(GpuParquetScan.scala:1158)<br /> at com.nvidia.spark.rapids.MultiFileParquetPartitionReader.$anonfun$readBatch$1(GpuParquetScan.scala:1113)<br /> at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)<br /> at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)<br /> at com.nvidia.spark.rapids.FileParquetPartitionReaderBase.withResource(GpuParquetScan.scala:504)<br /> at com.nvidia.spark.rapids.MultiFileParquetPartitionReader.readBatch(GpuParquetScan.scala:1098)<br /> at com.nvidia.spark.rapids.MultiFileParquetPartitionReader.next(GpuParquetScan.scala:926)<br /> at com.nvidia.spark.rapids.PartitionIterator.hasNext(GpuDataSourceRDD.scala:59)<br /> at com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(GpuDataSourceRDD.scala:76)<br /> at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)<br /> at org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:385)<br /> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)<br /> at com.nvidia.spark.rapids.GpuBaseLimitExec$$anon$1.hasNext(limit.scala:62)<br /> at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$$anon$1.partNextBatch(GpuShuffleExchangeExec.scala:208)<br /> at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$$anon$1.hasNext(GpuShuffleExchangeExec.scala:225)<br /> at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)<br /> at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)<br /> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)<br /> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)<br /> at org.apache.spark.scheduler.Task.run(Task.scala:131)<br /> at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)<br /> 
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)<br /> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)<br /> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)<br /> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)<br /> at java.base/java.lang.Thread.run(Thread.java:834)</pre>
<p> </p><p><br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-21284232423239270362021-04-20T23:14:00.003-07:002021-04-20T23:44:03.624-07:00Spark Code -- Dig into SparkListenerEvent<h1 style="text-align: left;">Goal:</h1><p>This article digs into the different types of SparkListenerEvent in the Spark event log with some examples. </p><p>Understanding this can help us know how to parse the Spark event log.<span></span></p><a name='more'></a><p></p><h1 style="text-align: left;">Env:</h1><p><a href="https://github.com/apache/spark/tree/v3.1.1" rel="nofollow" target="_blank">Apache Spark 3.1.1 source code </a><br /></p><h1 style="text-align: left;">Solution:</h1><p><b>WARNING: this article walks through all of the SparkListenerEvent types below in the Spark event log with examples. It contains lots of Apache Spark source code analysis. If you do not like reading a bunch of source code, you can stop now.</b> </p><p>As we know, the Spark event log can be shown nicely in the Spark HistoryServer(SHS) UI. Then why would we parse the Spark event log manually? </p><p>The answer is that SHS only shows a small portion of the event log. There is lots of good stuff inside the Spark event log, such as task metrics, SQL plan node accumulables, etc. <br /></p><p>Basically the event log is a file of JSON lines, with each line produced from one of the Scala case classes which extend a trait (interface) called "SparkListenerEvent". Those definitions are inside <a href="https://github.com/apache/spark/blob/v3.1.1/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala" rel="nofollow" target="_blank">core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala</a>.</p><p>Spark has its own <a href="https://github.com/apache/spark/blob/v3.1.1/core/src/main/scala/org/apache/spark/deploy/history/EventLogFileReaders.scala" rel="nofollow" target="_blank">EventLogFileReaders</a> which are backward compatible, so we do not need to write a JSON parser ourselves. One reason is that our own JSON parser could become out of date if the event log format changes in future Spark versions.</p><p>So if our interest is to parse the event log, we can learn how SHS parses it. The logic is inside <a href="https://github.com/apache/spark/blob/v3.1.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala" rel="nofollow" target="_blank">FsHistoryProvider.scala</a>:<br /></p>
<pre class="brush:java; toolbar: false; auto-links: false">Utils.tryWithResource(EventLogFileReader.openEventLog(lastFile.getPath, fs))</pre>
<p>If we used "<a href="https://stedolan.github.io/jq/" rel="nofollow" target="_blank">jq</a>" to format the event log in a human readable format, you can find the details of each json object. <br /></p><p>Now let's look into each of below 21 types of SparkListenerEvent:</p><p>Some of them are very simple and straightforward, but some of them are very difficult to understand the logic: especially there are 6 different types of events handling SQL plan accumulables with each other, and AQE related events may override the query plan got from previous events.<br /></p><ol style="text-align: left;"><li>SparkListenerLogStart</li><li>SparkListenerResourceProfileAdded</li><li>SparkListenerBlockManagerAdded</li><li>SparkListenerBlockManagerRemoved</li><li>SparkListenerEnvironmentUpdate</li><li>SparkListenerTaskStart</li><li>SparkListenerApplicationStart</li><li>SparkListenerExecutorAdded</li><li>SparkListenerExecutorRemoved</li><li>SparkListenerSQLExecutionStart</li><li>SparkListenerSQLExecutionEnd</li><li>SparkListenerDriverAccumUpdates</li><li>SparkListenerJobStart</li><li>SparkListenerStageSubmitted</li><li>SparkListenerTaskEnd</li><li>SparkListenerStageCompleted</li><li>SparkListenerJobEnd</li><li>SparkListenerTaskGettingResult</li><li>SparkListenerApplicationEnd</li><li>SparkListenerSQLAdaptiveExecutionUpdate</li><li>SparkListenerSQLAdaptiveSQLMetricUpdates<br /></li></ol><h3 style="text-align: left;">1. SparkListenerLogStart <br /></h3><p>Sample json object:</p>
<pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "SparkListenerLogStart",<br /> "Spark Version": "3.1.1"<br />}</pre>
<p>Case class definition:</p>
<pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerLogStart(sparkVersion: String) extends SparkListenerEvent</pre><p>Very straightforward we can get spark version from it.</p><h3 style="text-align: left;">2. SparkListenerResourceProfileAdded<br /></h3><p>Sample json object:</p>
<pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "SparkListenerResourceProfileAdded",<br /> "Resource Profile Id": 0,<br /> "Executor Resource Requests": {<br /> "cores": {<br /> "Resource Name": "cores",<br /> "Amount": 16,<br /> "Discovery Script": "",<br /> "Vendor": ""<br /> },<br /> "memory": {<br /> "Resource Name": "memory",<br /> "Amount": 81920,<br /> "Discovery Script": "",<br /> "Vendor": ""<br /> },<br /> "offHeap": {<br /> "Resource Name": "offHeap",<br /> "Amount": 0,<br /> "Discovery Script": "",<br /> "Vendor": ""<br /> },<br /> "gpu": {<br /> "Resource Name": "gpu",<br /> "Amount": 1,<br /> "Discovery Script": "/xxx/xxx/xxx/xxx/examples/src/main/scripts/getGpusResources.sh",<br /> "Vendor": ""<br /> }<br /> },<br /> "Task Resource Requests": {<br /> "cpus": {<br /> "Resource Name": "cpus",<br /> "Amount": 1<br /> },<br /> "gpu": {<br /> "Resource Name": "gpu",<br /> "Amount": 0.25<br /> }<br /> }<br />}</pre>
<p>Case class definition:</p>
<pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerResourceProfileAdded(resourceProfile: ResourceProfile)<br /> extends SparkListenerEvent</pre>
<p>What is <a href="https://github.com/apache/spark/blob/v3.1.1/core/src/main/scala/org/apache/spark/resource/ResourceProfile.scala" rel="nofollow" target="_blank">ResourceProfile</a>?<br /></p>
<pre class="brush:java; toolbar: false; auto-links: false">class ResourceProfile(<br /> val executorResources: Map[String, ExecutorResourceRequest],<br /> val taskResources: Map[String, TaskResourceRequest])</pre>
<p>What are <a href="https://github.com/apache/spark/blob/v3.1.1/core/src/main/scala/org/apache/spark/resource/ExecutorResourceRequest.scala" rel="nofollow" target="_blank">ExecutorResourceRequest</a> and <a href="https://github.com/apache/spark/blob/v3.1.1/core/src/main/scala/org/apache/spark/resource/TaskResourceRequests.scala" rel="nofollow" target="_blank">TaskResourceRequest</a>?<br /></p>
<pre class="brush:java; toolbar: false; auto-links: false">class ExecutorResourceRequest(<br /> val resourceName: String,<br /> val amount: Long,<br /> val discoveryScript: String = "",<br /> val vendor: String = "") extends Serializable {<br /> ...<br /><br />class TaskResourceRequests() extends Serializable {<br /> private val _taskResources = new ConcurrentHashMap[String, TaskResourceRequest]()<br /> def requests: Map[String, TaskResourceRequest] = _taskResources.asScala.toMap<br /> def requestsJMap: JMap[String, TaskResourceRequest] = requests.asJava<br /> def cpus(amount: Int): this.type = {<br /> def resource(resourceName: String, amount: Double): this.type = {<br /> ...</pre>
<p>After some digging, we know SparkListenerResourceProfileAdded contains the executor and task resource requests such as CPU, memory, GPU, etc. </p><p>The GPU resource is a little harder to get, because we need to look it up in a Map (as in the sketch above) instead of reading it directly from a method or a field. <br /></p><h3 style="text-align: left;">3. SparkListenerBlockManagerAdded <br /></h3><p>Sample json object:</p>
<pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "SparkListenerBlockManagerAdded",<br /> "Block Manager ID": {<br /> "Executor ID": "driver",<br /> "Host": "myhostname",<br /> "Port": 44159<br /> },<br /> "Maximum Memory": 3032481792,<br /> "Timestamp": 1618341863606,<br /> "Maximum Onheap Memory": 3032481792,<br /> "Maximum Offheap Memory": 0<br />}</pre>
<p>Case class definition:</p><pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerBlockManagerAdded(<br /> time: Long,<br /> blockManagerId: BlockManagerId,<br /> maxMem: Long,<br /> maxOnHeapMem: Option[Long] = None,<br /> maxOffHeapMem: Option[Long] = None) extends SparkListenerEvent {<br />}</pre>
<p>What is <a href="https://github.com/apache/spark/blob/v3.1.1/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala" rel="nofollow" target="_blank">BlockManagerId.scala</a>? <br /></p>
<pre class="brush:java; toolbar: false; auto-links: false">class BlockManagerId private (<br /> private var executorId_ : String,<br /> private var host_ : String,<br /> private var port_ : Int,<br /> private var topologyInfo_ : Option[String])<br /> extends Externalizable {</pre>
<p>SparkListenerBlockManagerAdded contains the executor's resource information such as executorId, hostname, port, and max memory size.<br /></p><h3 style="text-align: left;">4. SparkListenerBlockManagerRemoved <br /></h3><p>Sample json object:</p>
<pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "SparkListenerBlockManagerRemoved",<br /> "Block Manager ID": {<br /> "Executor ID": "1",<br /> "Host": "myhostname",<br /> "Port": 12345<br /> },<br /> "Timestamp": 1111111111111<br />}</pre>
<p>Case class definition:</p>
<pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerBlockManagerRemoved(time: Long, blockManagerId: BlockManagerId)</pre>
<p>SparkListenerBlockManagerRemoved contains the timestamp when an executor gets removed.</p><p>Normally it means some executor failed with an error, and we may see it come together with SparkListenerExecutorRemoved. <br /></p><h3 style="text-align: left;">5. SparkListenerEnvironmentUpdate <br /></h3><p>Sample json object:</p>
<pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "SparkListenerEnvironmentUpdate",<br /> "JVM Information": {<br /> "Java Home": "/xxx/xxx/xxx/envs/xxx",<br /> "Java Version": "11.0.9.1-internal (Oracle Corporation)",<br /> "Scala Version": "version 2.12.10"<br /> },<br /> "Spark Properties": {<br /> "spark.rapids.sql.exec.CollectLimitExec": "true",<br /> "spark.executor.resource.gpu.amount": "1",<br /> "spark.rapids.sql.concurrentGpuTasks": "1",<br /> ...<br /> }<br /> "Hadoop Properties": {<br /> "yarn.resourcemanager.amlauncher.thread-count": "50",<br /> "dfs.namenode.resource.check.interval": "5000",<br /> ...<br /> }<br /> "System Properties": {<br /> "java.io.tmpdir": "/tmp",<br /> "line.separator": "\n", <br /> ... <br /> }<br /> "Classpath Entries": {<br /> "/home/xxx/spark/jars/curator-framework-2.7.1.jar": "System Classpath",<br /> "/home/xxx/spark/jars/parquet-encoding-1.10.1.jar": "System Classpath",<br /> "/home/xxx/spark/jars/commons-dbcp-1.4.jar": "System Classpath",<br /> ...<br /> }<br />}</pre>
<p>Case class definition:</p>
<pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerEnvironmentUpdate(environmentDetails: Map[String, Seq[(String, String)]])<br /> extends SparkListenerEvent</pre>
<p>SparkListenerEnvironmentUpdate carries a Map which contains the Spark/Hadoop/System/... properties.</p><p>It is useful for doing parameter checks, for example with a lookup like the sketch above. <br /></p><h3 style="text-align: left;">6. SparkListenerTaskStart <br /></h3><p>Sample json object:</p>
<pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "SparkListenerTaskStart",<br /> "Stage ID": 0,<br /> "Stage Attempt ID": 0,<br /> "Task Info": {<br /> "Task ID": 0,<br /> "Index": 0,<br /> "Attempt": 0,<br /> "Launch Time": 1618341870400,<br /> "Executor ID": "0",<br /> "Host": "111.111.111.111",<br /> "Locality": "PROCESS_LOCAL",<br /> "Speculative": false,<br /> "Getting Result Time": 0,<br /> "Finish Time": 0,<br /> "Failed": false,<br /> "Killed": false,<br /> "Accumulables": []<br /> }<br />}</pre>
<p>Case class definition:</p><pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerTaskStart(stageId: Int, stageAttemptId: Int, taskInfo: TaskInfo)</pre><p>What is <a href="https://github.com/apache/spark/blob/v3.1.1/core/src/main/scala/org/apache/spark/scheduler/TaskInfo.scala" rel="nofollow" target="_blank">TaskInfo</a>? <br /></p>
<pre class="brush:text; toolbar: false; auto-links: false">class TaskInfo(<br /> val taskId: Long,<br /> /**<br /> * The index of this task within its task set. Not necessarily the same as the ID of the RDD<br /> * partition that the task is computing.<br /> */<br /> val index: Int,<br /> val attemptNumber: Int,<br /> val launchTime: Long,<br /> val executorId: String,<br /> val host: String,<br /> val taskLocality: TaskLocality.TaskLocality,<br /> val speculative: Boolean) {</pre>
<p>SparkListenerTaskStart contains the task start time and related executor information.</p><p>Note: Normally the accumulables are empty in the beginning. <br /></p><h3 style="text-align: left;">7. SparkListenerApplicationStart <br /></h3><p>Sample json object:</p><pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "SparkListenerApplicationStart",<br /> "App Name": "Spark Pi",<br /> "App ID": "app-20210413122423-0000",<br /> "Timestamp": 1618341862473,<br /> "User": "xxxx"<br />}</pre>
<p>Case class definition:</p><pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerApplicationStart(<br /> appName: String,<br /> appId: Option[String],<br /> time: Long,<br /> sparkUser: String,<br /> appAttemptId: Option[String],<br /> driverLogs: Option[Map[String, String]] = None,<br /> driverAttributes: Option[Map[String, String]] = None) extends SparkListenerEvent</pre>
<p>SparkListenerApplicationStart contains the application start time, application name, application ID and user name.</p><p>Normally there is only one such event in each event log.<br /></p><h3 style="text-align: left;">8. SparkListenerExecutorAdded <br /></h3><p>Sample json object:</p><pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "SparkListenerExecutorAdded",<br /> "Timestamp": 1618341865601,<br /> "Executor ID": "0",<br /> "Executor Info": {<br /> "Host": "111.111.111.111",<br /> "Total Cores": 16,<br /> "Log Urls": {<br /> "stdout": "http://111.111.111.111:8081/logPage/?appId=app-20210413122423-0000&executorId=0&logType=stdout",<br /> "stderr": "http://111.111.111.111:8081/logPage/?appId=app-20210413122423-0000&executorId=0&logType=stderr"<br /> },<br /> "Attributes": {},<br /> "Resources": {<br /> "gpu": {<br /> "name": "gpu",<br /> "addresses": [<br /> "0"<br /> ]<br /> }<br /> },<br /> "Resource Profile Id": 0<br /> }<br />} </pre><p>Case class definition:</p><pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerExecutorAdded(time: Long, executorId: String, executorInfo: ExecutorInfo)</pre><p>What is <a href="https://github.com/apache/spark/blob/v3.1.1//core/src/main/scala/org/apache/spark/scheduler/cluster/ExecutorInfo.scala" rel="nofollow" target="_blank">ExecutorInfo</a>? <br /></p><pre class="brush:java; toolbar: false; auto-links: false">class ExecutorInfo(<br /> val executorHost: String,<br /> val totalCores: Int,<br /> val logUrlMap: Map[String, String],<br /> val attributes: Map[String, String],<br /> val resourcesInfo: Map[String, ResourceInformation],<br /> val resourceProfileId: Int) { </pre>
<p>SparkListenerExecutorAdded contains the timestamp and executor information. </p><p>Note that it is associated with a resource profile. <br /></p><h3 style="text-align: left;">9. SparkListenerExecutorRemoved <br /></h3><p>Sample json object:</p><pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "SparkListenerExecutorRemoved",<br /> "Timestamp": 1111111111111,<br /> "Executor ID": "1",<br /> "Removed Reason": "Container from a bad node: container_1111111111111_1111_11_111111 on host: abc.abc.abc.abc"<br />}</pre>
<p>Case class definition:</p><pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerExecutorRemoved(time: Long, executorId: String, reason: String)</pre><p>SparkListenerExecutorRemoved contains the timestamp and the reason why an executor gets removed.</p><p>Normally it means executor fails due to some reason such as OOM. <br /></p><h3 style="text-align: left;">10. SparkListenerSQLExecutionStart <br /></h3><p>Sample json object:</p>
<pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart",<br /> "executionId": 3,<br /> "description": "select count(*) from customer a, customer b where a.c_customer_id=b.c_customer_id+10",<br /> "details": "org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)\njava.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\njava.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\njava.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\njava.base/java.lang.reflect.Method.invoke(Method.java:566)\norg.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)\norg.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)\norg.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)\norg.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)\norg.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)\norg.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1030)\norg.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1039)\norg.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)",<br /> "physicalPlanDescription": "== Physical Plan ==\nGpuColumnarToRow (14)\n+- GpuHashAggregate (13)\n +- GpuShuffleCoalesce (12)\n +- GpuColumnarExchange (11)\n +- GpuHashAggregate (10)\n +- GpuProject (9)\n +- GpuBroadcastHashJoin (8)\n :- GpuCoalesceBatches (3)\n : +- GpuFilter (2)\n : +- GpuScan parquet tpcds.customer (1)\n +- GpuBroadcastExchange (7)\n +- GpuCoalesceBatches (6)\n +- GpuFilter (5)\n +- GpuScan parquet tpcds.customer (4)\n\n\n(1) GpuScan parquet tpcds.customer\nOutput [1]: [c_customer_id#2]\nBatched: true\nLocation: InMemoryFileIndex [file:/home/xxxxx/data/tpcds_100G_parquet/customer]\nPushedFilters: [IsNotNull(c_customer_id)]\nReadSchema: struct<c_customer_id:string>\n\n(2) GpuFilter\nInput [1]: [c_customer_id#2]\nArguments: gpuisnotnull(c_customer_id#2)\n\n(3) GpuCoalesceBatches\nInput [1]: [c_customer_id#2]\nArguments: TargetSize(2147483647)\n\n(4) GpuScan parquet tpcds.customer\nOutput [1]: [c_customer_id#27]\nBatched: true\nLocation: InMemoryFileIndex [file:/home/xxxxx/data/tpcds_100G_parquet/customer]\nPushedFilters: [IsNotNull(c_customer_id)]\nReadSchema: struct<c_customer_id:string>\n\n(5) GpuFilter\nInput [1]: [c_customer_id#27]\nArguments: gpuisnotnull(c_customer_id#27)\n\n(6) GpuCoalesceBatches\nInput [1]: [c_customer_id#27]\nArguments: TargetSize(2147483647)\n\n(7) GpuBroadcastExchange\nInput [1]: [c_customer_id#27]\nArguments: HashedRelationBroadcastMode(List(knownfloatingpointnormalized(normalizenanandzero((cast(input[0, string, false] as double) + 10.0)))),false), [id=#97]\n\n(8) GpuBroadcastHashJoin\nLeft output [1]: [c_customer_id#2]\nRight output [1]: [c_customer_id#27]\nArguments: [gpuknownfloatingpointnormalized(gpunormalizenanandzero(cast(c_customer_id#2 as double)))], [gpuknownfloatingpointnormalized(gpunormalizenanandzero((cast(c_customer_id#27 as double) + 10.0)))], Inner, GpuBuildRight\n\n(9) GpuProject\nInput [2]: [c_customer_id#2, c_customer_id#27]\n\n(10) GpuHashAggregate\nInput: []\nKeys: []\nFunctions [1]: [partial_gpucount(1)]\nAggregate Attributes [1]: [count#46L]\nResults [1]: [count#47L]\n\n(11) GpuColumnarExchange\nInput [1]: [count#47L]\nArguments: gpusinglepartitioning$(), ENSURE_REQUIREMENTS, [id=#101]\n\n(12) 
GpuShuffleCoalesce\nInput [1]: [count#47L]\nArguments: 2147483647\n\n(13) GpuHashAggregate\nInput [1]: [count#47L]\nKeys: []\nFunctions [1]: [gpucount(1)]\nAggregate Attributes [1]: [count(1)#25L]\nResults [1]: [count(1)#25L AS count(1)#44L]\n\n(14) GpuColumnarToRow\nInput [1]: [count(1)#44L]\nArguments: false\n\n",<br /> "sparkPlanInfo": {<br /> "nodeName": "GpuColumnarToRow",<br /> "simpleString": "GpuColumnarToRow false",<br /> "children": [<br /> {<br /> "nodeName": "GpuHashAggregate",<br /> "simpleString": "GpuHashAggregate(keys=[], functions=[gpucount(1)]), filters=List(None))",<br /> "children": [<br /> ...<br /> "children": [<br /> {<br /> "nodeName": "GpuScan parquet tpcds.customer",<br /> "simpleString": "GpuFileGpuScan parquet tpcds.customer[c_customer_id#2] Batched: true, DataFilters: [isnotnull(c_customer_id#2)], Format: Parquet, Location: InMemoryFileIndex[file:/home/xxxxx/data/tpcds_100G_parquet/customer], PartitionFilters: [], PushedFilters: [IsNotNull(c_customer_id)], ReadSchema: struct<c_customer_id:string>",<br /> "children": [],<br /> "metadata": {},<br /> "metrics": [<br /> {<br /> "name": "number of files read",<br /> "accumulatorId": 209,<br /> "metricType": "sum"<br /> },</pre>
<p>Case class definition:</p>
<pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerSQLExecutionStart(<br /> executionId: Long,<br /> description: String,<br /> details: String,<br /> physicalPlanDescription: String,<br /> sparkPlanInfo: SparkPlanInfo,<br /> time: Long)<br /> extends SparkListenerEvent</pre>What is <a href="https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanInfo.scala" rel="nofollow" target="_blank">SparkPlanInfo</a>? <br />
<pre class="brush:java; toolbar: false; auto-links: false">class SparkPlanInfo(<br /> val nodeName: String,<br /> val simpleString: String,<br /> val children: Seq[SparkPlanInfo],<br /> val metadata: Map[String, String],<br /> val metrics: Seq[SQLMetricInfo]) {</pre>
<p>What is <a href="https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetricInfo.scala" rel="nofollow" target="_blank">SQLMetricInfo</a>? <br /></p><pre class="brush:java; toolbar: false; auto-links: false">class SQLMetricInfo(<br /> val name: String,<br /> val accumulatorId: Long,<br /> val metricType: String) </pre><p> Now we are getting the complex part. </p><p>SparkListenerSQLExecutionStart contains the query plan, and its accumulables(metrics) definition.</p><p>Remember that here the query plan information may be overridden by upcoming AQE related events SparkListenerSQLAdaptiveExecutionUpdate;</p><p>And the accumulables(metrics) definition could be overriden by upcoming AQE related events SparkListenerSQLAdaptiveSQLMetricUpdates.</p><p>So none of them are final. Please remember they may change later when parsing this event.<br /></p><p>Note: The SQL plan accumulables are associated with its SQL Plan Node by nodeID!<br /></p><p>For example, when the final parsing is done, it should show the mapping relationship between SQL plan nodeID <=> accumulatorId:</p>
<pre class="brush:text; toolbar: false; auto-links: false">+-----+------+---------------------+-------------+-------------------------+------------+----------+<br />|sqlID|nodeID|nodeName |accumulatorId|name |max_value |metricType|<br />+-----+------+---------------------+-------------+-------------------------+------------+----------+<br />|11 |5 |Scan parquet |123 |number of output rows |11 |sum |<br />|11 |5 |Scan parquet |124 |number of files read |1 |sum |<br />|11 |5 |Scan parquet |125 |metadata time |1 |timing |<br />|11 |5 |Scan parquet |126 |size of files read |1111 |size |<br />|11 |5 |Scan parquet |127 |scan time |11 |timing |</pre>
<h3 style="text-align: left;">11. SparkListenerSQLExecutionEnd <br /></h3><p>Sample json object:</p>
<pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionEnd",<br /> "executionId": 0,<br /> "time": 1617729547596<br />} </pre>
<p>Case class definition:</p><pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerSQLExecutionEnd(executionId: Long, time: Long)</pre><p>Easy: it contains the SQL end timestamp. If we map the end timestamp to previous start time, we can get the SQL duration in ms.<br /></p><h3 style="text-align: left;">12. SparkListenerDriverAccumUpdates <br /></h3><p>Sample json object:</p><pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "org.apache.spark.sql.execution.ui.SparkListenerDriverAccumUpdates",<br /> "executionId": 2,<br /> "accumUpdates": [<br /> [<br /> 67,<br /> 1<br /> ],<br /> [<br /> 68,<br /> 2<br /> ],<br /> [<br /> 69,<br /> 106281839<br /> ]<br /> ]<br />}</pre><p>Case class definition:</p><pre class="brush:java; toolbar: false; auto-links: false"> * @param executionId The execution id for a query, so we can find the query plan.<br /> * @param accumUpdates Map from accumulator id to the metric value (metrics are always 64-bit ints).<br /> <br />case class SparkListenerDriverAccumUpdates(<br /> executionId: Long,<br /> @JsonDeserialize(contentConverter = classOf[LongLongTupleConverter])<br /> accumUpdates: Seq[(Long, Long)])</pre>
<p>SparkListenerDriverAccumUpdates mainly sends the accumulator id => accumulator value pairs.</p><p>To figure out what an accumulator means, we need to join against the SQLMetricInfo obtained earlier from SparkListenerSQLExecutionStart and possibly from the upcoming SparkListenerSQLAdaptiveSQLMetricUpdates.</p><p>So we need to wait until all of the SparkListenerSQLExecutionStart and SparkListenerSQLAdaptiveSQLMetricUpdates events have been processed, and then match the accumulator id to get the accumulator name and its associated query plan node.<br /></p><h3 style="text-align: left;">13. SparkListenerJobStart <br /></h3><p>Sample json object:</p>
<pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "SparkListenerJobStart",<br /> "Job ID": 0,<br /> "Submission Time": 1617729577252,<br /> "Stage Infos": [<br /> {<br /> "Stage ID": 0,<br /> "Stage Attempt ID": 0,<br /> "Stage Name": "executeColumnar at GpuShuffleCoalesceExec.scala:67",<br /> "Number of Tasks": 16,<br /> "RDD Info": [<br /> {<br /> "RDD ID": 3,<br /> "Name": "MapPartitionsRDD",<br /> "Scope": "{\"id\":\"7\",\"name\":\"GpuColumnarExchange\"}",<br /> "Callsite": "executeColumnar at GpuShuffleCoalesceExec.scala:67",<br /> "Parent IDs": [<br /> 2<br /> ],<br /> "Storage Level": {<br /> "Use Disk": false,<br /> "Use Memory": false,<br /> "Deserialized": false,<br /> "Replication": 1<br /> },<br /> "Barrier": false,<br /> "Number of Partitions": 16,<br /> "Number of Cached Partitions": 0,<br /> "Memory Size": 0,<br /> "Disk Size": 0<br /> },<br /> ...<br /> "Accumulables": [],<br /> "Resource Profile Id": 0<br /> }<br /> ],<br /> "Stage IDs": [<br /> 0,<br /> 1,<br /> 2<br /> ],<br /> "Properties": {<br /> "spark.rapids.sql.exec.CollectLimitExec": "true",<br /> "spark.executor.resource.gpu.amount": "1",<br /> "spark.rapids.sql.concurrentGpuTasks": "1",<br /> ...<br /> }<br />}</pre>
<p>Case class definition:</p>
<pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerJobStart(<br /> jobId: Int,<br /> time: Long,<br /> stageInfos: Seq[StageInfo],<br /> properties: Properties = null)<br /> extends SparkListenerEvent {<br /> // Note: this is here for backwards-compatibility with older versions of this event which<br /> // only stored stageIds and not StageInfos:<br /> val stageIds: Seq[Int] = stageInfos.map(_.stageId)<br />}</pre>
<p>What is <a href="https://github.com/apache/spark/blob/v3.1.1/core/src/main/scala/org/apache/spark/scheduler/StageInfo.scala" rel="nofollow" target="_blank">StageInfo</a>? <br /></p>
<pre class="brush:java; toolbar: false; auto-links: false">class StageInfo(<br /> val stageId: Int,<br /> private val attemptId: Int,<br /> val name: String,<br /> val numTasks: Int,<br /> val rddInfos: Seq[RDDInfo],<br /> val parentIds: Seq[Int],<br /> val details: String,<br /> val taskMetrics: TaskMetrics = null,<br /> private[spark] val taskLocalityPreferences: Seq[Seq[TaskLocation]] = Seq.empty,<br /> private[spark] val shuffleDepId: Option[Int] = None,<br /> val resourceProfileId: Int) {<br /> /** When this stage was submitted from the DAGScheduler to a TaskScheduler. */<br /> var submissionTime: Option[Long] = None<br /> /** Time when all tasks in the stage completed or when the stage was cancelled. */<br /> var completionTime: Option[Long] = None<br /> /** If the stage failed, the reason why. */<br /> var failureReason: Option[String] = None<br /><br /> /**<br /> * Terminal values of accumulables updated during this stage, including all the user-defined<br /> * accumulators.<br /> */<br /> val accumulables = HashMap[Long, AccumulableInfo]()</pre>
<p>SparkListenerJobStart has the StageInfo which contains RDD information.</p><p>When a job starts, it may also contain modified properties which can override the application-level properties obtained from SparkListenerEnvironmentUpdate. </p><p>It means that, within the same application (event log), Spark parameters can change, so do not assume the parameters are always static inside the same application.</p><h3 style="text-align: left;">14. SparkListenerStageSubmitted<br /></h3><p>Sample json object:</p><pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "SparkListenerStageSubmitted",<br /> "Stage Info": {<br /> "Stage ID": 1,<br /> "Stage Attempt ID": 0,<br /> "Stage Name": "executeColumnar at GpuShuffleCoalesceExec.scala:67",<br /> "Number of Tasks": 1000,<br /> "RDD Info": [<br /> {<br /> "RDD ID": 8,<br /> "Name": "MapPartitionsRDD",<br /> "Scope": "{\"id\":\"3\",\"name\":\"GpuColumnarExchange\"}",<br /> "Callsite": "executeColumnar at GpuShuffleCoalesceExec.scala:67",<br /> "Parent IDs": [<br /> 7<br /> ...<br /> "Submission Time": 1617729578789,<br /> "Accumulables": [],<br /> "Resource Profile Id": 0<br /> },<br /> "Properties": {<br /> "spark.rapids.sql.exec.CollectLimitExec": "true",<br /> "spark.executor.resource.gpu.amount": "1",<br /> ...</pre><p>Case class definition:</p>
<pre class="brush:text; toolbar: false; auto-links: false">case class SparkListenerStageSubmitted(stageInfo: StageInfo, properties: Properties = null)</pre>
<p>Similar to SparkListenerJobStart, the StageInfo is the key content here.</p><p>And again, parameters could change here.<br /></p><h3 style="text-align: left;">15. SparkListenerTaskEnd <br /></h3><p>Sample json object:</p>
<pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "SparkListenerTaskEnd",<br /> "Stage ID": 1,<br /> "Stage Attempt ID": 0,<br /> "Task Type": "ShuffleMapTask",<br /> "Task End Reason": {<br /> "Reason": "Success"<br /> },<br /> "Task Info": {<br /> "Task ID": 17,<br /> "Index": 1,<br /> "Attempt": 0,<br /> "Launch Time": 1617729578802,<br /> "Executor ID": "0",<br /> "Host": "192.192.192.2",<br /> "Locality": "PROCESS_LOCAL",<br /> "Speculative": false,<br /> "Getting Result Time": 0,<br /> "Finish Time": 1617729578977,<br /> "Failed": false,<br /> "Killed": false,<br /> "Accumulables": [<br /> {<br /> "ID": 21,<br /> "Name": "output rows",<br /> "Update": "10",<br /> "Value": "10",<br /> "Internal": true,<br /> "Count Failed Values": true,<br /> "Metadata": "sql"<br /> },<br /> ...<br /> },<br /> "Task Executor Metrics": {<br /> "JVMHeapMemory": 0,<br /> "JVMOffHeapMemory": 0,<br /> "OnHeapExecutionMemory": 0,<br /> "OffHeapExecutionMemory": 0,<br /> ...<br /> "Task Metrics": {<br /> "Executor Deserialize Time": 73,<br /> "Executor Deserialize CPU Time": 16058445,<br /> "Executor Run Time": 92,<br /> "Executor CPU Time": 59345832,<br /> "Peak Execution Memory": 0,<br /> "Result Size": 5303,<br /> "JVM GC Time": 0,<br /> "Result Serialization Time": 0,<br /> "Memory Bytes Spilled": 0,<br /> "Disk Bytes Spilled": 0,<br /> "Shuffle Read Metrics": {<br /> "Remote Blocks Fetched": 0,<br /> "Local Blocks Fetched": 1,<br /> "Fetch Wait Time": 0,<br /> "Remote Bytes Read": 0,<br /> "Remote Bytes Read To Disk": 0,<br /> "Local Bytes Read": 20652,<br /> "Total Records Read": 1<br /> },<br /> "Shuffle Write Metrics": {<br /> "Shuffle Bytes Written": 86,<br /> "Shuffle Write Time": 2697954,<br /> "Shuffle Records Written": 1<br /> },<br /> "Input Metrics": {<br /> "Bytes Read": 0,<br /> "Records Read": 0<br /> },<br /> "Output Metrics": {<br /> "Bytes Written": 0,<br /> "Records Written": 0<br /> },<br /> "Updated Blocks": []<br /> }<br />}</pre>
<p>Case class definition:</p>
<pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerTaskEnd(<br /> stageId: Int,<br /> stageAttemptId: Int,<br /> taskType: String,<br /> reason: TaskEndReason,<br /> taskInfo: TaskInfo,<br /> taskExecutorMetrics: ExecutorMetrics,<br /> // may be null if the task has failed<br /> @Nullable taskMetrics: TaskMetrics)<br /> extends SparkListenerEvent</pre>
<p>What is <a href="https://github.com/apache/spark/blob/v3.1.1/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala" rel="nofollow" target="_blank">TaskMetrics</a>? <br /></p>
<pre class="brush:java; toolbar: false; auto-links: false">class TaskMetrics private[spark] () extends Serializable {<br /> // Each metric is internally represented as an accumulator<br /> private val _executorDeserializeTime = new LongAccumulator<br /> private val _executorDeserializeCpuTime = new LongAccumulator<br /> private val _executorRunTime = new LongAccumulator<br /> private val _executorCpuTime = new LongAccumulator<br /> private val _resultSize = new LongAccumulator<br /> private val _jvmGCTime = new LongAccumulator<br /> private val _resultSerializationTime = new LongAccumulator<br /> private val _memoryBytesSpilled = new LongAccumulator<br /> private val _diskBytesSpilled = new LongAccumulator<br /> private val _peakExecutionMemory = new LongAccumulator<br /> private val _updatedBlockStatuses = new CollectionAccumulator[(BlockId, BlockStatus)]</pre>
<p>SparkListenerTaskEnd may be the most important event if we want to profile performance based on the event log.</p><p>Normally a Spark performance checking tool aggregates these TaskMetrics at the stage, job or SQL level (as in the sketch above).</p><p>From previous events we can find the job <-> stage and SQL <-> job mappings; together with the task <-> stage mapping obtained from this event, we can easily join them together and do the aggregation.</p><p>Note that this event also sends out lots of accumulables.</p><p>Now we know how many of the events are sending and dealing with accumulables.<br /></p><h3 style="text-align: left;">16. SparkListenerStageCompleted<br /></h3><p>Sample json object:</p><pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "SparkListenerStageCompleted",<br /> "Stage Info": {<br /> "Stage ID": 0,<br /> "Stage Attempt ID": 0,<br /> "Stage Name": "executeColumnar at GpuShuffleCoalesceExec.scala:67",<br /> "Number of Tasks": 16,<br /> "RDD Info": [<br /> {<br /> "RDD ID": 3,<br /> "Name": "MapPartitionsRDD",<br /> "Scope": "{\"id\":\"7\",\"name\":\"GpuColumnarExchange\"}",<br /> "Callsite": "executeColumnar at GpuShuffleCoalesceExec.scala:67",<br /> "Parent IDs": [<br /> 2<br /> ],<br />...<br /> "Submission Time": 1617729577270,<br /> "Completion Time": 1617729578759,<br /> "Accumulables": [<br /> {<br /> "ID": 47,<br /> "Name": "output rows",<br /> "Value": "2000000",<br /> "Internal": true,<br /> "Count Failed Values": true,<br /> "Metadata": "sql"<br /> },<br /> ],<br /> "Resource Profile Id": 0<br /> }<br />}</pre><p>Case class definition:</p><pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerStageCompleted(stageInfo: StageInfo) extends SparkListenerEvent</pre>
<p>Again : StageInfo is the key content, and again, accumulables inside StageInfo.</p><h3 style="text-align: left;">17. SparkListenerJobEnd <br /></h3><p>Sample json object:</p><pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "SparkListenerJobEnd",<br /> "Job ID": 0,<br /> "Completion Time": 1617729581438,<br /> "Job Result": {<br /> "Result": "JobSucceeded"<br /> }<br />}</pre><p>Case class definition:</p><pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerJobEnd(<br /> jobId: Int,<br /> time: Long,<br /> jobResult: JobResult)<br /> extends SparkListenerEvent</pre><p>SparkListenerJobEnd shows the job end timestamp which can be calculated to job duration.</p><p>Here <a href="https://github.com/apache/spark/blob/v3.1.1/core/src/main/scala/org/apache/spark/scheduler/JobResult.scala" rel="nofollow" target="_blank">JobResult</a> is a trait which can be used to fetch job status when finishing.<br /></p><h3 style="text-align: left;">18. SparkListenerTaskGettingResult <br /></h3><p>Sample json object:</p><pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "SparkListenerTaskGettingResult",<br /> "Task Info": {<br /> "Task ID": 1024,<br /> "Index": 7,<br /> "Attempt": 0,<br /> "Launch Time": 1617729607875,<br /> "Executor ID": "0",<br /> "Host": "111.111.111.111",<br /> "Locality": "PROCESS_LOCAL",<br /> "Speculative": false,<br /> "Getting Result Time": 1617729608076,<br /> "Finish Time": 0,<br /> "Failed": false,<br /> "Killed": false,<br /> "Accumulables": []<br /> }<br />} </pre><p>Case class definition:</p><pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerTaskGettingResult(taskInfo: TaskInfo) extends SparkListenerEvent</pre><p>SparkListenerTaskGettingResult can show the getting result time for specific task. <br /></p><h3 style="text-align: left;">19. SparkListenerApplicationEnd <br /></h3><p>Sample json object:</p><pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "SparkListenerApplicationEnd",<br /> "Timestamp": 1617729611879<br />}</pre><p>Case class definition:</p><pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerApplicationEnd(time: Long) extends SparkListenerEvent</pre><p>SparkListenerApplicationEnd only let us know the end timestamp for the application.<br /></p><h3 style="text-align: left;">20. SparkListenerSQLAdaptiveExecutionUpdate <br /></h3><p>Sample json object:</p><pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "org.apache.spark.sql.execution.ui.SparkListenerSQLAdaptiveExecutionUpdate",<br /> "executionId": 11,<br /> "physicalPlanDescription": "== Parsed Logical Plan ==...<br /> "sparkPlanInfo": {<br /> "nodeName": "GpuColumnarToRow",<br /> "simpleString": "GpuColumnarToRow false",<br /> "children": [<br /> {</pre><p>Case class definition:</p><pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerSQLAdaptiveExecutionUpdate(<br /> executionId: Long,<br /> physicalPlanDescription: String,<br /> sparkPlanInfo: SparkPlanInfo)<br /> extends SparkListenerEvent</pre><p>SparkListenerSQLAdaptiveExecutionUpdate can be triggered when AQE is on, and it will override the query plan got from previous event SparkListenerSQLExecutionStart.</p><p>So if AQE is turned on(or in the future Spark 3.2 may turn on AQE by default), make sure wait for processing SparkListenerSQLAdaptiveExecutionUpdate before processing the query plan. 
<br /></p><p>This can impact accumulables because accumulables are defined inside SparkPlanInfo.</p><p>So the best way is to wait until all AQE related events have arrived, and then deduplicate the SparkPlanInfo collected before starting to calculate any accumulables.</p><h3 style="text-align: left;">21. SparkListenerSQLAdaptiveSQLMetricUpdates <br /></h3><p>Sample json object:</p><pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "org.apache.spark.sql.execution.ui.SparkListenerSQLAdaptiveSQLMetricUpdates",<br /> "executionId": 11,<br /> "sqlPlanMetrics": [<br /> {<br /> "name": "shuffle records written",<br /> "accumulatorId": 1111,<br /> "metricType": "sum"<br /> },<br /> {<br /> "name": "shuffle write time",<br /> "accumulatorId": 2222,<br /> "metricType": "nsTiming"<br /> },</pre><p>Case class definition:</p><pre class="brush:java; toolbar: false; auto-links: false"> case class SparkListenerSQLAdaptiveSQLMetricUpdates(<br /> executionId: Long,<br /> sqlPlanMetrics: Seq[SQLPlanMetric])<br /> extends SparkListenerEvent</pre><p>Again, accumulables. This event will update/add accumulables from SQLPlanMetric.<br /></p><p> </p><p>In all, there are many different kinds of events in the Spark event log, and I believe there could be more.</p><p>We need to look into the Spark source code to understand how they work together to define the performance metrics at the application, SQL, job, stage and task levels.</p><p>Especially for accumulables, there are more than 6 types of events dealing with them:</p><ul style="text-align: left;"><li>Define accumulable types: SparkListenerSQLExecutionStart, SparkListenerSQLAdaptiveExecutionUpdate <br /></li><li>Send accumulable values: SparkListenerTaskEnd, SparkListenerStageCompleted, SparkListenerDriverAccumUpdates, SparkListenerSQLAdaptiveSQLMetricUpdates<br /></li></ul><p>For example, to calculate the max value of an accumulator, you may need to scan through all of the above events to get the real max value. <br /></p><p> </p><p> </p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-80343546898360879092021-04-20T21:00:00.003-07:002021-04-21T12:02:46.214-07:00How to use latest version of Rapids Accelerator for Spark on EMR<h1 style="text-align: left;">Goal:</h1><p>This article shows how to use the latest version of the Rapids Accelerator for Spark on EMR. </p><p>Currently the latest EMR 6.2 only ships with Rapids Accelerator 0.2.0 with the cuDF 0.15 jar.</p><p>However as of today, the latest Rapids Accelerator is 0.4.1 with the cuDF 0.18 jar.</p><p></p><p><b>Note: These are NOT official steps for enabling RAPIDS+Spark on EMR, just some technical research.</b><br /></p><a name='more'></a><p></p><h1 style="text-align: left;">Env:</h1><p>EMR 6.2 <br /></p><h1 style="text-align: left;">Concept:</h1><p>As per the EMR doc on <a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-rapids.html" rel="nofollow" target="_blank">Using the Nvidia Spark-RAPIDS Accelerator for Spark</a>, it provides an option "enableSparkRapids":"true" in the configuration file when creating the EMR cluster.</p><p>Basically, before we look for a way to use the latest version of the Rapids Accelerator for Spark, we need to understand what this option does. </p><p>As per my tests on EMR 6.2, this option does the following:</p><p><b>1. Put the Rapids Accelerator 0.2.0 jar and cuDF 0.15 jar in the location below with soft links</b><br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">/usr/lib/spark/jars/rapids-4-spark_2.12-0.2.0.jar -> /usr/share/aws/emr/spark-rapids/lib/rapids-4-spark_2.12-0.2.0.jar<br />/usr/lib/spark/jars/cudf-0.15-cuda10-1.jar -> /usr/share/aws/emr/spark-rapids/lib/cudf-0.15-cuda10-1.jar</pre>
<p><b>2. Put the getGpusResources.sh and xgboost4j-spark_3.0-1.0.0-0.2.0.jar</b></p>
<pre class="brush:bash; toolbar: false; auto-links: false">/usr/lib/spark/jars/xgboost4j-spark_3.0-1.0.0-0.2.0.jar<br />/usr/lib/spark/scripts/gpu/getGpusResources.sh</pre>
<p>Now here is another action item which is done regardless of the option (even when "enableSparkRapids":"false"):<br /></p><p><b>3. Install the CUDA toolkit 10.1 with the soft link /usr/local/cuda pointing to it.<br /></b></p>
<pre class="brush:bash; toolbar: false; auto-links: false">/usr/local/cuda -> /mnt/nvidia/cuda-10.1</pre>
<p>Knowing all of the above, we may think of using <a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html" rel="nofollow" target="_blank">bootstrap actions</a> to replace those jars and install a newer version of the CUDA toolkit, say 11.0. </p><p>Unfortunately that alone does not work, because our bootstrap action script runs BEFORE the above steps. </p><p>It is as if the above steps were a second round of bootstrap actions.</p><p>Even if we use a bootstrap action script to replace the above jars with the latest versions and also install the latest CUDA toolkit 11.0 (which changes the soft link /usr/local/cuda to point to cuda-11.0), eventually you will see 2 versions of the Rapids Accelerator and cuDF jars in the same location, and /usr/local/cuda will be changed back to point to cuda-10.1.</p><h1 style="text-align: left;">Solution:</h1><p>The solution is to disable the option by setting it to false in the configuration: "enableSparkRapids":"false".</p><p>Since we already know what this option does, we just need to use bootstrap actions to mimic the same thing (of course, using the latest and greatest versions). <br /></p><h3 style="text-align: left;">1. Install CUDA Toolkit 11.0 and cuda-compat-11-0 </h3><p>We cannot simply install CUDA Toolkit 11.0 because the NVIDIA driver installed on EMR 6.2 is R418, while as per the <a href="https://docs.nvidia.com/deploy/cuda-compatibility/index.html" rel="nofollow" target="_blank">CUDA compatibility matrix</a>, the minimum driver version required by CUDA 11.0 is >= 450.36.06. </p><p>To make CUDA Toolkit 11.0 work on a lower driver version (forward compatibility), we need to install a package named "cuda-compat".<br /></p><p>First, we can find the commands to install this version on the <a href="https://developer.nvidia.com/cuda-downloads" rel="nofollow" target="_blank">CUDA download page</a>. <br /></p><p>Then how do we know the OS version on EMR? EMR has its own customized Linux OS, "Amazon Linux 2":</p>
<pre class="brush:bash; toolbar: false; auto-links: false"># cat /etc/os-release<br />NAME="Amazon Linux"<br />VERSION="2"<br />ID="amzn"<br />ID_LIKE="centos rhel fedora"<br />VERSION_ID="2"<br />PRETTY_NAME="Amazon Linux 2"<br />ANSI_COLOR="0;33"<br />CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"<br />HOME_URL="https://amazonlinux.com/"</pre>
<p>To figure out which package is compatible, we can get the base OS version by using this command:</p><pre class="brush:bash; toolbar: false; auto-links: false">rpm -E %{rhel}</pre><p>The above will tell you it is Red Hat 7 based (or compatible), so we know which OS version to choose.<br /></p><p>The commands below are what we need:</p>
<pre class="brush:bash; toolbar: false; auto-links: false">sudo yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo<br />sudo yum clean all<br />sudo yum -y install cuda-toolkit-11-0<br />sudo yum -y install cuda-compat-11-0 </pre>
<h3 style="text-align: left;">2. Fetch the Rapids Accelerator jar and cuDF jar</h3><p>You can always fetch the latest versions(or whatever version you want) by going to this <a href="https://nvidia.github.io/spark-rapids/docs/download.html" rel="nofollow" target="_blank">download page</a>. </p><p>Save the URLs for those 2 jars. Or you can choose to download them firstly and upload on a S3 bucket.<br /></p><p>In below example, I will fetch one jar directly from a URL, and fetch another jar from S3 bucket. <br /></p><h3 style="text-align: left;">3. Fetch the xgboost4j-spark jar<br /></h3><p> For spark 3.0, the latest jar can be downloaded here.</p><p><a href="https://repo1.maven.org/maven2/com/nvidia/xgboost4j-spark_3.0/" rel="nofollow" target="_blank">https://repo1.maven.org/maven2/com/nvidia/xgboost4j-spark_3.0/</a><br /></p><p>As of today, the latest version is:</p><p><a href="https://repo1.maven.org/maven2/com/nvidia/xgboost4j-spark_3.0/1.3.0-0.1.0/xgboost4j-spark_3.0-1.3.0-0.1.0.jar" rel="nofollow" target="_blank">https://repo1.maven.org/maven2/com/nvidia/xgboost4j-spark_3.0/1.3.0-0.1.0/xgboost4j-spark_3.0-1.3.0-0.1.0.jar </a><br /></p><p>Save this link.<br /></p><h3 style="text-align: left;">4. Fetch the getGpusResources.sh<br /></h3><p>Basically this file exist in Spark directory as well, but sometimes we do not know if our bootstrap script or some other EMR internal bootstrap script will run firstly.</p><p>It is better to always choose a stable link. Here let's use below link:</p><p><a href="https://raw.githubusercontent.com/apache/spark/master/examples/src/main/scripts/getGpusResources.sh" rel="nofollow" target="_blank">https://raw.githubusercontent.com/apache/spark/master/examples/src/main/scripts/getGpusResources.sh</a><br /></p><h3 style="text-align: left;">5. Prepare a bootstrap action script<br /></h3><p>Sample script named bootstrap-install-cuda-compat-11.sh:</p>
<pre class="brush:bash; toolbar: false; auto-links: false">#!/bin/bash<br /><br />set -ex<br /><br />sudo chmod a+rwx -R /sys/fs/cgroup/cpu,cpuacct<br />sudo chmod a+rwx -R /sys/fs/cgroup/devices<br /><br />echo "Install the cuda-compat-11-0"<br />sudo yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo<br />sudo yum clean all<br />sudo yum -y install cuda-toolkit-11-0<br />sudo yum -y install cuda-compat-11-0 <br />sudo rm -f /usr/lib/spark/jars/rapids-4-spark_2.12-0.2.0.jar<br />sudo rm -f /usr/share/aws/emr/spark-rapids/lib/rapids-4-spark_2.12-0.2.0.jar<br />sudo rm -f /usr/lib/spark/jars/cudf-0.15-cuda10-1.jar<br />sudo rm -f /usr/share/aws/emr/spark-rapids/lib/cudf-0.15-cuda10-1.jar<br />sudo mkdir -p /usr/share/aws/emr/spark-rapids/lib/<br />sudo mkdir -p /usr/lib/spark/jars/<br />sudo wget https://xxx/cudf-<version>.jar -O /usr/share/aws/emr/spark-rapids/lib/cudf-<version>.jar<br />sudo ln -s /usr/share/aws/emr/spark-rapids/lib/cudf-<version>.jar /usr/lib/spark/jars/cudf-<version>.jar<br />sudo aws s3 cp s3://<BUCKET-NAME>/rapids-4-spark_<version>.jar /usr/share/aws/emr/spark-rapids/lib/rapids-4-spark_<version>.jar<br />sudo ln -s /usr/share/aws/emr/spark-rapids/lib/rapids-4-spark_<version>.jar /usr/lib/spark/jars/rapids-4-spark_<version>.jar<br />sudo wget https://repo1.maven.org/maven2/com/nvidia/xgboost4j-spark_3.0/1.3.0-0.1.0/xgboost4j-spark_3.0-1.3.0-0.1.0.jar -O /usr/lib/spark/jars/xgboost4j-spark_3.0-1.3.0-0.1.0.jar<br />sudo mkdir -p /usr/lib/spark/scripts/gpu/<br />sudo wget https://raw.githubusercontent.com/apache/spark/master/examples/src/main/scripts/getGpusResources.sh -O /usr/lib/spark/scripts/gpu/getGpusResources.sh<br />sudo chmod +x /usr/lib/spark/scripts/gpu/getGpusResources.sh<br />sudo alternatives --set java /usr/lib/jvm/java-11-amazon-corretto.x86_64/bin/java</pre>
<p>Of course, you can make the above shell script more robust by adding more checks, but this is just a minimal demo.</p><p>You can find many other EMR bootstrap action scripts to refer to in <a href="https://github.com/aws-samples/emr-bootstrap-actions" rel="nofollow" target="_blank">this GitHub repository</a>.<br /></p><p>Then copy the above bootstrap action script to an S3 bucket:</p>
<pre class="brush:bash; toolbar: false; auto-links: false">chmod +x bootstrap-install-cuda-compat-11.sh<br />aws s3 cp bootstrap-install-cuda-compat-11.sh s3://BUCKET-NAME/bootstrap-install-cuda-compat-11.sh</pre>
<h3 style="text-align: left;">6. Prepare a configuration file<br /></h3><p>Say the name is EMR_java11_custom_bootstrap.json: <br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false; highlight: [5,50]">[<br /> {<br /> "Classification": "spark",<br /> "Properties": {<br /> "enableSparkRapids": "false"<br /> },<br /> "Configurations": []<br /> },<br /> {<br /> "Classification": "yarn-site",<br /> "Properties": {<br /> "yarn.nodemanager.linux-container-executor.cgroups.mount": "true",<br /> "yarn.nodemanager.linux-container-executor.cgroups.mount-path": "/sys/fs/cgroup",<br /> "yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables": "/usr/bin",<br /> "yarn.nodemanager.linux-container-executor.cgroups.hierarchy": "yarn",<br /> "yarn.nodemanager.container-executor.class": "org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor",<br /> "yarn.resource-types": "yarn.io/gpu",<br /> "yarn.nodemanager.resource-plugins": "yarn.io/gpu",<br /> "yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices": "auto"<br /> },<br /> "Configurations": []<br /> },<br /> {<br /> "Classification": "container-executor",<br /> "Properties": {},<br /> "Configurations": [<br /> {<br /> "Classification": "gpu",<br /> "Properties": {<br /> "module.enabled": "true"<br /> },<br /> "Configurations": []<br /> },<br /> {<br /> "Classification": "cgroups",<br /> "Properties": {<br /> "root": "/sys/fs/cgroup",<br /> "yarn-hierarchy": "yarn"<br /> },<br /> "Configurations": []<br /> }<br /> ]<br /> },<br /> {<br /> "Classification": "spark-defaults",<br /> "Properties": {<br /> "spark.task.cpus ": "1",<br /> "spark.rapids.sql.explain": "ALL",<br /> "spark.submit.pyFiles": "/usr/lib/spark/jars/xgboost4j-spark_3.0-1.3.0-0.1.0.jar",<br /> "spark.executor.extraLibraryPath": "/usr/local/cuda-11.0/targets/x86_64-linux/lib:/usr/local/cuda-11.0/extras/CUPTI/lib64:/usr/local/cuda-11.0/compat/:/usr/local/cuda-11.0/lib:/usr/local/cuda-11.0/lib64:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native",<br /> "spark.plugins": "com.nvidia.spark.SQLPlugin",<br /> "spark.executor.cores": "1",<br /> "spark.sql.files.maxPartitionBytes": "512m",<br /> "spark.executor.resource.gpu.discoveryScript": "/usr/lib/spark/scripts/gpu/getGpusResources.sh",<br /> "spark.sql.shuffle.partitions": "200",<br /> "spark.executor.defaultJavaOptions": "-XX:+IgnoreUnrecognizedVMOptions",<br /> "spark.task.resource.gpu.amount": "0.0625",<br /> "spark.rapids.memory.pinnedPool.size": "2G",<br /> "spark.executor.resource.gpu.amount": "1",<br /> "spark.rapids.sql.enabled": "true",<br /> "spark.sql.adaptive.enabled": "false",<br /> "spark.locality.wait": "0s",<br /> "spark.sql.sources.useV1SourceList": "",<br /> "spark.executor.memoryOverhead": "2G",<br /> "spark.driver.defaultJavaOptions": "-XX:+IgnoreUnrecognizedVMOptions",<br /> "spark.rapids.sql.concurrentGpuTasks": "1"<br /> },<br /> "Configurations": []<br /> },<br /> {<br /> "Classification": "capacity-scheduler",<br /> "Properties": {<br /> "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"<br /> },<br /> "Configurations": []<br /> },<br /> {<br /> "Classification": "spark-env",<br /> "Properties": {},<br /> "Configurations": [<br /> {<br /> "Classification": "export",<br /> "Properties": {<br /> "JAVA_HOME": "/usr/lib/jvm/java-11-amazon-corretto.x86_64/"<br /> },<br /> "Configurations": []<br /> }<br /> ]<br /> }<br />]</pre>
<p><b>Note: in the above configuration file, we specified /usr/local/cuda-11.0 in "spark.executor.extraLibraryPath" because the soft link /usr/local/cuda still points to the old cuda-10.1.</b></p><p><b>Note: /usr/local/cuda-11.0/compat/ contains the libs from the cuda-compat-11-0 package we installed earlier. </b><br /></p><h3 style="text-align: left;">7. Start the EMR cluster using CLI <br /></h3>
<pre class="brush:bash; toolbar: false; auto-links: false">aws emr create-cluster \<br />--release-label emr-6.2.0 \<br />--applications Name=Hadoop Name=Spark Name=Livy Name=JupyterEnterpriseGateway \<br />--service-role EMR_DefaultRole \<br />--ec2-attributes KeyName=hao-emr,InstanceProfile=EMR_EC2_DefaultRole \<br />--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.4xlarge \<br /> InstanceGroupType=CORE,InstanceCount=1,InstanceType=g4dn.2xlarge \<br /> InstanceGroupType=TASK,InstanceCount=1,InstanceType=g4dn.2xlarge \<br />--configurations file:///xxx/EMR_java11_custom_bootstrap.json \<br />--bootstrap-actions Name='My Spark Rapids Bootstrap action',Path=s3://BUCKET-NAME/bootstrap-install-cuda-compat-11.sh \<br />--ebs-root-volume-size 100 </pre>
<p><b>Note: the EBS root volume size should be increased from the default 10G to a larger value to avoid running out of disk space when installing packages using yum. <br /></b></p><h3 style="text-align: left;">8. Monitor the bootstrap process<br /></h3><p>Normally the master node will be ready first. So SSH into the master node, and find the bootstrap actions' logs here: /mnt/var/log/bootstrap-actions</p><h3 style="text-align: left;">9. Test</h3><p>Once all nodes are ready, run the following in spark-shell from the master node to make sure the GPU plan is shown:</p>
<pre class="brush:bash; toolbar: false; auto-links: false">val data = 1 to 100<br />val df1 = sc.parallelize(data).toDF()<br />val df2 = sc.parallelize(data).toDF()<br />val out = df1.as("df1").join(df2.as("df2"), $"df1.value" === $"df2.value")<br />out.count()<br />out.explain()</pre>
<h3 style="text-align: left;">10. Delete the EMR cluster once tests are done.<br /></h3>
<pre class="brush:bash; toolbar: false; auto-links: false">aws emr terminate-clusters --cluster-ids j-xxxxxxxxxxx</pre>
<h2 style="text-align: left;">Common issues<br /></h2><p><b>1. ERROR NativeDepsLoader: Could not load cudf jni library...</b></p><p>Below errors and stack trace show in Spark executor logs when launching spark-shell:<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">Caused by: java.util.concurrent.ExecutionException: java.lang.UnsatisfiedLinkError: /mnt/yarn/usercache/hadoop/appcache/application_xxx_xxx/container_xxx_xxx_01_xxxxx/tmp/nvcomp4429409488498215695.so: libcudart.so.11.0: cannot open shared object file: No such file or directory<br /> at java.util.concurrent.FutureTask.report(FutureTask.java:122)<br /> at java.util.concurrent.FutureTask.get(FutureTask.java:192)<br /> at ai.rapids.cudf.NativeDepsLoader.loadNativeDeps(NativeDepsLoader.java:167)<br /> ... 34 more<br />Caused by: java.lang.UnsatisfiedLinkError: /mnt/yarn/usercache/hadoop/appcache/application_xxx_xxx/container_xxx_xxx_01_xxxxx/tmp/nvcomp4429409488498215695.so: libcudart.so.11.0: cannot open shared object file: No such file or directory<br /> at java.lang.ClassLoader$NativeLibrary.load(Native Method)<br /> at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1934)<br /> at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1817)<br /> at java.lang.Runtime.load0(Runtime.java:810)<br /> at java.lang.System.load(System.java:1088)<br /> at ai.rapids.cudf.NativeDepsLoader.loadDep(NativeDepsLoader.java:184)<br /> at ai.rapids.cudf.NativeDepsLoader.loadDep(NativeDepsLoader.java:198)<br /> at ai.rapids.cudf.NativeDepsLoader.lambda$loadNativeDeps$1(NativeDepsLoader.java:161)<br /> ... 5 more</pre>
<p>Make sure CUDA Toolkit 11.0 is installed and its library path is set in spark.executor.extraLibraryPath of the configuration file.</p><p><b>2. ai.rapids.cudf.CudaException: CUDA driver version is insufficient for CUDA runtime version <br /></b></p><p>The error and stack trace below show up in the Spark executor logs when launching spark-shell:<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">ai.rapids.cudf.CudaException: CUDA driver version is insufficient for CUDA runtime version<br /> at ai.rapids.cudf.Cuda.setDevice(Native Method)<br /> at com.nvidia.spark.rapids.GpuDeviceManager$.setGpuDeviceAndAcquire(GpuDeviceManager.scala:95)<br /> at com.nvidia.spark.rapids.GpuDeviceManager$.$anonfun$initializeGpu$1(GpuDeviceManager.scala:122)<br /> at scala.runtime.java8.JFunction1$mcII$sp.apply(JFunction1$mcII$sp.java:23)<br /> at scala.Option.map(Option.scala:230)<br /> at com.nvidia.spark.rapids.GpuDeviceManager$.initializeGpu(GpuDeviceManager.scala:122)<br /> at com.nvidia.spark.rapids.GpuDeviceManager$.initializeGpuAndMemory(GpuDeviceManager.scala:130)<br /> at com.nvidia.spark.rapids.RapidsExecutorPlugin.init(Plugin.scala:168)<br /> at org.apache.spark.internal.plugin.ExecutorPluginContainer.$anonfun$executorPlugins$1(PluginContainer.scala:111)<br /> at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)<br /> at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)<br /> at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)<br /> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)<br /> at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)<br /> at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)<br /> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)<br /> at org.apache.spark.internal.plugin.ExecutorPluginContainer.<init>(PluginContainer.scala:99)<br /> at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:164)<br /> at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:152)<br /> at org.apache.spark.executor.Executor.$anonfun$plugins$1(Executor.scala:220)<br /> at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:221)<br /> at org.apache.spark.executor.Executor.<init>(Executor.scala:220)<br /> at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:168)<br /> at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)<br /> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)<br /> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)<br /> at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)<br /> at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)<br /> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)<br /> at java.util.concurrent.FutureTask.run(FutureTask.java:266)<br /> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)<br /> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)<br /> at java.lang.Thread.run(Thread.java:748)</pre>
<p>Make sure the cuda-compat-11-0 package is installed and its location is set correctly in spark.executor.extraLibraryPath of the configuration file.</p><p> </p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-45991958539285066032021-04-12T16:15:00.000-07:002021-04-12T16:15:00.203-07:00How to use NVIDIA Nsight Systems to profile a Spark on K8s job with Rapids Accelerator <h1 style="text-align: left;">Goal:</h1><p>This article explains how to use NVIDIA Nsight Systems to profile a Spark on K8s job with Rapids Accelerator.<br /></p><p>This is a follow-up blog after <a href="http://www.openkb.info/2021/04/how-to-use-nvidia-nsight-systems-to.html" rel="nofollow" target="_blank">How to use NVIDIA Nsight Systems to profile a Spark job on Rapids Accelerator</a>. <br /></p><h1 style="text-align: left;"><span><a name='more'></a></span>Env:</h1><p style="text-align: left;">Spark 3.1.1 (on Kubernetes)<br /></p><p style="text-align: left;">RAPIDS Accelerator for Apache Spark 0.5 snapshot</p><p style="text-align: left;">cuDF jar 0.19 snapshot<br /></p><h1 style="text-align: left;">Solution:</h1><p>Please read the <a href="http://www.openkb.info/2021/04/how-to-use-nvidia-nsight-systems-to.html" rel="nofollow" target="_blank">How to use NVIDIA Nsight Systems to profile a Spark job on Rapids Accelerator</a> blog and also the <a href="https://github.com/NVIDIA/spark-rapids/blob/branch-0.5/docs/get-started/getting-started-kubernetes.md" rel="nofollow" target="_blank">Getting Started with RAPIDS and Kubernetes</a> doc first. <br /></p><p>This blog will mainly focus on the differences for a Spark on Kubernetes job.<br /></p><h3 style="text-align: left;">1. Spark side<br /></h3><p>As we know, "nsys profile" should target a Spark Executor process. So the key is to find out how Spark starts an Executor in a Kubernetes cluster. <br /></p><p>Basically it is handled in <a href="https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh" rel="nofollow" target="_blank">resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh</a>: <br /></p><pre class="brush:bash; toolbar: false; auto-links: false"> executor)<br /> shift 1<br /> CMD=(<br /> ${JAVA_HOME}/bin/java<br /> "${SPARK_EXECUTOR_JAVA_OPTS[@]}"<br /> -Xms$SPARK_EXECUTOR_MEMORY<br /> -Xmx$SPARK_EXECUTOR_MEMORY<br /> -cp "$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH"<br /> org.apache.spark.executor.CoarseGrainedExecutorBackend<br /> --driver-url $SPARK_DRIVER_URL<br /> --executor-id $SPARK_EXECUTOR_ID<br /> --cores $SPARK_EXECUTOR_CORES<br /> --app-id $SPARK_APPLICATION_ID<br /> --hostname $SPARK_EXECUTOR_POD_IP<br /> --resourceProfileId $SPARK_RESOURCE_PROFILE_ID<br /> )<br />...<br /><br /># Execute the container CMD under tini for better hygiene<br />exec /usr/bin/tini -s -- "${CMD[@]}" </pre>
<p>So we just need to change the CMD part to add "nsys profile" in front of the java command. </p><p>Such as:</p>
<pre class="brush:bash; toolbar: false; auto-links: false;highlight: 4"> executor)<br /> shift 1<br /> CMD=(<br /> nsys profile -o /some_persistent_storage/test_%h_%p.qdrep<br /> ${JAVA_HOME}/bin/java<br /> "${SPARK_EXECUTOR_JAVA_OPTS[@]}"<br /> -Xms$SPARK_EXECUTOR_MEMORY<br /> -Xmx$SPARK_EXECUTOR_MEMORY<br /> -cp "$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH"<br /> org.apache.spark.executor.CoarseGrainedExecutorBackend<br /> --driver-url $SPARK_DRIVER_URL<br /> --executor-id $SPARK_EXECUTOR_ID<br /> --cores $SPARK_EXECUTOR_CORES<br /> --app-id $SPARK_APPLICATION_ID<br /> --hostname $SPARK_EXECUTOR_POD_IP<br /> --resourceProfileId $SPARK_RESOURCE_PROFILE_ID<br /> )<br /> ;;</pre>
<p>Here we point the output file to a persistent storage path which can be mounted in the docker container. </p><p>"%h" means hostname and "%p" means PID. For more details please refer to the <a href="https://docs.nvidia.com/nsight-systems/UserGuide/index.html" rel="nofollow" target="_blank">Nsight Systems user guide</a>.<br /></p><h3 style="text-align: left;">2. Docker image side</h3><p>If you are using the <a href="https://github.com/NVIDIA/spark-rapids/blob/branch-0.5/docs/get-started/Dockerfile.cuda" rel="nofollow" target="_blank">Dockerfile.cuda</a>, it actually uses <a href="https://hub.docker.com/layers/nvidia/cuda/10.1-devel-ubuntu18.04/images/sha256-224aaba2c72e749f24da167d18d83908ad89c9d2af2ae89100a9858b51a71c37" rel="nofollow" target="_blank">nvidia/cuda:10.1-devel-ubuntu18.04</a> as the base image. However, this base image does not have Nsight Systems installed.</p><p>You need to either use your own base image which has Nsight Systems installed, or add the installation steps to Dockerfile.cuda.</p><p>Below is one example to install Nsight Systems from the CUDA 11.0.3 repo:</p>
<pre class="brush:bash; toolbar: false; auto-links: false"># Install Nsight-systems<br />RUN apt install -y wget && wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin<br />RUN mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600<br />RUN wget https://developer.download.nvidia.com/compute/cuda/11.0.3/local_installers/cuda-repo-ubuntu1804-11-0-local_11.0.3-450.51.06-1_amd64.deb<br />RUN dpkg --install cuda-repo-ubuntu1804-11-0-local_11.0.3-450.51.06-1_amd64.deb<br />RUN apt-key add /var/cuda-repo-ubuntu1804-11-0-local/7fa2af80.pub<br />RUN apt-get update && apt-get install -y nsight-systems-2020.4.3</pre>
<h3 style="text-align: left;">3. Build&upload the Docker Image and Run the Spark on K8s Job</h3><p>The rest steps are the same as <a href="https://github.com/NVIDIA/spark-rapids/blob/branch-0.5/docs/get-started/getting-started-kubernetes.md" rel="nofollow" target="_blank">Getting Started with RAPIDS and Kubernetes</a> doc.</p><p> </p><p> </p><p> <br /></p><p><br /></p><p> </p><p> </p><p> </p><p> </p><p> </p><p> </p><p> </p><p> </p><p>=== <br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-64180463022060354292021-04-08T21:52:00.014-07:002021-04-11T17:47:26.419-07:00How to use NVIDIA Nsight Systems to profile a Spark job on Rapids Accelerator<h1 style="text-align: left;">Goal:</h1><p style="text-align: left;">This article explains how to use NVIDIA Nsight Systems to profile a Spark job on Rapids Accelerator.<span></span></p><a name='more'></a><p style="text-align: left;"></p><h1 style="text-align: left;">Env:</h1><p style="text-align: left;">Spark 3.1.1 (Standalone Cluster)<br /></p><p style="text-align: left;">RAPIDS Accelerator for Apache Spark 0.5 snapshot</p><p style="text-align: left;">cuDF jar 0.19 snapshot<br /></p><h1 style="text-align: left;">Solution:</h1><h3 style="text-align: left;"><b>1. Build the cuDF JARs with USE_NVTX option on.</b><br /></h3><p style="text-align: left;">Follow Doc: <a href="https://nvidia.github.io/spark-rapids/docs/dev/nvtx_profiling.html" rel="nofollow" target="_blank">https://nvidia.github.io/spark-rapids/docs/dev/nvtx_profiling.html </a><br /></p><p style="text-align: left;"><b>Note: Starting from cuDF 0.19, the USE_NVTX(</b><b>NVIDIA Tools Extension) is on by default as per this <a href="https://github.com/rapidsai/cudf/pull/7761" rel="nofollow" target="_blank">PR</a> so we do not need to build jar any more. It means in the future cuDF release(>=0.19) we can skip this step.</b><br /></p><p style="text-align: left;">So here in this test, I just used the latest <a href="https://oss.sonatype.org/content/repositories/snapshots/ai/rapids/cudf/0.19-SNAPSHOT/" rel="nofollow" target="_blank">cuDF 0.19 snapshot jar</a> and Rapids Accelerator 0..5 snapshot jar(built from <a href="https://github.com/NVIDIA/spark-rapids" rel="nofollow" target="_blank">source code</a> manually) together. <b>Note: these 2 jars are not stable releases.</b><br /></p><h3 style="text-align: left;">2. Download nsight systems on your client machine<br /></h3><p style="text-align: left;"><a href="https://developer.nvidia.com/nsight-systems" rel="nofollow" target="_blank">https://developer.nvidia.com/nsight-systems</a><br /></p><p style="text-align: left;">Here I downloaded and installed on Mac where I will view the metrics later.<br /></p><h3 style="text-align: left;">3. Make sure target machine has nsys installed and meet requirements.<br /></h3><div style="text-align: left;">Please refer to <a href="https://docs.nvidia.com/nsight-systems/InstallationGuide/index.html" rel="nofollow" target="_blank">https://docs.nvidia.com/nsight-systems/InstallationGuide/index.html</a> for details. </div><div style="text-align: left;">Especially make sure the "Requirement" is met. Such as:</div><div style="text-align: left;">Use of Linux Perf: To collect thread scheduling data and IP (instruction pointer) samples, the Perf paranoid level on the target system must be 2 or less. </div><div style="text-align: left;">You can use "nsys status -e" to check the current status:</div>
<pre class="brush:bash; toolbar: false; auto-links: false">$ nsys status -e<br /><br />Sampling Environment Check<br />Linux Kernel Paranoid Level = 3: Fail<br />Linux Distribution = Ubuntu<br />Linux Kernel Version = 5.4.0-70: OK<br />Linux perf_event_open syscall available: Fail<br />Sampling trigger event available: Fail<br />Intel(c) Last Branch Record support: Not Available<br />Sampling Environment: Fail<br /><br />See the product documentation for more information.</pre>
<div style="text-align: left;">If the Kernel Paranoid Level check failed, then we can use below commands to check and enable it:<br /></div>
<pre class="brush:xml; toolbar: false; auto-links: false;highlight: [1,3,4,6,7]">$ cat /proc/sys/kernel/perf_event_paranoid<br />3<br />$ sudo sh -c 'echo 2 >/proc/sys/kernel/perf_event_paranoid'<br />$ cat /proc/sys/kernel/perf_event_paranoid<br />2<br />$ sudo sh -c 'echo kernel.perf_event_paranoid=2 > /etc/sysctl.d/local.conf'<br />$ nsys status -e<br /><br />Sampling Environment Check<br />Linux Kernel Paranoid Level = 2: OK<br />Linux Distribution = Ubuntu<br />Linux Kernel Version = 5.4.0-70: OK<br />Linux perf_event_open syscall available: OK<br />Sampling trigger event available: OK<br />Intel(c) Last Branch Record support: Available<br />Sampling Environment: OK</pre>
<div style="text-align: left;"><b>Note: there are other requirements like kernel version, glibc version, supported CUDA version. Please refer to above documentation. </b><br /></div><h3 style="text-align: left;">4. Add extra java options in both driver and executor.<br /></h3>
<pre class="brush:bash; toolbar: false; auto-links: false" style="text-align: left;">--conf spark.driver.extraJavaOptions=-Dai.rapids.cudf.nvtx.enabled=true<br />--conf spark.executor.extraJavaOptions=-Dai.rapids.cudf.nvtx.enabled=true</pre>
<p style="text-align: left;">You can consider putting those into spark-defaults.conf or specifying them each time for spark-shell/spark-sql/etc.</p><p style="text-align: left;">If you have other extraJavaOption(s), do not forget to append them.</p><h3 style="text-align: left;">5. Start spark-shell using "nsys profile"<br /></h3>
<pre class="brush:bash; toolbar: false; auto-links: false" style="text-align: left;">nsys profile bash -c " \<br />CUDA_VISIBLE_DEVICES=0 ${SPARK_HOME}/sbin/start-slave.sh $master_url & \<br />$SPARK_HOME/bin/spark-shell; \<br />${SPARK_HOME}/sbin/stop-slave.sh"</pre>
<h3 style="text-align: left;">6. Run some query</h3><p style="text-align: left;">When quitting spark-shell, it will generate a *.qdrep file in current directory.</p><p style="text-align: left;">For example:</p>
<pre class="brush:bash; toolbar: false; auto-links: false" style="text-align: left;">scala> :quit<br />:quit<br />stopping org.apache.spark.deploy.history.HistoryServer<br />stopping org.apache.spark.deploy.worker.Worker<br />stopping org.apache.spark.deploy.master.Master<br />Processing events...<br />Capturing symbol files...<br />Saving temporary "/tmp/nsys-report-58cb-6240-1a5f-e6f7.qdstrm" file to disk...<br />Creating final output files...<br /><br />Processing [==============================================================100%]<br />Saved report file to "/tmp/nsys-report-58cb-6240-1a5f-e6f7.qdrep"<br />Report file moved to "/home/xxx/report1.qdrep"</pre>
<h3 style="text-align: left;">7. Use "nsys stat" command on the target machine to check the report</h3><div style="text-align: left;">You can choose to use "nsys stat" command on the target machine to check the report or use following GUI option.</div><div style="text-align: left;">"nsys stat" can show the CUDA API summary, GPU Kernel summary, GPU Memory time summary, NVTX push-pop range summary, etc:</div>
<pre class="brush:bash; toolbar: false; auto-links: false">$ nsys stats report8.qdrep<br />Using report8.sqlite file for stats and reports.<br />Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/cudaapisum report8.sqlite] to console...<br /><br /> Time(%) Total Time (ns) Num Calls Average Minimum Maximum Name<br /> ------- --------------- --------- --------------- ------------- ------------- --------------------------<br /> 66.8 152,391,401,099 192,250 792,673.1 679 18,448,141 cudaStreamSynchronize_ptsz<br /> 31.2 71,169,590,822 114,830 619,782.2 195 9,667,534 cudaMemcpyAsync_ptsz<br /> 0.7 1,565,365,626 7 223,623,660.9 3,454 1,565,334,856 cudaFree<br /> 0.5 1,117,531,408 65,671 17,017.1 3,496 131,888 cudaLaunchKernel_ptsz<br />...<br />Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/gpukernsum report8.sqlite] to console...<br /><br /> Time(%) Total Time (ns) Instances Average Minimum Maximum Name<br /> ------- --------------- --------- ------------ ---------- ---------- ----------------------------------------------------------------------------------------------------<br /> 37.5 83,645,234,788 14,576 5,738,558.9 5,554,755 6,897,949 void (anonymous namespace)::scatter_kernel<int, (anonymous namespace)::boolean_mask_filter<false>, …<br /> 28.2 62,805,133,776 7,288 8,617,608.9 8,459,988 8,955,404 void cudf::binops::jit::kernel_v_v<bool, int, int, cudf::binops::jit::Greater>(int, bool*, int*, in…<br /> 18.8 41,854,794,778 7,288 5,742,974.0 5,634,787 5,984,609 void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…<br /> 8.7 19,342,375,816 7,289 2,653,639.2 2,575,613 2,869,850 void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…<br />...<br />Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/gpumemtimesum report8.sqlite] to console...<br /><br /> Time(%) Total Time (ns) Operations Average Minimum Maximum Operation<br /> ------- --------------- ---------- -------- ------- ------- ------------------<br /> 47.8 78,733,508 82,908 949.6 608 610,013 [CUDA memcpy DtoH]<br /> 35.7 58,761,119 80,174 732.9 640 13,792 [CUDA memset]<br /> 16.4 26,979,351 31,900 845.7 671 662,844 [CUDA memcpy HtoD]<br /> 0.1 136,064 8 17,008.0 1,632 32,640 [CUDA memcpy DtoD]<br />Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/gpumemsizesum report8.sqlite] to console...<br /><br /> Total Operations Average Minimum Maximum Operation<br /> ---------- ---------- --------- ------- --------- ------------------<br /> 37,577.836 31,900 1.178 0.004 7,813.324 [CUDA memcpy HtoD]<br /> 32,226.750 8 4,028.344 244.187 7,812.500 [CUDA memcpy DtoD]<br /> 24,145.266 82,908 0.291 0.001 7,812.500 [CUDA memcpy DtoH]<br /> 16,326.898 80,174 0.204 0.001 7,812.500 [CUDA memset]<br /> ...<br /> Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/nvtxppsum report8.sqlite] to console...<br /><br /> Time(%) Total Time (ns) Instances Average Minimum Maximum Range<br /> ------- --------------- --------- --------------- ------------- ------------- -------------------------------<br /> 41.0 209,116,856,965 10,002 20,907,504.2 117,938 23,476,086 libcudf:apply_boolean_mask<br /> 41.0 209,039,719,367 10,002 20,899,792.0 116,416 23,467,375 libcudf:copy_if<br /> 16.7 85,273,533,436 10,000 8,527,353.3 8,431,597 13,684,934 libcudf:cross_join<br />...</pre>
<h3 style="text-align: left;">8. Copy the *.qdrep to the client machine where nsight systems is installed.<br /></h3><p style="text-align: left;">Open the *.qdrep using nsight systems. <br /></p><p style="text-align: left;">My query in above #5 is a cross-join which takes around 6mins. <br /></p><p style="text-align: left;">Normally I will firstly "Analysis Summary" tab to get the PID of <b>Spark Executor</b>(24897) which would be my focus. <br /></p><p style="text-align: left;"></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgArIcp9ZDe0Dukg7Xa5ugL47gXFIz3aksRTeouo0nnGeoHywCz3sNnEsQ85yg2nJlRI462YyhN8Uu6DSjJeIoviDh6RNT1yTfPLOhKHhgd5OhMrNuwHw3Of_QEcgQz7EI5n5kg1lHWanA/s406/Screen+Shot+2021-04-08+at+9.37.06+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="332" data-original-width="406" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgArIcp9ZDe0Dukg7Xa5ugL47gXFIz3aksRTeouo0nnGeoHywCz3sNnEsQ85yg2nJlRI462YyhN8Uu6DSjJeIoviDh6RNT1yTfPLOhKHhgd5OhMrNuwHw3Of_QEcgQz7EI5n5kg1lHWanA/s320/Screen+Shot+2021-04-08+at+9.37.06+PM.png" width="320" /></a></div><p><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihkUoObwCYcvWSJCK3z1s01G7f6ZP_fm0maqk-PNn88zRU0ElV1SQYQiQ7k96138nqSA-PPQjIwX1LO0ZkzELudoYUjFLvJhBxfdIaDJTU-nqjs3BOvoYtod_MHgqc9G2tqvdGVpCZc6k/s1704/Screen+Shot+2021-04-09+at+4.19.19+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="212" data-original-width="1704" height="80" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihkUoObwCYcvWSJCK3z1s01G7f6ZP_fm0maqk-PNn88zRU0ElV1SQYQiQ7k96138nqSA-PPQjIwX1LO0ZkzELudoYUjFLvJhBxfdIaDJTU-nqjs3BOvoYtod_MHgqc9G2tqvdGVpCZc6k/w640-h80/Screen+Shot+2021-04-09+at+4.19.19+PM.png" width="640" /></a></div><span></span><br /><p></p><p>Then move to "Timeline view" tab and identify Spark Executor process:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTXsE3eF6MCV_oRrtXEBnlt5UpAnPJ2BXgVzQFNcYhGuOOCjjE7fhEUtwbtc59RHQrESRW-KNYKmR4ijKgYEagPi50SBmqQs20qru4s_s-Cul0MnsTHr_FrLIxZ8LHjlT1kxQk_cutIX0/s2192/Screen+Shot+2021-04-09+at+4.22.08+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="526" data-original-width="2192" height="154" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTXsE3eF6MCV_oRrtXEBnlt5UpAnPJ2BXgVzQFNcYhGuOOCjjE7fhEUtwbtc59RHQrESRW-KNYKmR4ijKgYEagPi50SBmqQs20qru4s_s-Cul0MnsTHr_FrLIxZ8LHjlT1kxQk_cutIX0/w640-h154/Screen+Shot+2021-04-09+at+4.22.08+PM.png" width="640" /></a></div><p></p><p style="text-align: left;"></p><p style="text-align: left;"></p>As we can see the CUDA HW(GPU) is showing busy(blue) for most of the time. 
<br /><p style="text-align: left;"></p><p style="text-align: left;">If we hover mouse on it, it can show you the CUDA Kernel running% at that time:</p><p style="text-align: left;"></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjt3ocn8qW3i9mZ-U2fyFPJe8BpXZ-L2djKhH24lW76VWMinTPy0vyYtCKdOBjrN5-9LLJQ3NCeyJJMfFJHY3cuBZM0D3nzTb7bXQzAT19laI4WUa3qJTEhjKrnRQlwXNUQajICvCHCgyI/s616/Screen+Shot+2021-04-08+at+9.41.14+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="320" data-original-width="616" height="332" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjt3ocn8qW3i9mZ-U2fyFPJe8BpXZ-L2djKhH24lW76VWMinTPy0vyYtCKdOBjrN5-9LLJQ3NCeyJJMfFJHY3cuBZM0D3nzTb7bXQzAT19laI4WUa3qJTEhjKrnRQlwXNUQajICvCHCgyI/w640-h332/Screen+Shot+2021-04-08+at+9.41.14+PM.png" width="640" /></a></div><p>We can dig further into all threads of Spark Executor process, and we can identify the Executor Task 1 thread keeps calling CUDA API during that time. </p><p>And most importantly, here the "libcudf" and "NVTX(libcudf)" rows will show up. <b> </b></p><p><b>Note:They will NOT show up if "NVTX" is not switched on when building cuDF jar.</b><br /></p><p>Here "libcudf" row shows "cross_join" which match our query type.<br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8TeKybhEpZIDsxHolFT1XZexSsONjiPLRkkMFG4ikxLD-rHBujM7hmoD2-CUmnNwxYoHo1WTUeMezTH3l3zcCDaU6GSBzAHSnMs37-UYhnvYO_z_k_uu1HnAPhyphenhyphenhsRthEXMBpEcrVwtk/s2302/Screen+Shot+2021-04-09+at+4.24.42+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="508" data-original-width="2302" height="142" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8TeKybhEpZIDsxHolFT1XZexSsONjiPLRkkMFG4ikxLD-rHBujM7hmoD2-CUmnNwxYoHo1WTUeMezTH3l3zcCDaU6GSBzAHSnMs37-UYhnvYO_z_k_uu1HnAPhyphenhyphenhsRthEXMBpEcrVwtk/w640-h142/Screen+Shot+2021-04-09+at+4.24.42+PM.png" width="640" /></a></div><p style="text-align: left;"> "NVTX(libcudf)" row shows similar things under "CUDA HW" section:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhosJ7N7knfJs15-13-SV71zPDuESFP9safCHAHfkG9yAEWN8VidbH2V2VLC4XJUwVRMol_sXZRIgEmObgVBxlWH8f8xpRPNPRMNcTxWmnkhPD6NZVKUBXsOUQHr2kocQ7JoS5et6ru9DY/s2230/Screen+Shot+2021-04-09+at+4.32.14+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="602" data-original-width="2230" height="173" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhosJ7N7knfJs15-13-SV71zPDuESFP9safCHAHfkG9yAEWN8VidbH2V2VLC4XJUwVRMol_sXZRIgEmObgVBxlWH8f8xpRPNPRMNcTxWmnkhPD6NZVKUBXsOUQHr2kocQ7JoS5et6ru9DY/w640-h173/Screen+Shot+2021-04-09+at+4.32.14+PM.png" width="640" /></a></div><br /><h2 style="text-align: left;">Tips: <br /></h2><h3 style="text-align: left;"><b>1. 
One useful tip is to pin the related rows and compare:</b></h3><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9VlaDIuVNonglA-rrR3ocBKokEDcyEA4ovMVBcpkFiZ2NsMH7Tk0aet2kTbruGsU4WJVhxrXCwzfAo4V7AJrWxsHXHrH3VYglBIwgrY2Q86-afjyfcYAe9W9VAyBLoobH1ofLM43ZWA0/s810/Screen+Shot+2021-04-08+at+9.46.52+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="546" data-original-width="810" height="432" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9VlaDIuVNonglA-rrR3ocBKokEDcyEA4ovMVBcpkFiZ2NsMH7Tk0aet2kTbruGsU4WJVhxrXCwzfAo4V7AJrWxsHXHrH3VYglBIwgrY2Q86-afjyfcYAe9W9VAyBLoobH1ofLM43ZWA0/w640-h432/Screen+Shot+2021-04-08+at+9.46.52+PM.png" width="640" /></a></div><p style="text-align: left;">After those rows got pinned, if you scroll down/up, they will always be on top or at bottom, such as:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgaJBHWIkKtyK3tlgyd2lbIw-xIi3q3lQJK3_kVwksxMPkPNzzYL_rQytOY803Q7Mw2mgy_64BXhrMkF4jwVrCrUqBY4iEU13_zMiR3XP6PrR5uXHVTwZGbIZ_rfkOyorGFMJSVW1b3cn0/s1136/Screen+Shot+2021-04-08+at+9.47.53+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="436" data-original-width="1136" height="246" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgaJBHWIkKtyK3tlgyd2lbIw-xIi3q3lQJK3_kVwksxMPkPNzzYL_rQytOY803Q7Mw2mgy_64BXhrMkF4jwVrCrUqBY4iEU13_zMiR3XP6PrR5uXHVTwZGbIZ_rfkOyorGFMJSVW1b3cn0/w640-h246/Screen+Shot+2021-04-08+at+9.47.53+PM.png" width="640" /></a></div><h3 style="text-align: left;"><b>2. Change the time from "session time" to "global time"</b></h3><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZHvzwk-g4ymTeHIPn5vqPaQcvZzdmDnkPSPDsCGb8Tk4MqaeIgd9ILwSf7dOdbeNpvlLEXcpUqkPm9HrSFOvGzLlb6POk4qgz7Nm_bmz0t60lacphM6_gq40o7F5MZRA6uL2Me9QHlVo/s894/Screen+Shot+2021-04-08+at+9.49.24+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="320" data-original-width="894" height="230" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZHvzwk-g4ymTeHIPn5vqPaQcvZzdmDnkPSPDsCGb8Tk4MqaeIgd9ILwSf7dOdbeNpvlLEXcpUqkPm9HrSFOvGzLlb6POk4qgz7Nm_bmz0t60lacphM6_gq40o7F5MZRA6uL2Me9QHlVo/w640-h230/Screen+Shot+2021-04-08+at+9.49.24+PM.png" width="640" /></a></div><p style="text-align: left;">After that, it will show machine time which can help you match the real world time.<br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiE5p_LTT60Gkll9qd6MIvhQukYM8dxRYSigdUI7oXk5_vRT36xfeigl9Bv_u1ynVvwiLfSH3iBKvyYLOrq7hwzYtmobhTfHQqzMEEwuvQvuLTIrXrdLfuHx0t3rvJi3JJppSKwdKuBhKw/s1240/Screen+Shot+2021-04-08+at+9.51.00+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="226" data-original-width="1240" height="116" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiE5p_LTT60Gkll9qd6MIvhQukYM8dxRYSigdUI7oXk5_vRT36xfeigl9Bv_u1ynVvwiLfSH3iBKvyYLOrq7hwzYtmobhTfHQqzMEEwuvQvuLTIrXrdLfuHx0t3rvJi3JJppSKwdKuBhKw/w640-h116/Screen+Shot+2021-04-08+at+9.51.00+PM.png" width="640" /></a></div><h3 style="text-align: left;"><b>3. How to start/stop collection manually<br /></b></h3>
<div style="text-align: left;">We can firstly "<i><b>nsys launch</b></i>" the Spark worker/slave, and then use "<i><b>nsys start</b></i>" and "<b><i>nsys stop</i></b>" to control the collection window manually.</div><div style="text-align: left;"><b>a. Stop spark slaves manually</b><br /></div>
<pre class="brush:bash; toolbar: false; auto-links: false">${SPARK_HOME}/sbin/stop-slave.sh</pre>
<div style="text-align: left;"><b>b. Start spark slaves using "nsys launch"</b>
<pre class="brush:bash; toolbar: false; auto-links: false">nsys launch bash -c "CUDA_VISIBLE_DEVICES=0 $SPARK_HOME/sbin/start-slave.sh spark://$HOSTNAME:7077 &"</pre><b>
c. Open another terminal session, run "nsys start"</b>
<pre class="brush:bash; toolbar: false; auto-links: false">$ nsys start<br />$ nsys sessions list<br /> ID TIME STATE LAUNCH NAME<br /> 1028142 00:51 Collecting 1 [default]</pre><b>
d. Run a Spark job using either spark-shell or spark-submit or something else.<br /></b></div><div style="text-align: left;"><b>e. Run "nsys stop" after the Spark job completes <br /></b></div>
<pre class="brush:bash; toolbar: false; auto-links: false">$ nsys stop<br />Processing events...<br />Capturing symbol files...<br />Saving temporary "/tmp/nsys-report-4026-c2c5-8a18-5372.qdstrm" file to disk...<br />Creating final output files...<br /><br />Processing [==============================================================100%]<br />Saved report file to "/tmp/nsys-report-4026-c2c5-8a18-5372.qdrep"<br />Report file moved to "/home/xxx/report10.qdrep"<br />stop executed</pre>
<div style="text-align: left;"><b>f. You can start&stop more collection windows.</b></div><div style="text-align: left;"><b>g. Stop Spark-worker in the end.<br /></b></div><h1 style="text-align: left;">References:</h1><ul style="text-align: left;"><li><a href="https://nvidia.github.io/spark-rapids/docs/dev/nvtx_profiling.html" rel="nofollow" target="_blank">https://nvidia.github.io/spark-rapids/docs/dev/nvtx_profiling.html </a></li><li><a href="https://docs.nvidia.com/nsight-systems/InstallationGuide/index.html" rel="nofollow" target="_blank">https://docs.nvidia.com/nsight-systems/InstallationGuide/index.html</a></li><li><a href="https://docs.nvidia.com/nsight-systems/UserGuide/index.html" rel="nofollow" target="_blank">https://docs.nvidia.com/nsight-systems/UserGuide/index.html </a></li><li><a href="https://www.youtube.com/watch?v=kKANP0kL_hk" rel="nofollow" target="_blank">Youtube: Profiling GPU Applications with Nsight Systems </a><br /></li></ul><p style="text-align: left;"> </p><p style="text-align: left;"> </p><p style="text-align: left;"> </p><p style="text-align: left;"> </p><p style="text-align: left;"> </p><p style="text-align: left;"> <br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-72861893119155988922021-04-04T13:12:00.007-07:002021-04-04T13:16:52.567-07:00How to enable GpuKryoRegistrator on RAPIDS Accelerator for Spark<h1 style="text-align: left;">Goal:</h1><p>This article shares the steps to enable GpuKryoRegistrator on RAPIDS Accelerator for Spark.<span></span></p><a name='more'></a><p></p><h1 style="text-align: left;">Env:</h1><p>Spark 3.1.1</p><p>RAPIDS Accelerator for Apache Spark 0.4.1</p><h1 style="text-align: left;">Solution:</h1><p>As mentioned in <a href="https://spark.apache.org/docs/latest/tuning.html" rel="nofollow" target="_blank">Spark Tuning Doc</a>:</p><ul><li><a href="https://docs.oracle.com/javase/8/docs/api/java/io/Serializable.html" rel="nofollow" target="_blank">Java serialization</a>:
By default, Spark serializes objects using Java’s <code class="language-plaintext highlighter-rouge">ObjectOutputStream</code> framework, and can work
with any class you create that implements
<a href="https://docs.oracle.com/javase/8/docs/api/java/io/Serializable.html" rel="nofollow" target="_blank"><code class="language-plaintext highlighter-rouge">java.io.Serializable</code></a>.
You can also control the performance of your serialization more closely by extending
<a href="https://docs.oracle.com/javase/8/docs/api/java/io/Externalizable.html" rel="nofollow" target="_blank"><code class="language-plaintext highlighter-rouge">java.io.Externalizable</code></a>.
Java serialization is flexible but often quite slow, and leads to large
serialized formats for many classes.</li><li><a href="https://github.com/EsotericSoftware/kryo" rel="nofollow" target="_blank">Kryo serialization</a>: Spark can also use
the Kryo library (version 4) to serialize objects more quickly. Kryo is significantly
faster and more compact than Java serialization (often as much as 10x), but does not support all
<code class="language-plaintext highlighter-rouge">Serializable</code> types and requires you to <i>register</i> the classes you’ll use in the program in advance
for best performance.</li></ul><p>Rapids Accelerator also has a class named <a href="https://github.com/NVIDIA/spark-rapids/blob/branch-0.5/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuKryoRegistrator.scala" rel="nofollow" target="_blank">com.nvidia.spark.rapids.GpuKryoRegistrator</a> which uses Kryo to register the classes below<span class="pl-en">, defined in <a href="https://github.com/NVIDIA/spark-rapids/blob/branch-0.5/sql-plugin/src/main/scala/org/apache/spark/sql/rapids/execution/GpuBroadcastExchangeExec.scala" rel="nofollow" target="_blank">org.apache.spark.sql.rapids.execution.GpuBroadcastExchangeExec</a>:<br /></span></p><ul style="text-align: left;"><li><span class="pl-en">SerializeConcatHostBuffersDeserializeBatch</span></li><li><span class="pl-en">SerializeBatchDeserializeHostBuffer</span></li></ul><h3 style="text-align: left;">How to enable?</h3><p>Set the below 2 parameters (e.g. in spark-defaults.conf): <br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">spark.serializer org.apache.spark.serializer.KryoSerializer<br />spark.kryo.registrator com.nvidia.spark.rapids.GpuKryoRegistrator</pre>
<h3 style="text-align: left;">Common Issues<br /></h3><p>This is a common issue in <a href="https://github.com/EsotericSoftware/kryo" rel="nofollow" target="_blank">Kryo serialization</a> : Buffer overflow.</p><p>For example, when running Q7 of TPCDS/NDS, it may fail with:</p><pre class="brush:text; toolbar: false; auto-links: false;highlight: 1">Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 636<br /> at com.esotericsoftware.kryo.io.Output.require(Output.java:167)<br /> at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:251)<br /> at com.esotericsoftware.kryo.io.Output.write(Output.java:219)<br /> at java.base/java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1859)<br /> at java.base/java.io.ObjectOutputStream.write(ObjectOutputStream.java:712)<br /> at java.base/java.io.BufferedOutputStream.write(BufferedOutputStream.java:123)<br /> at java.base/java.io.DataOutputStream.write(DataOutputStream.java:107)<br /> at ai.rapids.cudf.JCudfSerialization$DataOutputStreamWriter.copyDataFrom(JCudfSerialization.java:600)<br /> at ai.rapids.cudf.JCudfSerialization$DataWriter.copyDataFrom(JCudfSerialization.java:546)<br /> at ai.rapids.cudf.JCudfSerialization.copySlicedAndPad(JCudfSerialization.java:1104)<br /> at ai.rapids.cudf.JCudfSerialization.copySlicedOffsets(JCudfSerialization.java:1332)<br /> at ai.rapids.cudf.JCudfSerialization.writeSliced(JCudfSerialization.java:1464)<br /> at ai.rapids.cudf.JCudfSerialization.writeSliced(JCudfSerialization.java:1517)<br /> at ai.rapids.cudf.JCudfSerialization.writeToStream(JCudfSerialization.java:1567)<br /> at org.apache.spark.sql.rapids.execution.SerializeBatchDeserializeHostBuffer.writeObject(GpuBroadcastExchangeExec.scala:153)<br /> at jdk.internal.reflect.GeneratedMethodAccessor91.invoke(Unknown Source)<br /> at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)<br /> at java.base/java.lang.reflect.Method.invoke(Method.java:566)<br /> at java.base/java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1145)<br /> at java.base/java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1497)<br /> at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1433)<br /> at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179)<br /> at java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349)<br /> at com.esotericsoftware.kryo.serializers.JavaSerializer.write(JavaSerializer.java:51)<br /> ... 9 more</pre>
<p>The fix is to increase <b><i>spark.kryoserializer.buffer.max</i></b> from the default 64m to a larger value, say 512m:<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">spark.kryoserializer.buffer.max 512m</pre>
OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-47499890864872518752021-04-02T17:21:00.000-07:002021-04-02T17:21:01.581-07:00 How to install a Kubernetes Cluster with NVIDIA GPU on AWS using DeepOps<h1 style="text-align: left;">Goal:</h1><p style="text-align: left;">This article shares a step-by-step guide on how to install a Kubernetes Cluster with NVIDIA GPU on AWS using <a href="https://github.com/NVIDIA/deepops" rel="nofollow" target="_blank">DeepOps</a>.<br /></p><span><a name='more'></a></span><h1 style="text-align: left;">Env:</h1><p style="text-align: left;">AWS EC2 (G4dn)<br /></p><p style="text-align: left;">Ubuntu 18.04</p><h1 style="text-align: left;">Solution: </h1><p style="text-align: left;">Most of the steps are the same as in the previous blog post: <a href="http://www.openkb.info/2021/03/how-to-install-kubernetes-cluster-with.html" rel="nofollow" target="_blank">How to install a Kubernetes Cluster with NVIDIA GPU on AWS</a>. </p><p style="text-align: left;">That previous blog uses kubeadm to manually install a Kubernetes Cluster by installing the below components: Docker, NVIDIA Container Toolkit (nvidia-docker2) and NVIDIA Device Plugin.<br /></p><p style="text-align: left;">In this blog, we will just use <a href="https://github.com/NVIDIA/deepops" rel="nofollow" target="_blank">DeepOps</a> to do the above work by following <a href="https://github.com/NVIDIA/deepops/tree/master/docs/k8s-cluster" rel="nofollow" target="_blank">https://github.com/NVIDIA/deepops/tree/master/docs/k8s-cluster</a>.<br /></p><p style="text-align: left;">So basically we just need to replace section #4 of the previous blog with the below steps. (So here let me use step 4 as the starting point.)</p><h3 style="text-align: left;">4.1 Download DeepOps repo<br /></h3><p style="text-align: left;">On the EC2 machine:</p>
<pre class="brush:bash; toolbar: false; auto-links: false" style="text-align: left;">git clone https://github.com/NVIDIA/deepops.git<br />cd deepops \<br /> && git checkout tags/20.10</pre>
<h3 style="text-align: left;">4.2 Install ansible and other needed software <br /></h3>
<pre class="brush:bash; toolbar: false; auto-links: false" style="text-align: left;">./scripts/setup.sh</pre>
<h3 style="text-align: left;">4.3 Edit inventory and add nodes to the "KUBERNETES" section<br /></h3>
<pre class="brush:bash; toolbar: false; auto-links: false" style="text-align: left;">vi config/inventory</pre>
<p style="text-align: left;">Note: Since this is a single-node cluster, we need to add the same `hostname` to [kube-master], [etcd] and [kube-node] section.<br /></p><h3 style="text-align: left;">4.4 Verify the configuration </h3>
<pre class="brush:bash; toolbar: false; auto-links: false" style="text-align: left;">ansible all -m raw -a "hostname"</pre>
<p style="text-align: left;"></p><h3 style="text-align: left;">4.5 Install Kubernetes using Ansible and Kubespray.<br /></h3>
<pre class="brush:bash; toolbar: false; auto-links: false" style="text-align: left;">ansible-playbook -l k8s-cluster playbooks/k8s-cluster.yml</pre>
<h3 style="text-align: left;">4.6 Test K8s cluster <br /></h3>
<pre class="brush:bash; toolbar: false; auto-links: false" style="text-align: left;">kubectl get nodes<br />kubectl run gpu-test --rm -t -i --restart=Never --image=nvcr.io/nvidia/cuda:10.1-base-ubuntu18.04 --limits=nvidia.com/gpu=1 nvidia-smi</pre><h1 style="text-align: left;">Issues:</h1><h3 style="text-align: left;">1. There are 2 CoreDNS PODs with 1 POD pending<br /></h3>
<pre class="brush:bash; toolbar: false; auto-links: false"># kubectl get pods -A |grep coredns<br />kube-system coredns-123 0/1 Pending 0 2m40s<br />kube-system coredns-456 1/1 Running 0 64m</pre>
<p style="text-align: left;">If we describe this pending POD, we got to know this is due to <a href="https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/" rel="nofollow" target="_blank">pod affinity/anti-affinity</a> since we have only 1 node in this K8s cluster.<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false"># kubectl describe pod coredns-123 -n kube-system |grep affinity<br /> Warning FailedScheduling 73s default-scheduler 0/1 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules.<br /> Warning FailedScheduling 73s default-scheduler 0/1 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules.</pre>
<p style="text-align: left;">CoreDNS deployment have 2 desired PODs:</p>
<pre class="brush:bash; toolbar: false; auto-links: false"># kubectl describe deployment.apps -n kube-system coredns |grep desired<br />Replicas: 2 desired | 2 updated | 2 total | 1 available | 1 unavailable</pre>
<p style="text-align: left;">One way to resolve this in my first thought is to manually scale down deployment CoreDNS as below:<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">kubectl scale deployments.apps -n kube-system coredns --replicas=1</pre>
<p style="text-align: left;">However it did not work.</p><p style="text-align: left;">The reason is by default, deployment dns-autoscaler is also installed, so the final fix is to:<br /></p><pre class="brush:bash; toolbar: false; auto-links: false">kubectl edit configmap dns-autoscaler --namespace=kube-system</pre>
<p style="text-align: left;">In above configMap, change <i><b>"min":2</b></i> to <i><b>"min":1</b></i>.</p><p style="text-align: left;">After that, if you describe CoreDNS again, it will show it got scaled down to 1:<br /></p><pre class="brush:bash; toolbar: false; auto-links: false"># kubectl describe deployment.apps -n kube-system coredns<br />Replicas: 1 desired | 1 updated | 1 total | 0 available | 1 unavailable<br /> Normal ScalingReplicaSet 21s (x2 over 12m) deployment-controller Scaled down replica set coredns-xxx to 1</pre>
<p style="text-align: left;">Eventually you can delete the pending coreDNS pod if it is still there:<br /></p><pre class="brush:bash; toolbar: false; auto-links: false">kubectl delete pods coredns-123 -n kube-system</pre>
<h3 style="text-align: left;">2. CoreDNS pod crashed with the reason as "OOMKilled"<br /></h3><p style="text-align: left;">If we describe the crashed POD, we can get below reason:</p><pre class="brush:bash; toolbar: false; auto-links: false"> State: Waiting<br /> Reason: CrashLoopBackOff<br /> Last State: Terminated<br /> Reason: OOMKilled<br /> Exit Code: 137<br /> Started: Fri, 02 Apr 2021 21:32:12 +0000<br /> Finished: Fri, 02 Apr 2021 21:32:21 +0000<br /> Ready: False<br /> Restart Count: 3<br /> Limits:<br /> memory: 170Mi<br /> Requests:<br /> cpu: 100m<br /> memory: 70Mi</pre>
<p style="text-align: left;">This is because by default, CoreDNS POD has 170MB memory limit which may be too small for big cluster. Here are some <a href="https://github.com/coredns/coredns/issues/3388" rel="nofollow" target="_blank">reported occurrence</a> as well.<br /></p><p style="text-align: left;">The fix is straightforward, just increase the deployment CoreDNS' resource limit:<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">kubectl set resources deployment.v1.apps/coredns --limits=cpu=1000m,memory=1024Mi</pre>
<h3 style="text-align: left;">3. Spark on Kubernetes Job in client mode keeps failing<br /></h3>The Spark Driver may keep printing below messages:<br /><pre class="brush:bash; toolbar: false; auto-links: false">Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources.</pre><p style="text-align: left;">The Spark Executor may keeps crashing and restarting, but if we use "kubectl logs" to check the Executor POD, we will get the root cause:</p>
<pre class="brush:bash; toolbar: false; auto-links: false;highlight: 1">Caused by: java.net.UnknownHostException: ip-xxx-xxx-xxx-xxx.cluster.local</pre>
<p style="text-align: left;">It means the POD can not resolve the hostname of the node. <br /></p><p style="text-align: left;">If we spin-off a "busybox" POD to test DNS to troubleshoot:</p><p style="text-align: left;"><b>a. Create busybox.yaml with below content:</b><br /></p><pre class="brush:bash; toolbar: false; auto-links: false">apiVersion: v1<br />kind: Pod<br />metadata:<br /> name: busybox<br /> namespace: default<br />spec:<br /> containers:<br /> - image: busybox<br /> command:<br /> - sleep<br /> - "3600"<br /> imagePullPolicy: IfNotPresent<br /> name: busybox<br /> restartPolicy: Always</pre>
<p style="text-align: left;"><b>b. Test the DNS resolution in the sample "busybox" POD: </b><br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">kubectl create -f busybox.yaml<br />kubectl exec -ti busybox -- cat /etc/resolv.conf<br />kubectl exec -ti busybox -- nslookup ip-xxx-xxx-xxx-xxx.cluster.local</pre>
<p style="text-align: left;">We will get to know that both /etc/resolv.conf has default DNS server as "169.254.25.10" which can not resolve the <b><i>hostname -f</i></b> of the machine.</p><p style="text-align: left;">So what is this IP 169.254.25.10?</p><p style="text-align: left;">As we know by default, <a href="https://github.com/kubernetes-sigs/kubespray" rel="nofollow" target="_blank">kubespray</a> enables nodelocal dns cache with default IP as 169.254.25.10.<br /></p><p style="text-align: left;">So it creates a new IP address for this machine if you check "ifconfig": <br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false"># ifconfig -a |grep 169.254.25.10<br /> inet 169.254.25.10 netmask 255.255.255.255 broadcast 169.254.25.10<br /><br /># ps -ef|grep 169.254.25.10|grep -v grep<br />root 111 222 0 xx:xx ? 00:00:45 /node-cache -localip 169.254.25.10 -conf /etc/coredns/Corefile -upstreamsvc coredns<br /><br /># kubectl get pods -A |grep nodelocaldns<br />kube-system nodelocaldns-xxxxx 1/1 Running 0 161m</pre>
<p style="text-align: left;"><br /></p><p style="text-align: left;">Eventually I found out the root cause:</p><p style="text-align: left;">The <b><i>hostname</i></b> and <b><i>hostname -f</i></b> on the EC2 machine return different results:</p><p style="text-align: left;"><b><i>hostname</i></b> returns "ip-xxx-xxx-xxx-xxx.<span style="color: red;">ec2.internal</span>" however <b><i>hostname -f </i></b>returns "ip-xxx-xxx-xxx-xxx<span style="color: #ff00fe;">.cluster.local</span>".</p><p style="text-align: left;">This is because below entry was added by Ansible in /etc/hosts:<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false"># Ansible inventory hosts BEGIN<br />xxx.xxx.xxx.xxx ip-xxx-xxx-xxx-xxx.cluster.local ip-xxx-xxx-xxx-xxx ip-xxx-xxx-xxx-xxx.ec2.internal.cluster.local ip-xxx-xxx-xxx-xxx.ec2.internal</pre>
<p style="text-align: left;">After removing above entries from /etc/hosts, <b><i>hostname</i></b> and <i><b>hostname -f</b></i> are matched now -- "ip-xxx-xxx-xxx-xxx.<span style="color: red;">ec2.internal</span>".</p><p style="text-align: left;">Basically we just let DNS server to resolve the <b><i>hostname</i></b>.</p><p style="text-align: left;">Now the spark on kubernetes job in client mode works fine.<br /></p><p style="text-align: left;"><br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-4349135494914277622021-03-30T22:07:00.007-07:002021-06-14T14:51:12.356-07:00How to install a Kubernetes Cluster with NVIDIA GPU on AWS<h1 style="text-align: left;">Goal:</h1><p>This article shares a step-by-step guide on how to install a Kubernetes Cluster with NVIDIA GPU on AWS. </p><p>It includes spinning up an AWS EC2 instance, installing NVIDIA drivers&cudatoolkit, installing Kubernetes Cluster with GPU support, and eventually ran a Spark+Rapids job to test it.<br /></p><span><a name='more'></a></span><h1 style="text-align: left;">Env:</h1><p>AWS EC2 (G4dn)<br /></p><p>Ubuntu 18.04</p><h1 style="text-align: left;">Solution: <br /></h1><h3 style="text-align: left;">1. Spin up an AWS EC2 instance with NVIDIA GPU<br /></h3><p>Here I choose "Ubuntu Server 18.04 LTS (HVM), SSD Volume Type" base image.<br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKtV5Ajs0u_nJDClO8jXpo3q9KA2LRrcVbSeKKJ7Rz94KbZzRKME1mrpwFr2d6xK174fK9yjZmmHTySbh7ed21geGFZIiu_nbEyosKUF-mApwWuwPvbinfCWG4S4i4bxFxovrk-fxld7w/s1746/Screen+Shot+2021-03-30+at+2.09.08+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="352" data-original-width="1746" height="129" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKtV5Ajs0u_nJDClO8jXpo3q9KA2LRrcVbSeKKJ7Rz94KbZzRKME1mrpwFr2d6xK174fK9yjZmmHTySbh7ed21geGFZIiu_nbEyosKUF-mApwWuwPvbinfCWG4S4i4bxFxovrk-fxld7w/w640-h129/Screen+Shot+2021-03-30+at+2.09.08+PM.png" width="640" /></a></div><p>Choose "Instance Type": g4dn.2xlarge (8vCPU, 32G memory, 1x 225 SSD).</p><p><b>Note: <a href="https://aws.amazon.com/ec2/instance-types/g4/" rel="nofollow" target="_blank">EC2 G4dn instance</a> has NVIDIA T4 GPU(s) attached. </b><br /></p><p>Go to "<b>Step 3: Configure Instance Details</b>": Auto-assign Public IP=Enable.</p><p>Go to "<b>Step 4: Add Storage</b>": Increase the Root Volume from default 8G to 200G.</p><p></p><p>Go to "<b>Step 6: Configure Security Group</b>": Create a security group with ssh only allowed from your public IP address.</p><p>Eventually "<b>Launch</b>" and select an existing key pair or create a new key pair.<br /></p><h3 style="text-align: left;">2. SSH to the EC2 instance</h3><p>Please follow <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html" rel="nofollow" target="_blank">the Doc on how to ssh to EC2 instance</a>.<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">ssh -i /path/my-key-pair.pem ubuntu@my-instance-public-dns-name<br />sudo su - root</pre>
<h3 style="text-align: left;">3. Install NVIDIA Driver and cudatoolkit<br /></h3><p>Please follow this blog on <a href="http://www.openkb.info/2021/03/how-to-intall-cuda-toolkit-and-nvidia.html" target="_blank">How to intall CUDA Toolkit and NVIDIA Driver on Ubuntu (step by step)</a>.<br /></p><p>Make sure "<i><b>nvidia-smi</b></i>" returns correct results. </p><p>Below is a lazy-man's script to install CUDA 11.0.3 with NVIDIA Driver 450.51.06 on ubuntu x86-64 run by root user after you logon this EC2 machine:<br /></p><p>(Note: Please validate it carefully yourself!)</p>
<pre class="brush:bash; toolbar: false; auto-links: false">apt-get update<br />apt install -y gcc<br />apt-get install -y linux-headers-$(uname -r)<br />wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin<br />mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600<br />wget https://developer.download.nvidia.com/compute/cuda/11.0.3/local_installers/cuda-repo-ubuntu1804-11-0-local_11.0.3-450.51.06-1_amd64.deb<br />dpkg --install cuda-repo-ubuntu1804-11-0-local_11.0.3-450.51.06-1_amd64.deb<br />apt-key add /var/cuda-repo-ubuntu1804-11-0-local/7fa2af80.pub<br />apt-get update<br />apt-get install -y cuda<br />printf "export PATH=/usr/local/cuda/bin\${PATH:+:\${PATH}}\nexport LD_LIBRARY_PATH=/usr/local/cuda/lib64{LD_LIBRARY_PATH:+:\${LD_LIBRARY_PATH}}" >> ~/.bashrc<br />nvidia-smi</pre>
<h3 style="text-align: left;">4. Install a Kubernetes Cluster with NVIDIA GPU<br /></h3><p>Please follow this NVIDIA Doc on <a href="https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html" rel="nofollow" target="_blank">how to install a Kubernetes Cluster </a>with NVIDIA GPU attached.<br /></p><p>Here I choose to use "Option 2" which is to use <i><b>kubeadm</b></i>.</p><h4 style="text-align: left;">4.1 Install Docker<br /></h4><pre class="brush:bash; toolbar: false; auto-links: false">curl https://get.docker.com | sh \<br /> && sudo systemctl --now enable docker</pre>
<h4 style="text-align: left;">4.2 Install kubeadm</h4><p>Please follow this K8s Doc on <a href="https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm" rel="nofollow" target="_blank">how to install kubeadm</a>. </p><h4 style="text-align: left;">4.3 Init a Kubernetes Cluster</h4>
<pre class="brush:bash; toolbar: false; auto-links: false">kubeadm init --pod-network-cidr=192.168.0.0/16</pre>
<p>Then follow the steps printed at the end of the init output (similar to the sketch above) to start using the cluster. <br /></p><h4 style="text-align: left;">4.4 Configure network <br /></h4>
<pre class="brush:bash; toolbar: false; auto-links: false">kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml<br />kubectl taint nodes --all node-role.kubernetes.io/master-</pre>
<h4 style="text-align: left;">4.5 Check the Nodes which should be in "Ready" status <br /></h4><pre class="brush:bash; toolbar: false; auto-links: false"># kubectl get nodes<br />NAME STATUS ROLES AGE VERSION<br />ip-xxx-xxx-xx-xx Ready control-plane,master 11m v1.20.5</pre>
<h4 style="text-align: left;">4.6 Install NVIDIA Container Toolkit (nvidia-docker2)</h4><div style="text-align: left;">Setup the stable repository for the NVIDIA runtime and the GPG key:<br /></div>
<pre class="brush:bash; toolbar: false; auto-links: false">distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \<br /> && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \<br /> && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list</pre>
<p>Then install <b><i>nvidia-docker2</i></b> package and its dependencies: <br /></p><pre class="brush:bash; toolbar: false; auto-links: false">sudo apt-get update \<br /> && sudo apt-get install -y nvidia-docker2</pre>
<p>Add "default-runtime" set to "nvidia" into /etc/docker/daemon.json:</p><pre class="brush:bash; toolbar: false; auto-links: false">{<br /> "default-runtime": "nvidia",<br /> "runtimes": {<br /> "nvidia": {<br /> "path": "/usr/bin/nvidia-container-runtime",<br /> "runtimeArgs": []<br /> }<br /> }<br />}</pre><p>Restart Docker daemon: <br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">sudo systemctl restart docker</pre>
<p>Test a base CUDA container: <br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi</pre>
<h4 style="text-align: left;">4.7 Install NVIDIA Device Plugin <br /></h4><p>Firstly install helm which is the preferred option: <br /></p><pre class="brush:bash; toolbar: false; auto-links: false">curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \<br /> && chmod 700 get_helm.sh \<br /> && ./get_helm.sh</pre>
<p>Add the <i><b>nvidia-device-plugin</b></i> helm repository:</p><pre class="brush:bash; toolbar: false; auto-links: false">helm repo add nvdp https://nvidia.github.io/k8s-device-plugin \<br /> && helm repo update</pre>
<p>Deploy the device plugin:</p><pre class="brush:bash; toolbar: false; auto-links: false">helm install --generate-name nvdp/nvidia-device-plugin</pre>
<p>Check current running PODs to make sure nvidia-device-plugin-xxx POD is running:<br /></p><pre class="brush:bash; toolbar: false; auto-links: false">kubectl get pods -A</pre>
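<p>You can also verify that the node now advertises the GPU as an allocatable resource; "nvidia.com/gpu" should show up under both Capacity and Allocatable with the number of GPUs on the node:</p><pre class="brush:bash; toolbar: false; auto-links: false">kubectl describe node | grep -i "nvidia.com/gpu"</pre>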
<h4 style="text-align: left;">4.8 Test CUDA job <br /></h4><p>Create gpu-pod.yaml with below content: <br /></p><pre class="brush:bash; toolbar: false; auto-links: false">apiVersion: v1<br />kind: Pod<br />metadata:<br /> name: gpu-operator-test<br />spec:<br /> restartPolicy: OnFailure<br /> containers:<br /> - name: cuda-vector-add<br /> image: "nvidia/samples:vectoradd-cuda10.2"<br /> resources:<br /> limits:<br /> nvidia.com/gpu: 1</pre>
<p>Deploy this sample POD:<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">kubectl apply -f gpu-pod.yaml</pre>
<p>After the POD completes successfully, check the logs to double confirm:</p>
<pre class="brush:bash; toolbar: false; auto-links: false;highlight: 1"># kubectl logs gpu-operator-test<br />[Vector addition of 50000 elements]<br />Copy input data from the host memory to the CUDA device<br />CUDA kernel launch with 196 blocks of 256 threads<br />Copy output data from the CUDA device to the host memory<br />Test PASSED<br />Done</pre>
<h3 style="text-align: left;">5. Test a Spark+Rapids on K8s job</h3><p>Please follow this Doc on <a href="https://github.com/NVIDIA/spark-rapids/blob/branch-21.06/docs/get-started/getting-started-kubernetes.md" rel="nofollow" target="_blank">Getting Started with RAPIDS and Kubernetes</a>.</p><p>Please also refer to <a href="https://spark.apache.org/docs/latest/running-on-kubernetes.html" rel="nofollow" target="_blank">Spark on K8s Doc</a> to get familiar with the basics. </p><p>For example, here we assume you know how to create service account and assign proper role to that service account.<br /></p><h4 style="text-align: left;">5.1 Create a service account named "spark" to run spark jobs<br /></h4>
<pre class="brush:bash; toolbar: false; auto-links: false">kubectl create serviceaccount spark<br />kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default</pre>
<h4 style="text-align: left;">5.2 Capture the cluster-info<br /></h4><pre class="brush:bash; toolbar: false; auto-links: false">kubectl cluster-info</pre><p>Take the notes of the "Kubernetes control plane" URL which will be used in spark job.<br /></p><h4 style="text-align: left;">5.3 Run sample spark jobs <br /></h4><p>Follow all the steps in <a href="https://github.com/NVIDIA/spark-rapids/blob/branch-21.06/docs/get-started/getting-started-kubernetes.md" rel="nofollow" target="_blank">Getting Started with RAPIDS and Kubernetes</a> to run sample Spark job in cluster or client mode.</p><p>Here we are using "spark" service account to run the Spark jobs with below extra option:<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark </pre>
<p><br /></p><h1 style="text-align: left;">References:</h1><p><a href="https://spark.apache.org/docs/latest/running-on-kubernetes.html" rel="nofollow" target="_blank">https://spark.apache.org/docs/latest/running-on-kubernetes.html</a></p><p><a href="https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/" rel="nofollow" target="_blank">https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/</a></p><p><a href="https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html" rel="nofollow" target="_blank">https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html</a><br /></p><p><br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-56288279712148724292021-03-25T14:34:00.008-07:002021-04-02T08:38:17.868-07:00concat_ws example on Spark with RAPIDS Accelerator<h1 style="text-align: left;">Goal:</h1><p>This is a quick example of operator <i><b>contact_ws</b></i> on Spark with RAPIDS Accelerator.<br /><span></span></p><a name='more'></a><p></p><h1 style="text-align: left;">Env:</h1><p>Spark 3.1.1</p><p>RAPIDS Accelerator for Apache Spark 0.4.1</p><h1 style="text-align: left;">Solution:</h1><h3 style="text-align: left;">1. <i>concat_ws</i> can convert an Array of Strings to a String with a separator. </h3><p>Below is a quick example using scala:</p><pre class="brush:sql; toolbar: false; auto-links: false">import org.apache.spark.sql.Row<br />import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, ArrayType}<br /><br />val data = Seq(<br /> Row(1, List("orange", "banana", "apple")),<br /> Row(2, List("a", "b", "c"))<br />)<br /><br />val schema = StructType(Array(<br /> StructField("idx",IntegerType,true),<br /> StructField("arrays",ArrayType(StringType),true)<br />))<br /><br />val df = spark.createDataFrame( spark.sparkContext.parallelize(data),schema )<br />val df2 = df.withColumn("concat_array", concat_ws(",",col("arrays")))<br />df2.show()<br />df2.explain()</pre><p>The output with RAPIDS Accelerator for Apache Spark 0.4.1 is :<br /></p>
<pre class="brush:sql; toolbar: false; auto-links: false;highlight: [17,25]">scala> df2.show<br /><br />+---+--------------------+-------------------+<br />|idx| arrays| concat_array|<br />+---+--------------------+-------------------+<br />| 1|[orange, banana, ...|orange,banana,apple|<br />| 2| [a, b, c]| a,b,c|<br />+---+--------------------+-------------------+<br /><br /><br />scala> df2.explain()<br />21/03/25 21:01:48 WARN GpuOverrides:<br />!Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced<br /> @Expression <AttributeReference> idx#2 could run on GPU<br /> @Expression <AttributeReference> arrays#3 could run on GPU<br /> @Expression <Alias> concat_ws(,, arrays#3) AS concat_array#15 could run on GPU<br /> !NOT_FOUND <ConcatWs> concat_ws(,, arrays#3) cannot run on GPU because no GPU enabled version of expression class org.apache.spark.sql.catalyst.expressions.ConcatWs could be found<br /> @Expression <Literal> , could run on GPU<br /> @Expression <AttributeReference> arrays#3 could run on GPU<br /> !NOT_FOUND <RDDScanExec> cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.RDDScanExec could be found<br /> @Expression <AttributeReference> idx#2 could run on GPU<br /> @Expression <AttributeReference> arrays#3 could run on GPU<br /><br />== Physical Plan ==<br />*(1) Project [idx#2, arrays#3, concat_ws(,, arrays#3) AS concat_array#15]<br />+- *(1) Scan ExistingRDD[idx#2,arrays#3]</pre>
<p>As you can see, concat_ws is not supported on RAPIDS Accelerator 0.4.1 since it falls back to CPU.</p><h3 style="text-align: left;">2. <i>concat_ws</i> can concatenate multiple columns together with a separator.<br /></h3><pre class="brush:sql; toolbar: false; auto-links: false">import org.apache.spark.sql.Row<br />import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}<br /><br />val data = Seq(<br /> Row(1, "orange", "banana", "apple"),<br /> Row(2, "a", "b", "c")<br />)<br /><br />val schema = StructType(Array(<br /> StructField("idx",IntegerType,true),<br /> StructField("s1",StringType,true),<br /> StructField("s2",StringType,true),<br /> StructField("s3",StringType,true)<br />))<br /><br />val df = spark.createDataFrame( spark.sparkContext.parallelize(data),schema )<br />val df2 = df.withColumn("concat_array", concat_ws(",",col("idx"), col("s1"), col("s2"), col("s3") ))<br />df2.show()<br />df2.explain()</pre><p>The output with RAPIDS Accelerator for Apache Spark 0.4.1 is : <br /></p>
<pre class="brush:sql; toolbar: false; auto-links: false;highlight: [19,33]">scala> df2.show<br /><br />+---+------+------+-----+--------------------+<br />|idx| s1| s2| s3| concat_array|<br />+---+------+------+-----+--------------------+<br />| 1|orange|banana|apple|1,orange,banana,a...|<br />| 2| a| b| c| 2,a,b,c|<br />+---+------+------+-----+--------------------+<br /><br /><br />scala> df2.explain()<br />21/03/25 21:19:11 WARN GpuOverrides:<br />!Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced<br /> @Expression <AttributeReference> idx#65 could run on GPU<br /> @Expression <AttributeReference> s1#66 could run on GPU<br /> @Expression <AttributeReference> s2#67 could run on GPU<br /> @Expression <AttributeReference> s3#68 could run on GPU<br /> @Expression <Alias> concat_ws(,, cast(idx#65 as string), s1#66, s2#67, s3#68) AS concat_array#73 could run on GPU<br /> !NOT_FOUND <ConcatWs> concat_ws(,, cast(idx#65 as string), s1#66, s2#67, s3#68) cannot run on GPU because no GPU enabled version of expression class org.apache.spark.sql.catalyst.expressions.ConcatWs could be found<br /> @Expression <Literal> , could run on GPU<br /> @Expression <Cast> cast(idx#65 as string) could run on GPU<br /> @Expression <AttributeReference> idx#65 could run on GPU<br /> @Expression <AttributeReference> s1#66 could run on GPU<br /> @Expression <AttributeReference> s2#67 could run on GPU<br /> @Expression <AttributeReference> s3#68 could run on GPU<br /> !NOT_FOUND <RDDScanExec> cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.RDDScanExec could be found<br /> @Expression <AttributeReference> idx#65 could run on GPU<br /> @Expression <AttributeReference> s1#66 could run on GPU<br /> @Expression <AttributeReference> s2#67 could run on GPU<br /> @Expression <AttributeReference> s3#68 could run on GPU<br /><br />== Physical Plan ==<br />*(1) Project [idx#65, s1#66, s2#67, s3#68, concat_ws(,, cast(idx#65 as string), s1#66, s2#67, s3#68) AS concat_array#73]<br />+- *(1) Scan ExistingRDD[idx#65,s1#66,s2#67,s3#68]</pre>
<p>Same here <i><b>concat_ws</b></i> is not supported on RAPIDS Accelerator 0.4.1 since it falls back to CPU.</p><p>Let's compare this scenario to a <b><i>concat</i></b> operator: <br /></p><pre class="brush:sql; toolbar: false; auto-links: false">val df3 = df.withColumn("concat_array", concat(col("idx"), lit(','), col("s1"), lit(','), col("s2"), lit(','), col("s3") ))<br />df3.show()<br />df3.explain()</pre><p>Output for <b><i>concat</i></b> with RAPIDS Accelerator for Apache Spark 0.4.1 is : </p><pre class="brush:sql; toolbar: false; auto-links: false;highlight: 21">scala> df3.show()<br /><br />+---+------+------+-----+--------------------+<br />|idx| s1| s2| s3| concat_array|<br />+---+------+------+-----+--------------------+<br />| 1|orange|banana|apple|1,orange,banana,a...|<br />| 2| a| b| c| 2,a,b,c|<br />+---+------+------+-----+--------------------+<br /><br /><br />scala> df3.explain()<br />21/03/25 21:26:28 WARN GpuOverrides:<br /> !NOT_FOUND <RDDScanExec> cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.RDDScanExec could be found<br /> @Expression <AttributeReference> idx#65 could run on GPU<br /> @Expression <AttributeReference> s1#66 could run on GPU<br /> @Expression <AttributeReference> s2#67 could run on GPU<br /> @Expression <AttributeReference> s3#68 could run on GPU<br /><br />== Physical Plan ==<br />GpuColumnarToRow false<br />+- GpuProject [idx#65, s1#66, s2#67, s3#68, gpuconcat(cast(idx#65 as string), ,, s1#66, ,, s2#67, ,, s3#68) AS concat_array#100]<br /> +- GpuRowToColumnar TargetSize(2147483647)<br /> +- *(1) Scan ExistingRDD[idx#65,s1#66,s2#67,s3#68]</pre><p>Since <b><i>concat</i></b> is a supported operator, as you can see above, it is running on GPU using "GpuProject". <br /></p><p>In this scenario, if you want, you can use <b><i>concat</i></b> to rewrite <b><i>conact_ws</i></b> to make it run on GPU in RAPIDS Accelerator 0.4.1 version. </p><p><b>Note: above tests are based on RAPIDS Accelerator 0.4.1. Future versions should have more supported operators.</b><br /></p><p>For supported operators in RAPIDS Accelerator, please always refer to <a href="https://nvidia.github.io/spark-rapids/docs/supported_ops.html" rel="nofollow" target="_blank">this RAPIDS Accelerator Doc</a>. </p><p><br /></p><p> <br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-26258618642598868922021-03-24T14:50:00.012-07:002021-04-02T08:38:21.146-07:00Hands-on native cuDF Pandas UDF<h1 style="text-align: left;">Goal:</h1><p>This article will help show some hands-on steps to play with native cuDF Pandas UDF on Spark with <a href="https://nvidia.github.io/spark-rapids" rel="nofollow" target="_blank">RAPIDS Accelerator for Apache Spark</a>.<span></span></p><a name='more'></a><p></p><h1 style="text-align: left;">Env:</h1><p>RAPIDS Accelerator for Apache Spark 0.4.1</p><p>Spark 3.1.1</p><p>RTX 6000 GPU<br /></p><h1 style="text-align: left;">Concept:</h1><p>As we know, Spark introduced <a href="https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html" rel="nofollow" target="_blank">Pandas UDFs</a> (a.k.a. 
Vectorized UDFs) feature in Spark 2.3, which brought huge performance gains.</p><p>Here we will introduce the native cuDF version of the Pandas UDF (which can run natively on the GPU) with the RAPIDS Accelerator for Apache Spark enabled.</p><p>The <a href="https://nvidia.github.io/spark-rapids/docs/configs.html" rel="nofollow" target="_blank">parameters</a> below control this behavior:<br /></p><ul style="text-align: left;"><li><b><i>spark.rapids.python.concurrentPythonWorkers</i></b> : Number of Python worker processes that can execute concurrently per GPU. </li><li><b><i>spark.rapids.python.memory.gpu.allocFraction</i></b> : The fraction of total GPU memory that should be initially allocated for pooled memory for all the Python workers. </li><li><b><i>spark.rapids.python.memory.gpu.maxAllocFraction</i></b> : The fraction of total GPU memory that limits the maximum size of the RMM pool for all the Python workers. </li><li><b><i>spark.rapids.python.memory.gpu.pooling.enabled</i></b> : Should RMM in Python workers act as a pooling allocator for GPU memory,
or should it just pass through to CUDA memory allocation directly. </li></ul><p>If we enable this feature, the Python worker processes will share and allocate GPU memory alongside the Spark Executors. Please read <a href="http://www.openkb.info/2021/03/understanding-rapids-accelerator-for_10.html" target="_blank">this post</a> for more details on GPU pool memory allocation for Spark+RAPIDS.<br /></p><p>As a result, we need to divide the GPU memory between the Spark Executors and the Python worker processes.</p><p>Here I am allocating 40% of the GPU memory for Python workers and 50% for Spark Executors by setting the following in spark-defaults.conf:</p>
<pre class="brush:bash; toolbar: false; auto-links: false">spark.rapids.sql.python.gpu.enabled true<br />spark.rapids.memory.gpu.allocFraction 0.5<br />spark.rapids.python.memory.gpu.allocFraction 0.4<br />spark.rapids.python.memory.gpu.maxAllocFraction 0.4</pre>
<p>Then I decide to spin up 2 concurrent Python workers:<br /></p><pre class="brush:bash; toolbar: false; auto-links: false">spark.rapids.python.concurrentPythonWorkers 2</pre>
<p>Since the RTX 6000 has 24G of GPU memory, once the Python workers are running you may see a DEBUG message like the one below in the Executor log:<br /></p><pre class="brush:bash; toolbar: false; auto-links: false">DEBUG: Pooled memory, pool size: 4844.0625 MiB, max size: 8796093022208.0 MiB</pre>
<p>This matches 24G * 0.4 / 2 = 4.8G.</p><p><b>Note: Since the default spark.rapids.memory.gpu.allocFraction=0.9, if the memory allocation is not set up properly, you may hit the error below in some tasks' logs:</b><br /></p><pre class="brush:text; toolbar: false; auto-links: false">MemoryError: std::bad_alloc: RMM failure at:/home/xxx/xxx/envs/rapids-0.18/include/rmm/mr/device/pool_memory_resource.hpp:188: Maximum pool size exceeded</pre><h1 style="text-align: left;">Solution:</h1><h3 style="text-align: left;">1. Python dependency is cuDF </h3><p>Make sure the cuDF library is installed in your Python env.<br /></p><p>You can follow this <a href="https://rapids.ai/start.html" rel="nofollow" target="_blank">rapids.ai getting started guide</a> to install the libraries in your conda env on all nodes. <br /></p><p>For example:</p><pre class="brush:bash; toolbar: false; auto-links: false">conda create -n rapids-0.18 -c rapidsai -c nvidia -c conda-forge \<br /> -c defaults cudf=0.18 python=3.8 cudatoolkit=11.0</pre>
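<p>A quick way to confirm the env is usable (assuming the env name above) is to import cuDF from it:</p><pre class="brush:bash; toolbar: false; auto-links: false">conda activate rapids-0.18<br />python -c "import cudf; print(cudf.__version__)"</pre>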
<p>Note: If you cannot install the cuDF library on all nodes for some reason, then you may need to package the whole conda env and distribute it to all Spark Executors, which could be very time consuming. For example, in <a href="http://www.openkb.info/2021/03/how-to-run-pandas-cudfudf-test-for.html" target="_blank">this post </a>I used this approach to run the test framework. <br /></p><p>After that, make sure the python used by pyspark points to the correct conda env by setting PYSPARK_PYTHON in spark-env.sh on all nodes:</p>
<pre class="brush:bash; toolbar: false; auto-links: false">export PYSPARK_PYTHON=/xxx/xxx/MYGLOBALENV/rapids-0.18/bin/python</pre>
<h3 style="text-align: left;">2. RAPIDS Accelerator for Apache Spark is setup properly<br /></h3><p>I am assuming you have set RAPIDS Accelerator for Apache Spark related parameters properly and RAPIDS Accelerator for Apache Spark is working fine already.<br /></p><p>Especially, the <b><i>spark.driver.extraJavaOptions</i></b>, <b><i>spark.executor.extraJavaOptions</i></b> should use UTC JVM timezone as per <a href="http://www.openkb.info/2021/03/understanding-rapids-accelerator-for_19.html" target="_blank">this post</a>. </p><p><b><i>spark.executor.extraClassPath</i></b> and <b><i>spark.driver.extraClassPath</i></b> should include the cudf jar and rapids-4-spark jar.<br /></p><h3 style="text-align: left;">3. Launch pyspark and test different kinds of UDFs<br /></h3><pre class="brush:python; toolbar: false; auto-links: false">pyspark --conf spark.executorEnv.PYTHONPATH="/home/xxx/spark/rapids/rapids-4-spark_2.12-0.4.1.jar" </pre>
<p>Here make sure you specify the correct jar path for rapids-4-spark jar.</p><p>Import needed python libs and create a sample dataframe:</p><pre class="brush:python; toolbar: false; auto-links: false">import pyspark<br />from pyspark.sql.functions import udf<br />from pyspark.sql.functions import pandas_udf, PandasUDFType<br />import cudf<br />import pandas as pd<br /><br /># Prepare sample data<br />small_data = [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)]<br />df = spark.createDataFrame(small_data, ("id", "v")) </pre><h4 style="text-align: left;">3.a row-at-a-time UDF<br /></h4>
<pre class="brush:python; toolbar: false; auto-links: false"># Use udf to define a row-at-a-time udf<br />@udf('double')<br /># Input/output are both a single double value<br />def plus_one(v):<br /> return v + 1<br /><br />df.withColumn('v2', plus_one(df.v)).show()<br />df.withColumn('v2', plus_one(df.v)).explain()</pre>
<p>Output:</p>
<pre class="brush:python; toolbar: false; auto-links: false; highlight:[6,14]">21/03/24 17:58:42 WARN GpuOverrides:<br /> !NOT_FOUND <BatchEvalPythonExec> cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.python.BatchEvalPythonExec could be found<br /> @Expression <PythonUDF> plus_one(v#1) could not block GPU acceleration<br /> @Expression <AttributeReference> v#1 could run on GPU<br /> @Expression <AttributeReference> pythonUDF0#28 could run on GPU<br /> !NOT_FOUND <RDDScanExec> cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.RDDScanExec could be found<br /> @Expression <AttributeReference> id#0L could run on GPU<br /> @Expression <AttributeReference> v#1 could run on GPU<br /><br />== Physical Plan ==<br />GpuColumnarToRow false<br />+- GpuProject [id#0L, v#1, pythonUDF0#28 AS v2#24]<br /> +- GpuRowToColumnar TargetSize(2147483647)<br /> +- BatchEvalPython [plus_one(v#1)], [pythonUDF0#28]<br /> +- *(1) Scan ExistingRDD[id#0L,v#1]</pre>
<p>As we can see, "BatchEvalPython" is not running on GPU.</p><h4 style="text-align: left;">3.b Pandas UDF</h4><div style="text-align: left;">To test the query plan or performance, we need to disable above cuDF Pandas UDF related parameters such as <b><i>spark.rapids.sql.python.gpu.enabled</i></b>. <br /></div><pre class="brush:python; toolbar: false; auto-links: false"># Use pandas_udf to define a Pandas UDF<br />@pandas_udf('double', PandasUDFType.SCALAR)<br /># Input/output are both a pandas.Series of doubles<br />def pandas_plus_one(v: pd.Series) -> pd.Series:<br /> return v + 1<br /><br />df.withColumn('v2', pandas_plus_one(df.v)).show()<br />df.withColumn('v2', pandas_plus_one(df.v)).explain()</pre><p>Output:</p>
<pre class="brush:python; toolbar: false; auto-links: false;highlight: 5">== Physical Plan ==<br />GpuColumnarToRow false<br />+- GpuProject [id#102L, v#103, pythonUDF0#136 AS v2#132]<br /> +- GpuCoalesceBatches TargetSize(2147483647)<br /> +- GpuArrowEvalPython [pandas_plus_one(v#103)], [pythonUDF0#136], 200<br /> +- GpuRowToColumnar TargetSize(2147483647)<br /> +- *(1) Scan ExistingRDD[id#102L,v#103]</pre>
<p>As we can see, it is done by GpuArrowEvalPython.</p><p>From Spark Executor log, "<b>PythonUDFRunner</b>" is started to do the work.</p><p>When the job is running, the python daemon processes are "<b>pyspark.daemon</b>":</p><pre class="brush:bash; toolbar: false; auto-links: false">python -m pyspark.daemon<br />...<br />python -m pyspark.daemon</pre><h4 style="text-align: left;">3.c cuDF Pandas UDF<br /></h4><pre class="brush:python; toolbar: false; auto-links: false">@pandas_udf('double')<br />def cudf_pandas_plus_one(v: pd.Series) -> pd.Series: <br /> gpu_series = cudf.Series(v)<br /> gpu_series = gpu_series + 1<br /> return gpu_series.to_pandas()<br /><br />df.withColumn('v2', cudf_pandas_plus_one(df.v)).show()<br />df.withColumn('v2', cudf_pandas_plus_one(df.v)).explain()</pre><p>Output:<br /></p>
<pre class="brush:python; toolbar: false; auto-links: false;highlight: 5">== Physical Plan ==<br />GpuColumnarToRow false<br />+- GpuProject [id#0L, v#1, pythonUDF0#74 AS v2#70]<br /> +- GpuCoalesceBatches TargetSize(2147483647)<br /> +- GpuArrowEvalPython [cudf_pandas_plus_one(v#1)], [pythonUDF0#74], 200<br /> +- GpuRowToColumnar TargetSize(2147483647)<br /> +- *(1) Scan ExistingRDD[id#0L,v#1]</pre>
<p>As we can see, it is done by GpuArrowEvalPython. The same plan as above 3.b.</p><p>From Spark Executor log, "<b>GpuArrowPythonRunner</b>" is started to do the work.</p><p>When the job is running, the python daemon processes are "<b>rapids.daemon</b>":</p><pre class="brush:bash; toolbar: false; auto-links: false">python -m rapids.daemon<br />...<br />python -m rapids.daemon</pre><p>For more types of native cuDF pandas UDF, please refer to <a href="https://github.com/NVIDIA/spark-rapids/blob/branch-0.4/integration_tests/src/main/python/udf_cudf_test.py" rel="nofollow" target="_blank">this test python code</a>.<br /></p><h1 style="text-align: left;">References:</h1><p><a href="https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html" rel="nofollow" target="_blank">https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html</a><br /></p><p><br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-87960745958789269272021-03-23T23:51:00.004-07:002021-04-02T08:38:23.371-07:00How to run the pandas cudf_udf test for RAPIDS Accelerator for Apache Spark<h1 style="text-align: left;">Goal:</h1><p>How to run the pandas cudf_udf test for <a href="https://github.com/NVIDIA/spark-rapids" rel="nofollow" target="_blank">RAPIDS Accelerator for Apache Spark</a>.<br /></p><span><a name='more'></a></span><h1 style="text-align: left;">Env:</h1><p>RAPIDS Accelerator for Apache Spark 0.4</p><p>Spark 3.1.1<br /></p><h1 style="text-align: left;">Solution: <br /></h1><h3 style="text-align: left;">1. Compile RAPIDS Accelerator for Apache Spark<br /></h3><h4 style="text-align: left;">1.a Create a conda env for compiling</h4><pre class="brush:bash; toolbar: false; auto-links: false">conda create -n cudftest -c conda-forge python=3.8 pytest pandas pyarrow sre_yield pytest-xdist findspark</pre><p>Here I decide to use one conda env "cudftest" for compiling and use another conda env named "rapids-0.18" to test the cudf_udf in Spark.<br /></p><p>Of course you can choose to use one conda env if you want but it may include too many python packages in the end. </p><p>I just want to keep the conda env "rapids-0.18" to be as small as possible because eventually I need to distribute it to all Executors in Spark cluster. <br /></p><h4 style="text-align: left;">1.b Compile from source code <br /></h4><pre class="brush:bash; toolbar: false; auto-links: false">cd ~/github/spark-rapids<br /># git checkout v0.4.0<br />mvn clean install -DskipTests</pre><p>You can decide which version to compile. Here I am going to compile the 0.15-snapshot which is the current main branch. The current GA release is 0.4 though.</p><h3 style="text-align: left;">2. Run pandas cudf_udf Tests</h3><p>Please follow <a href="https://github.com/NVIDIA/spark-rapids/tree/branch-0.5/integration_tests#enabling-cudf_udf-tests" rel="nofollow" target="_blank">this Doc</a> on how to enable the pandas cudf_udf tests.<br /></p><p>Basically pandas cudf_udf tests are inside "./integration_tests/runtests.py" with option "--cudf_udf".</p><p>The key is to make sure the all the python envs and needed jar file paths are correct.</p><h4 style="text-align: left;">2.a Create a conda env for running cudf_udf tests</h4><p>Please follow the steps mentioned in <a href="https://rapids.ai/start.html">rapids.ai</a> to create the conda env with cudf installed.<br /></p><p>For example: <br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">conda create -n rapids-0.18 -c rapidsai -c nvidia -c conda-forge \<br /> -c defaults cudf=0.18 python=3.7 cudatoolkit=11.0</pre>
<h4 style="text-align: left;">2.b Install needed python packages needed by cudf_udf tests</h4><pre class="brush:bash; toolbar: false; auto-links: false">conda activate rapids-0.18<br />conda install pandas</pre>
<h4 style="text-align: left;">2.c Package your conda env</h4><p>You can refer to <a href="http://alkaline-ml.com/2018-07-02-conda-spark/" rel="nofollow" target="_blank">this blog</a> on how to package your conda env for spark job.<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">cd /home/xxx/miniconda3/envs<br />zip -r rapids-0.18.zip rapids-0.18/<br />mv rapids-0.18.zip ~/<br />cd ~/ && mkdir MYGLOBALENV<br />cd MYGLOBALENV/ && ln -s /home/xxx/miniconda3/envs/rapids-0.18/ rapids-0.18<br />cd ..<br />export PYSPARK_PYTHON=./MYGLOBALENV/rapids-0.18/bin/python</pre>
<h4 style="text-align: left;">2.d Run the pandas cudf_udf tests<br /></h4>
<pre class="brush:bash; toolbar: false; auto-links: false">cd /home/xxx/github/spark-rapids/integration_tests <br />PYSPARK_PYTHON=/home/xxx/MYGLOBALENV/rapids-0.18/bin/python $SPARK_HOME/bin/spark-submit --jars "/home/xxx/github/spark-rapids/dist/target/rapids-4-spark_2.12-0.5.0-SNAPSHOT.jar,/home/xxx/github/spark-rapids/udf-examples/target/rapids-4-spark-udf-examples_2.12-0.5.0-SNAPSHOT.jar,/home/xxx/spark/rapids/cudf.jar,/home/xxx/github/spark-rapids/tests/target/rapids-4-spark-tests_2.12-0.5.0-SNAPSHOT.jar" \<br /> --conf spark.rapids.memory.gpu.allocFraction=0.3 \<br /> --conf spark.rapids.python.memory.gpu.allocFraction=0.3 \<br /> --conf spark.rapids.python.concurrentPythonWorkers=2 \<br /> --py-files "/home/xxx/github/spark-rapids/dist/target/rapids-4-spark_2.12-0.5.0-SNAPSHOT.jar" \<br /> --conf spark.executorEnv.PYTHONPATH="/home/xxx/github/spark-rapids/dist/target/rapids-4-spark_2.12-0.5.0-SNAPSHOT.jar" \<br /> --conf spark.executorEnv.PYSPARK_PYTHON=/home/xxx/rapids-0.18/bin/python \<br /> --archives /home/xxx/rapids-0.18.zip#MYGLOBALENV \<br /> ./runtests.py -m "cudf_udf" -v -rfExXs --cudf_udf </pre>
<p>Note1: Make sure all jar paths are correct.<br /></p><p>Note2: Here I am using spark standalone cluster, that is why I used <b><i>spark.executorEnv.PYSPARK_PYTHON</i></b>. For Spark on YARN, you need to use corresponding parameters such as <b><i>spark.yarn.appMasterEnv.PYSPARK_PYTHON</i></b> .</p><p>Note3: Make sure $SPARK_HOME is set and also the spark cluster is working fine with Rapids for Spark enabled.<br /></p><p>The expected result is: PASSED [100%]. </p><h1 style="text-align: left;">Reference:</h1><p><a href="http://alkaline-ml.com/2018-07-02-conda-spark/" rel="nofollow" target="_blank">http://alkaline-ml.com/2018-07-02-conda-spark/ </a><br /></p><p><br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-38124242423532339472021-03-19T23:24:00.002-07:002021-03-20T15:23:55.206-07:00Understanding RAPIDS Accelerator For Apache Spark's supported timezone<h1 style="text-align: left;">Goal:</h1><p>This article explains the current supported timezone for "<a href="https://nvidia.github.io/spark-rapids/" rel="nofollow" target="_blank">RAPIDS Accelerator For Apache Spark</a>".<span></span></p><a name='more'></a><p></p><h1 style="text-align: left;">Env:</h1><p>RAPIDS Accelerator For Apache Spark 0.4<br /></p><h1 style="text-align: left;">Concept:</h1><p>As per current <a href="https://nvidia.github.io/spark-rapids/docs/compatibility.html" rel="nofollow" target="_blank">0.4 Doc</a> mentions: <br /></p>
<pre class="brush:text; toolbar: false; auto-links: false">operations involving timestamps will only be GPU-accelerated if the time zone used by the JVM is UTC.</pre>
<p>It means if the JVM timezone of the Spark job is not UTC, the operations involving timestamp will be fallback to CPU which result in performance overhead.</p><p>Here it includes non-supported and supported timestamp format conversion.<br /></p><p><b>Note: supported timestamp formats are documented in this <a href="https://nvidia.github.io/spark-rapids/docs/compatibility.html" rel="nofollow" target="_blank">Compatibility doc</a>. </b><br /></p><h1 style="text-align: left;">Test: <br /></h1><p>Below Spark Cluster nodes are using PST timezone. </p><h3 style="text-align: left;">1. PST JVM timezone + supported timestamp format<br /></h3><p>Let's start a spark-shell without any JVM timezone change and run below timestamp conversion on supported format:</p><pre class="brush:sql; toolbar: false; auto-links: false;highlight: [14,15,16,21,30,31,32]">scala> val df_supported = Seq(("2021-12-25 11:11:11")).toDF("ts")<br />df_supported: org.apache.spark.sql.DataFrame = [ts: string]<br /><br />scala> df_supported.write.format("parquet").mode("overwrite").save("/tmp/testts_supported.parquet")<br />21/03/19 21:58:29 WARN GpuOverrides:<br /> !NOT_FOUND <LocalTableScanExec> cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.LocalTableScanExec could be found<br /> @Expression <AttributeReference> ts#4 could run on GPU<br /><br /><br />scala> spark.read.parquet("/tmp/testts_supported.parquet").createOrReplaceTempView("df_supported")<br /><br />scala> spark.sql("select to_timestamp(ts, 'yyyy-MM-dd HH:mm:ss') from df_supported").explain<br />21/03/19 21:58:31 WARN GpuOverrides:<br />!Exec <ProjectExec> cannot run on GPU because unsupported data types in output: TimestampType; not all expressions can be replaced<br /> !Expression <Alias> gettimestamp(ts#7, yyyy-MM-dd HH:mm:ss, Some(UTC), false) AS to_timestamp(ts, yyyy-MM-dd HH:mm:ss)#9 cannot run on GPU because expression Alias gettimestamp(ts#7, yyyy-MM-dd HH:mm:ss, Some(UTC), false) AS to_timestamp(ts, yyyy-MM-dd HH:mm:ss)#9 produces an unsupported type TimestampType; expression GetTimestamp gettimestamp(ts#7, yyyy-MM-dd HH:mm:ss, Some(UTC), false) produces an unsupported type TimestampType<br /> !Expression <GetTimestamp> gettimestamp(ts#7, yyyy-MM-dd HH:mm:ss, Some(UTC), false) cannot run on GPU because expression GetTimestamp gettimestamp(ts#7, yyyy-MM-dd HH:mm:ss, Some(UTC), false) produces an unsupported type TimestampType<br /> @Expression <AttributeReference> ts#7 could run on GPU<br /> @Expression <Literal> yyyy-MM-dd HH:mm:ss could run on GPU<br /><br />== Physical Plan ==<br />*(1) Project [gettimestamp(ts#7, yyyy-MM-dd HH:mm:ss, Some(UTC), false) AS to_timestamp(ts, yyyy-MM-dd HH:mm:ss)#9]<br />+- GpuColumnarToRow false<br /> +- GpuFileGpuScan parquet [ts#7] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/testts_supported.parquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<ts:string><br /><br /><br /><br />scala> spark.sql("select to_timestamp(ts, 'yyyy-MM-dd HH:mm:ss') from df_supported").show<br />21/03/19 21:58:31 WARN GpuOverrides:<br /> !Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced<br /> @Expression <Alias> cast(gettimestamp(ts#7, yyyy-MM-dd HH:mm:ss, Some(UTC), false) as string) AS to_timestamp(ts, yyyy-MM-dd HH:mm:ss)#14 could run on GPU<br /> @Expression <Cast> cast(gettimestamp(ts#7, yyyy-MM-dd HH:mm:ss, Some(UTC), false) as string) could run on GPU<br /> !Expression <GetTimestamp> gettimestamp(ts#7, 
yyyy-MM-dd HH:mm:ss, Some(UTC), false) cannot run on GPU because expression GetTimestamp gettimestamp(ts#7, yyyy-MM-dd HH:mm:ss, Some(UTC), false) produces an unsupported type TimestampType<br /> @Expression <AttributeReference> ts#7 could run on GPU<br /> @Expression <Literal> yyyy-MM-dd HH:mm:ss could run on GPU<br /><br />+-------------------------------------+<br />|to_timestamp(ts, yyyy-MM-dd HH:mm:ss)|<br />+-------------------------------------+<br />| 2021-12-25 11:11:11|<br />+-------------------------------------+</pre>
<p>As you can see above, the operation "to_timestamp" fallback to CPU mode with the keyword in the query plan -- "<span style="color: red;">Project</span>".</p><p>From Spark UI's query plan, we can see "GpuColumnarToRow" and "GpuRowToColumnar".</p><p>This indicates performance overhead since data is moved between GPU and CPU:<br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAZcjnhHstW1-KFrG5lyQhl-Zm6rF0U6mldv1JPr2pNt5vbJIctNBkgszcvuC5BKMNFtGtsc5vZkNEUio_zc7V6umKF3oxx_GcO5ZhNbJggV_EpHMyXEWR7oH1rrxk4_BfkQXmVrzxKk0/s492/Screen+Shot+2021-03-19+at+11.08.39+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="492" data-original-width="234" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAZcjnhHstW1-KFrG5lyQhl-Zm6rF0U6mldv1JPr2pNt5vbJIctNBkgszcvuC5BKMNFtGtsc5vZkNEUio_zc7V6umKF3oxx_GcO5ZhNbJggV_EpHMyXEWR7oH1rrxk4_BfkQXmVrzxKk0/w304-h640/Screen+Shot+2021-03-19+at+11.08.39+PM.png" width="304" /></a></div><p></p><h3 style="text-align: left;">2. UTC JVM timezone + supported timestamp format</h3><p>To make supported timestamp operation work, we do not need to change the timezone of the machines if the machine timezone is not UTC.<br /></p><p>We just need to change the JVM timezone for driver and executor.</p><p>The method is described in <a href="https://github.com/NVIDIA/spark-rapids/blob/branch-0.4/integration_tests/README.md" rel="nofollow" target="_blank">this Doc</a>:</p><ul style="text-align: left;"><li>spark.driver.extraJavaOptions should include -Duser.timezone=UTC</li><li>spark.executor.extraJavaOptions should include -Duser.timezone=UTC</li><li>spark.sql.session.timeZone=UTC <br /></li></ul><p>Then run the same tests in spark-shell after changing JVM timezone to UTC: <br /></p><pre class="brush:text; toolbar: false; auto-links: false">spark-shell --conf spark.sql.session.timeZone=UTC --conf "spark.driver.extraJavaOptions=-Duser.timezone=UTC" --conf "spark.executor.extraJavaOptions=-Duser.timezone=UTC"</pre>
<pre class="brush:sql; toolbar: false; auto-links: false;highlight: 15">scala> val df_supported = Seq(("2021-12-25 11:11:11")).toDF("ts")<br />df_supported: org.apache.spark.sql.DataFrame = [ts: string]<br /><br />scala> df_supported.write.format("parquet").mode("overwrite").save("/tmp/testts_supported.parquet")<br />21/03/20 06:11:56 WARN GpuOverrides:<br /> !NOT_FOUND <LocalTableScanExec> cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.LocalTableScanExec could be found<br /> @Expression <AttributeReference> ts#4 could run on GPU<br /><br /><br />scala> spark.read.parquet("/tmp/testts_supported.parquet").createOrReplaceTempView("df_supported")<br /><br />scala> spark.sql("select to_timestamp(ts, 'yyyy-MM-dd HH:mm:ss') from df_supported").explain<br />== Physical Plan ==<br />GpuColumnarToRow false<br />+- GpuProject [gpugettimestamp(ts#7, yyyy-MM-dd HH:mm:ss, yyyy-MM-dd HH:mm:ss, %Y-%m-%d %H:%M:%S, None) AS to_timestamp(ts, yyyy-MM-dd HH:mm:ss)#9]<br /> +- GpuFileGpuScan parquet [ts#7] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/testts_supported.parquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<ts:string><br /><br /><br /><br />scala> spark.sql("select to_timestamp(ts, 'yyyy-MM-dd HH:mm:ss') from df_supported").show<br />+-------------------------------------+<br />|to_timestamp(ts, yyyy-MM-dd HH:mm:ss)|<br />+-------------------------------------+<br />| 2021-12-25 11:11:11|<br />+-------------------------------------+</pre>
<p>As you can see above, the operation "to_timestamp" now runs in GPU mode with the keyword in the query plan -- "<span style="color: red;">GpuProject</span>". Spark UI shows the same:<br /></p><p> </p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYb1A7VBJu7jrC4Nbnx-OI7G7C6CNMQf_02mMtU7A2GvcssyuyNuuwoYTO72kmQ7dSliE4bHnL7hCVx21-EYpvhObmUJpm5RnedfC5esLoHOde4gNmQ-6gFKwepMQLVVXibJ6OouOItbc/s140/Screen+Shot+2021-03-19+at+11.14.48+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="140" data-original-width="134" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYb1A7VBJu7jrC4Nbnx-OI7G7C6CNMQf_02mMtU7A2GvcssyuyNuuwoYTO72kmQ7dSliE4bHnL7hCVx21-EYpvhObmUJpm5RnedfC5esLoHOde4gNmQ-6gFKwepMQLVVXibJ6OouOItbc/w306-h320/Screen+Shot+2021-03-19+at+11.14.48+PM.png" width="306" /></a></div><br /><p></p><h3 style="text-align: left;">3. UTC JVM timezone + non-supported timestamp format</h3><p>For non-supported timestamp format, it will still fallback to CPU mode.</p><p>For example: "MMM" is not supported in 0.4. <br /></p>
<pre class="brush:sql; toolbar: false; auto-links: false;highlight: [14,16,29,32]">scala> val df_notsupported = Seq(("2021-Dec-25 11:11:11")).toDF("ts")<br />df_notsupported: org.apache.spark.sql.DataFrame = [ts: string]<br /><br />scala> df_notsupported.write.format("parquet").mode("overwrite").save("/tmp/testts_notsupported.parquet")<br />21/03/20 06:15:49 WARN GpuOverrides:<br /> !NOT_FOUND <LocalTableScanExec> cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.LocalTableScanExec could be found<br /> @Expression <AttributeReference> ts#22 could run on GPU<br /><br /><br />scala> spark.read.parquet("/tmp/testts_notsupported.parquet").createOrReplaceTempView("df_notsupported")<br /><br />scala> spark.sql("select to_timestamp(ts, 'yyyy-MMM-dd HH:mm:ss') from df_notsupported").explain<br />21/03/20 06:15:50 WARN GpuOverrides:<br />!Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced<br /> @Expression <Alias> gettimestamp(ts#25, yyyy-MMM-dd HH:mm:ss, Some(UTC), false) AS to_timestamp(ts, yyyy-MMM-dd HH:mm:ss)#27 could run on GPU<br /> !Expression <GetTimestamp> gettimestamp(ts#25, yyyy-MMM-dd HH:mm:ss, Some(UTC), false) cannot run on GPU because incompatible format 'yyyy-MMM-dd HH:mm:ss'. Set spark.rapids.sql.incompatibleDateFormats.enabled=true to force onto GPU.<br /> @Expression <AttributeReference> ts#25 could run on GPU<br /> @Expression <Literal> yyyy-MMM-dd HH:mm:ss could run on GPU<br /><br />== Physical Plan ==<br />*(1) Project [gettimestamp(ts#25, yyyy-MMM-dd HH:mm:ss, Some(UTC), false) AS to_timestamp(ts, yyyy-MMM-dd HH:mm:ss)#27]<br />+- GpuColumnarToRow false<br /> +- GpuFileGpuScan parquet [ts#25] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/testts_notsupported.parquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<ts:string><br /><br /><br /><br />scala> spark.sql("select to_timestamp(ts, 'yyyy-MMM-dd HH:mm:ss') from df_notsupported").show<br />21/03/20 06:15:51 WARN GpuOverrides:<br /> !Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced<br /> @Expression <Alias> cast(gettimestamp(ts#25, yyyy-MMM-dd HH:mm:ss, Some(UTC), false) as string) AS to_timestamp(ts, yyyy-MMM-dd HH:mm:ss)#32 could run on GPU<br /> @Expression <Cast> cast(gettimestamp(ts#25, yyyy-MMM-dd HH:mm:ss, Some(UTC), false) as string) could run on GPU<br /> !Expression <GetTimestamp> gettimestamp(ts#25, yyyy-MMM-dd HH:mm:ss, Some(UTC), false) cannot run on GPU because incompatible format 'yyyy-MMM-dd HH:mm:ss'. Set spark.rapids.sql.incompatibleDateFormats.enabled=true to force onto GPU.<br /> @Expression <AttributeReference> ts#25 could run on GPU<br /> @Expression <Literal> yyyy-MMM-dd HH:mm:ss could run on GPU<br /><br />+--------------------------------------+<br />|to_timestamp(ts, yyyy-MMM-dd HH:mm:ss)|<br />+--------------------------------------+<br />| 2021-12-25 11:11:11|<br />+--------------------------------------+</pre>
<p> </p><p>Below are test code for pyspark users: <br /></p><pre class="brush:python; toolbar: false; auto-links: false">from pyspark.sql.functions import to_timestamp<br />from pyspark.sql import Row<br />df_supported=sc.parallelize([Row(ts='2021-12-25 11:11:11')]).toDF()<br />df_supported.write.format("parquet").mode("overwrite").save("/tmp/testts_supported.parquet")<br />spark.read.parquet('/tmp/testts_supported.parquet').createOrReplaceTempView("df_supported")<br />spark.sql("select to_timestamp(ts, 'yyyy-MM-dd HH:mm:ss') from df_supported").explain()<br />spark.sql("select to_timestamp(ts, 'yyyy-MM-dd HH:mm:ss') from df_supported").show()<br /><br />df_notsupported=sc.parallelize([Row(ts='2021-Dec-25 11:11:11')]).toDF()<br />df_notsupported.write.format("parquet").mode("overwrite").save("/tmp/testts_notsupported.parquet")<br />spark.read.parquet('/tmp/testts_notsupported.parquet').createOrReplaceTempView("df_notsupported")<br />spark.sql("select to_timestamp(ts, 'yyyy-MMM-dd HH:mm:ss') from df_notsupported").explain()<br />spark.sql("select to_timestamp(ts, 'yyyy-MMM-dd HH:mm:ss') from df_notsupported").show()<br /></pre><p>Note: there is one parameter "<b><i>spark.rapids.sql.incompatibleDateFormats.enabled</i></b>" which does below:</p><p>"When parsing strings as dates and timestamps in functions like unix_timestamp, setting this to true will force all parsing onto GPU even for formats that can result in incorrect results when parsing invalid inputs."<br /></p><p> </p><p><br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-38409995135123544102021-03-18T22:50:00.003-07:002021-03-18T22:50:55.496-07:00Spark Tuning -- Adaptive Query Execution(3): Dynamically optimizing skew joins<h1 style="text-align: left;">Goal:</h1><p>This article explains Adaptive Query Execution (AQE)'s "Dynamically optimizing skew joins" feature introduced in Spark 3.0. </p>This is a follow up article for <a href="http://www.openkb.info/2021/03/spark-tuning-adaptive-query-execution1.html" target="_blank">Spark Tuning -- Adaptive Query Execution(1): Dynamically coalescing shuffle partitions</a>, and <a href="http://www.openkb.info/2021/03/spark-tuning-adaptive-query-execution2.html" rel="nofollow" target="_blank">Spark Tuning -- Adaptive Query Execution(2): Dynamically switching join strategies</a>.<span><a name='more'></a></span><h1 style="text-align: left;">Env:</h1><p>Spark 3.0.2 <br /></p><h1 style="text-align: left;">Concept:</h1><p>This article focuses on 3rd feature "Dynamically optimizing skew joins" in AQE.</p>As <a href="https://spark.apache.org/docs/latest/sql-performance-tuning.html" rel="nofollow" target="_blank">Spark Performance Tuning guide</a> described: <p>This feature dynamically handles skew in sort-merge join by splitting (and replicating if needed) skewed tasks into roughly evenly sized tasks. 
</p><p>Below picture from <a href="https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html" rel="nofollow" target="_blank">databricks blog</a> describes well:<br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjW5tIi-lPgUg_oT2VBj5Upsouo1U-5CXJT0X6BLWaqS8OmiOlw8RQERs7okn8bwN2AOhJQaBVHCjCqZyqmbWbJdY3zVG-Wa0n3NNhk0Rkte6BjjKhIMbYX3qStWUI9v8yhVVVuyZkUmOg/s1194/blog-adaptive-query-execution-6.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="476" data-original-width="1194" height="256" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjW5tIi-lPgUg_oT2VBj5Upsouo1U-5CXJT0X6BLWaqS8OmiOlw8RQERs7okn8bwN2AOhJQaBVHCjCqZyqmbWbJdY3zVG-Wa0n3NNhk0Rkte6BjjKhIMbYX3qStWUI9v8yhVVVuyZkUmOg/w640-h256/blog-adaptive-query-execution-6.png" width="640" /></a></div>Below 2 parameters determines a "skew partition". It has to meet both of below 2 conditions:<p></p><ul style="text-align: left;"><li>a. Its partition size > <b><i>spark.sql.adaptive.skewJoin.skewedPartitionFactor </i></b>(default=10) * "median partition size"</li><li>b. Its partition size > <b><i>spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes </i></b>(default = 256MB)</li></ul><p>The source code of this feature is inside <a href="https://github.com/apache/spark/blob/v3.0.2/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala" rel="nofollow" target="_blank">org.apache.spark.sql.execution.adaptive.OptimizeSkewedJoin</a> .</p><p>Before doing below tests, we can enable log4j DEBUG for above java class so that it can help print the sizes of those partitions. For example, we can put below line in log4j.properties:<br /></p>
<pre class="brush:text; toolbar: false; auto-links: false">log4j.logger.org.apache.spark.sql.execution.adaptive.OptimizeSkewedJoin=DEBUG</pre>
<div style="text-align: left;">And then ask executor to use this log4j file: <br /></div><pre class="brush:text; toolbar: false; auto-links: false">spark.executor.extraJavaOptions '-Dlog4j.configuration=$SPARK_HOME/conf/log4j.properties'</pre>
<h1 style="text-align: left;">Solution:</h1><p>As per databricks blog "<a href="https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html" rel="nofollow" target="_blank">Adaptive Query Execution: Speeding Up Spark SQL at Runtime</a>", it has a pretty good <a href="https://docs.databricks.com/_static/notebooks/aqe-demo.html?_ga=2.133851022.1405204434.1615827502-183867879.1614812672" rel="nofollow" target="_blank">demo notebook</a> which I will use for the following tests. The query which contains skew data is:</p>
<pre class="brush:sql; toolbar: false; auto-links: false">use aqe_demo_db;<br /><br />SELECT s_date, sum(s_quantity * i_price) AS total_sales<br />FROM sales<br />JOIN items ON i_item_id = s_item_id<br />GROUP BY s_date<br />ORDER BY total_sales DESC;</pre>
<h3 style="text-align: left;">1. AQE off <br /></h3><p>This is default run without AQE. Query duration is 6.4min in my test lab.</p><p>Because data skew exists in "sales" table with "s_item_id=100"(80% of the data), the default run will result in a long running SortMergeJoin(SMJ). </p><p>One task in the Shuffle Phase will Shuffle Read 5.8GB data while other 199 tasks only read 14.3MB data in average. It also result in huge spilling on disk.<br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoEtIE8-2Ty6fBFemyPB_EM8qn5tFNTmH0NtTG5lqVqTG9ku2xhlxEwO78EXGhYrQG580Qxbs1UegBjzcVRC2jc42L7jeG49in6l-pHMWVR6Jkhi4McguepE8Ic6ctBEzzJspvm5ZKYEw/s868/Screen+Shot+2021-03-18+at+9.56.13+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="256" data-original-width="868" height="188" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoEtIE8-2Ty6fBFemyPB_EM8qn5tFNTmH0NtTG5lqVqTG9ku2xhlxEwO78EXGhYrQG580Qxbs1UegBjzcVRC2jc42L7jeG49in6l-pHMWVR6Jkhi4McguepE8Ic6ctBEzzJspvm5ZKYEw/w640-h188/Screen+Shot+2021-03-18+at+9.56.13+PM.png" width="640" /></a></div>Spilling monitoring:<p></p>
<pre class="brush:bash; toolbar: false; auto-links: false">$ pwd<br />/tmp/spark-40b20c3b-7f04-4cc4-9134-7d64be53f919/executor-473da0f7-2d70-4565-91e7-ba5f3ea12a8a/blockmgr-78d16f8e-fc7b-4190-a08e-f96f57aabf97<br />$ find . -name *.*<br />.<br />./20/shuffle_2_91_0.index<br />./20/shuffle_2_219_0.data<br />./34/shuffle_2_170_0.index<br />./34/shuffle_2_192_0.index<br />...</pre>
<p>Jstack on executor process also shows:<br /></p><pre class="brush:bash; toolbar: false; auto-links: false">"Executor task launch worker for task 90.0 in stage 2.0 (TID 124)" #54 daemon prio=5 os_prio=0 cpu=222822.75ms elapsed=237.59s tid=0x00007f81000d2000 nid=0xe50 runnable [0x00007f8150e18000]<br /> java.lang.Thread.State: RUNNABLE<br /> at net.jpountz.xxhash.XXHashJNI.XXH32_update(Native Method)<br /> at net.jpountz.xxhash.StreamingXXHash32JNI.update(StreamingXXHash32JNI.java:67)<br /> - locked <0x0000000735011230> (a net.jpountz.xxhash.StreamingXXHash32JNI)<br /> at net.jpountz.xxhash.StreamingXXHash32$1.update(StreamingXXHash32.java:119)<br /> at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:206)<br /> at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:176)<br /> at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:260)<br /> at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:136)<br /> at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:544)<br /> at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:228)<br /> at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:208)<br /> - locked <0x0000000581600ea8> (a org.apache.spark.memory.TaskMemoryManager)<br /> at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:289)<br /> at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:95)<br /> at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:361)<br /> at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.allocateMemoryForRecordIfNecessary(UnsafeExternalSorter.java:417)<br /> at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:455)<br /> at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:138)<br /> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.sort_addToSorter_0$(Unknown Source)<br /> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)<br /> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)<br /> at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)<br /> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.findNextInnerJoinRows$(Unknown Source)<br /> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.agg_doAggregateWithKeys_0$(Unknown Source)<br /> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown Source)<br /> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)<br /> at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:774)<br /> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)<br /> at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)<br /> at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)<br /> at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)<br /> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)<br /> at org.apache.spark.scheduler.Task.run(Task.scala:131)<br /> at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)<br /> at org.apache.spark.executor.Executor$TaskRunner$$Lambda$539/0x00000008404f9440.apply(Unknown Source)<br /> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)<br /> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)<br /> at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.10/ThreadPoolExecutor.java:1128)<br /> at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.10/ThreadPoolExecutor.java:628)<br /> at java.lang.Thread.run(java.base@11.0.10/Thread.java:834)</pre>
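<p>Before relying on AQE, it can also help to confirm which join-key values are actually causing the skew. The snippet below is only a sketch: the table name "factTable" and the key column "join_key" are hypothetical placeholders, not names used in this test.</p>
<pre class="brush:scala; toolbar: false; auto-links: false">// Sketch: count rows per join key on the larger side of the join and look at the heaviest keys.
// "factTable" and "join_key" are hypothetical names -- replace them with your own table and key column.
import org.apache.spark.sql.functions.desc

val keyCounts = spark.table("factTable")
  .groupBy("join_key")
  .count()
  .orderBy(desc("count"))

keyCounts.show(10, false)   // the top rows are the candidate skewed keys</pre>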
<h3 style="text-align: left;"> 2. AQE on<br /></h3>
<pre class="brush:sql; toolbar: false; auto-links: false">set spark.sql.adaptive.enabled=true;</pre>
<p>This is the default run with AQE on. Query duration is 2.4min in my test lab.</p><p>The debug log below shows that the skewed partition is about 6GB, and AQE splits it into 30 parts:<br /></p><pre class="brush:bash; toolbar: false; auto-links: false">Optimizing skewed join.<br />Left side partitions size info:<br />median size: 13972650, max size: 6517549080, min size: 13972650, avg size: 46499052<br />Right side partitions size info:<br />median size: 1549072, max size: 1549072, min size: 1549072, avg size: 1549072<br /><br />DEBUG OptimizeSkewedJoin: Left side partition 23 (6 GB) is skewed, split it into 30 parts.<br />DEBUG OptimizeSkewedJoin: number of skewed partitions: left 1, right 0</pre>
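<p>(In case these DEBUG messages do not show up in your logs: they are emitted by the OptimizeSkewedJoin rule on the driver. Below is a sketch of one way to surface them from spark-shell; the logger name assumes the Spark 3.0.x package, so please verify it against your Spark version.)</p>
<pre class="brush:scala; toolbar: false; auto-links: false">// Sketch: enable DEBUG logging only for the AQE skew join rule (Spark 3.0.x still ships log4j 1.x).
// The logger name assumes the class org.apache.spark.sql.execution.adaptive.OptimizeSkewedJoin.
import org.apache.log4j.{Level, Logger}

Logger.getLogger("org.apache.spark.sql.execution.adaptive.OptimizeSkewedJoin").setLevel(Level.DEBUG)</pre>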
<p>Extra "<span style="color: red;">CustomShuffleReader</span>" also shows skew partition information. </p><p>This stage has 81 partitions, which include 51 normal partitions + 30 skewed partitions.</p><p>It means, if AQE did not trigger this skew optimization, the original partition size should be <b>52</b>. (Remember this number -- 52 because it will show up later.)<br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoCxykUR5vbxx3qtqKAzekVynZrEB_9bxiBIT5wnoz111bapC-Nx_xxJ6WZjFLkwR8zLmLEIXj1HY152Jeed2_1PVaEUzqgcmUKexN4eovh0ghcfkmBP7yCrMZ4IjKoatMlDdhI3Y1-PY/s1009/Screen+Shot+2021-03-18+at+10.06.59+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1009" data-original-width="975" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoCxykUR5vbxx3qtqKAzekVynZrEB_9bxiBIT5wnoz111bapC-Nx_xxJ6WZjFLkwR8zLmLEIXj1HY152Jeed2_1PVaEUzqgcmUKexN4eovh0ghcfkmBP7yCrMZ4IjKoatMlDdhI3Y1-PY/w618-h640/Screen+Shot+2021-03-18+at+10.06.59+PM.png" width="618" /></a></div><p></p><h3 style="text-align: left;"> 3. AQE on + increased spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes<br /></h3><pre class="brush:sql; toolbar: false; auto-links: false">set spark.sql.adaptive.enabled=true;<br />set spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=6517549081;<br /></pre>
<p>Query duration is 6.6min in my test lab. <br /></p><p>Here I am testing spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes, so I simply set it to 1 + <the max partition size above> (6517549080 + 1 = 6517549081). The goal is to not trigger the skew join optimization.<br /></p><p>The debug log below shows that no skewed partition was found.<br /></p><pre class="brush:sql; toolbar: false; auto-links: false">DEBUG OptimizeSkewedJoin:<br />Optimizing skewed join.<br />Left side partitions size info:<br />median size: 13972650, max size: 6517549080, min size: 13972650, avg size: 46499052<br />Right side partitions size info:<br />median size: 1549072, max size: 1549072, min size: 1549072, avg size: 1549072<br /><br />DEBUG OptimizeSkewedJoin: number of skewed partitions: left 0, right 0<br />DEBUG OptimizeSkewedJoin: OptimizeSkewedJoin rule is not applied due to additional shuffles will be introduced.<br /></pre>
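<p>Why does raising the threshold to 6517549081 disable the optimization? My understanding (a sketch based on reading OptimizeSkewedJoin in Spark 3.0.x, so please verify against your version) is that a partition is only treated as skewed when it is both larger than spark.sql.adaptive.skewJoin.skewedPartitionFactor (default 5) times the median partition size and larger than spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes (default 256MB):</p>
<pre class="brush:scala; toolbar: false; auto-links: false">// Sketch of the per-partition skew check (my reading of Spark 3.0.x OptimizeSkewedJoin, not the exact source):
def isSkewed(size: Long, medianSize: Long, factor: Double, thresholdBytes: Long): Boolean =
  size > medianSize * factor && size > thresholdBytes

// Numbers taken from the debug logs above:
val maxSize    = 6517549080L   // the ~6GB skewed partition
val medianSize = 13972650L     // ~13MB median partition size

isSkewed(maxSize, medianSize, 5, 256L * 1024 * 1024)  // true  -> test #2 splits this partition
isSkewed(maxSize, medianSize, 5, 6517549081L)         // false -> test #3 leaves it alone</pre>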
<p>Now we can see "<span style="color: red;">CustomShuffleReader</span>" only spawns <b>52</b> partitions from UI:<br /></p><p style="text-align: left;"></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgP21o421xlmPtJiwy3zQ2t_GVXVBvKiOze6Av5ST32YBLKQZijgrZwensZhx0FnSWVt4b5FmUHyiUVc5qZCpM1VyIw1mNoDGXtUdUn1gYjEvPdGjLuwodgk84fda_PJvnxPW1bNeJFles/s917/Screen+Shot+2021-03-18+at+10.16.52+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="635" data-original-width="917" height="444" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgP21o421xlmPtJiwy3zQ2t_GVXVBvKiOze6Av5ST32YBLKQZijgrZwensZhx0FnSWVt4b5FmUHyiUVc5qZCpM1VyIw1mNoDGXtUdUn1gYjEvPdGjLuwodgk84fda_PJvnxPW1bNeJFles/w640-h444/Screen+Shot+2021-03-18+at+10.16.52+PM.png" width="640" /></a></div><p></p><h3 style="text-align: left;"> 4. GPU Mode with AQE off</h3><p>Now let's try the same minimum query using <a href="https://nvidia.github.io/spark-rapids/" rel="nofollow" target="_blank">Rapids for Spark Accelerator</a>(current release 0.3) + Spark. </p><p>Query duration is 26s in my test lab. (Yes only 26s without AQE on!)<br /></p><p>No debug log triggered since AQE is off.</p><p>GPU Mode will trigger GPU version ShuffleHashJoin(SHJ) which is super fast even without AQE:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQ42jB0n38bso2d9iACGRwBnJ6FG2pgfcjfEJuSRcrJgTytzs965U_ArNh0u55xMpsk97QjPvrQK05WAnLqk88SzfI_qvQjt7Hd88Se0IpxGUfOOZKSLmzSuPgXZbjbuihJEM8Qq_6eeA/s1064/Screen+Shot+2021-03-18+at+10.36.16+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1064" data-original-width="876" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQ42jB0n38bso2d9iACGRwBnJ6FG2pgfcjfEJuSRcrJgTytzs965U_ArNh0u55xMpsk97QjPvrQK05WAnLqk88SzfI_qvQjt7Hd88Se0IpxGUfOOZKSLmzSuPgXZbjbuihJEM8Qq_6eeA/w526-h640/Screen+Shot+2021-03-18+at+10.36.16+PM.png" width="526" /></a></div><p>There are only 2 partitions/tasks for shuffle stage. </p><p>From the Stage-20 metrics below we can see even though there is huge data skew, the skewed task only took 15s to compete. Thanks to Apache Arrow columnar memory format. <br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjnXaW_vklQp3-Cy2yCkt02YbqU2NTz3B6-RgXe2Dti6rEmWRRDVeNpfHUDW7CKHyZTALrfyJi_UxCSIdGWA66_YKT5ti8wEJWCVIErOKBXVcbtfOkvCf-2iMXyn8VO2dZcGA-sRkSabvs/s699/Screen+Shot+2021-03-18+at+10.36.53+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="251" data-original-width="699" height="230" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjnXaW_vklQp3-Cy2yCkt02YbqU2NTz3B6-RgXe2Dti6rEmWRRDVeNpfHUDW7CKHyZTALrfyJi_UxCSIdGWA66_YKT5ti8wEJWCVIErOKBXVcbtfOkvCf-2iMXyn8VO2dZcGA-sRkSabvs/w640-h230/Screen+Shot+2021-03-18+at+10.36.53+PM.png" width="640" /></a></div><h3 style="text-align: left;">5. GPU Mode with AQE on<br /></h3><pre class="brush:sql; toolbar: false; auto-links: false">set spark.sql.adaptive.enabled = true;</pre><p>Query duration is 25s in my test lab. <br /></p><p>Below debug log shows smaller partition sizes under GPU mode comparing to CPU mode:<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">DEBUG OptimizeSkewedJoin:<br />Optimizing skewed join.<br />Left side partitions size info:<br />median size: 3645779055, max size: 6266874120, min size: 1024683990, avg size: 3645779055<br />Right side partitions size info:<br />median size: 112912836, max size: 112912836, min size: 112912836, avg size: 112912836<br /><br />DEBUG OptimizeSkewedJoin: number of skewed partitions: left 0, right 0<br />DEBUG OptimizeSkewedJoin: OptimizeSkewedJoin rule is not applied due to additional shuffles will be introduced.</pre>
<p>Here is because GPU mode does not have SMJ implemented yet as of today. So this AQE feature can not apply here. That is why you see no skewed partition found and it is still using GPU version ShuffleHashJoin.<br /></p><p>However the query plan is a little different here, and AQE does spawns 2 extra "<span style="color: red;">GpuCustomShuffleReader</span>":<br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdXZQEh7-5TQnr_AdIvQHcNeIuqHYLIfJz0PUig2Yvjr9XUkIphDz4b0qqwHwC8mtnO2ZSs85wpjgFUaFQ2g9VgBjxZcWBzSySnxD6SMHYKAH6ZgzWQJi6aKS7wnEhlDh33ddiwTohWsw/s919/Screen+Shot+2021-03-18+at+10.23.47+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="919" data-original-width="815" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdXZQEh7-5TQnr_AdIvQHcNeIuqHYLIfJz0PUig2Yvjr9XUkIphDz4b0qqwHwC8mtnO2ZSs85wpjgFUaFQ2g9VgBjxZcWBzSySnxD6SMHYKAH6ZgzWQJi6aKS7wnEhlDh33ddiwTohWsw/w568-h640/Screen+Shot+2021-03-18+at+10.23.47+PM.png" width="568" /></a></div><h1 style="text-align: left;">Reference:</h1><ul><li><a href="https://docs.databricks.com/_static/notebooks/aqe-demo.html?_ga=2.133851022.1405204434.1615827502-183867879.1614812672" rel="nofollow" target="_blank">https://docs.databricks.com/_static/notebooks/aqe-demo.html?_ga=2.133851022.1405204434.1615827502-183867879.1614812672</a> </li><li><a href="https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html" rel="nofollow" target="_blank">https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html</a> </li><li><a href="https://spark.apache.org/docs/latest/sql-performance-tuning.html" rel="nofollow" target="_blank">https://spark.apache.org/docs/latest/sql-performance-tuning.html</a> <br /></li></ul><p><br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-19559343512892058792021-03-18T17:18:00.005-07:002021-03-18T17:20:15.389-07:00What Dataset API is not supported for RAPIDS Accelerator for Apache Spark<h1 style="text-align: left;">Goal:</h1><div style="text-align: left;">This article explains what Dataset API is not supported for RAPIDS Accelerator for Apache Spark.<span><a name='more'></a></span></div><h1 style="text-align: left;">Env:</h1><div style="text-align: left;">Spark 3.0.2</div><div style="text-align: left;">RAPIDS Accelerator for Apache Spark 0.3</div><h1 style="text-align: left;">Solution: </h1><div style="text-align: left;">Currently <a href="https://nvidia.github.io/spark-rapids/" rel="nofollow" target="_blank">RAPIDS Accelerator for Apache Spark</a> does not support Dataset API but does support Dataframe API.</div><div style="text-align: left;">As we know, basically Dataframe is Dataset[ROW], then what does it mean? </div><div style="text-align: left;">In general the difference is that Dataset API can provide type-safety at compile time and also typed JVM objects comparing to Dataframe API.<br /></div><div style="text-align: left;">If you are leveraging Dataset API's compile time error check feature, the operator may not be able to run on GPU.<br /></div><div style="text-align: left;">Here is one easy example in spark-shell using scala:<br /></div><div style="text-align: left;"><h4><b>1. Create a sample Dataset</b></h4></div>
<pre class="brush:sql; toolbar: false; auto-links: false">import org.apache.spark.sql.Dataset<br />case class customer (<br /> c_customer_sk: Int,<br /> c_customer_id: String,<br /> c_current_cdemo_sk: Int,<br /> c_current_hdemo_sk: Int,<br /> c_current_addr_sk: Int<br />)<br /><br />val df=spark.sql("select c_customer_sk,c_customer_id,c_current_cdemo_sk,c_current_hdemo_sk,c_current_addr_sk from tpcds.customer limit 10")<br />val ds: Dataset[customer] = df.as[customer]</pre>
<div style="text-align: left;"><h4 style="text-align: left;">2. Working on GPU</h4></div>
<pre class="brush:xml; toolbar: false; auto-links: false;highlight: 4">scala> ds.filter($"c_customer_sk" > 0).explain<br />== Physical Plan ==<br />GpuColumnarToRow false<br />+- GpuFilter (gpuisnotnull(c_customer_sk#0) AND (c_customer_sk#0 > 0))<br /> +- GpuGlobalLimit 10<br /> +- GpuShuffleCoalesce 2147483647<br /> +- GpuColumnarExchange gpusinglepartitioning(), ENSURE_REQUIREMENTS, [id=#244]<br /> +- GpuLocalLimit 10<br /> +- GpuFileGpuScan parquet tpcds.customer[c_customer_sk#0,c_customer_id#1,c_current_cdemo_sk#2,c_current_hdemo_sk#3,c_current_addr_sk#4] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/home/xxx/data/tpcds_100G_parquet/customer], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c_customer_sk:int,c_customer_id:string,c_current_cdemo_sk:int,c_current_hdemo_sk:int,c_cur...</pre>
<div style="text-align: left;">Here we specify the exact column name and as you can see Filter is running on GPU.</div><div style="text-align: left;"><h3 style="text-align: left;">3. Not working on GPU</h3></div>
<pre class="brush:xml; toolbar: false; auto-links: false;highlight: 3">scala> ds.filter(_.c_customer_sk > 0).explain<br />== Physical Plan ==<br />*(1) Filter $line23.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$Lambda$3763/0x00000008415be040@75f92fdc.apply<br />+- GpuColumnarToRow false<br /> +- GpuGlobalLimit 10<br /> +- GpuShuffleCoalesce 2147483647<br /> +- GpuColumnarExchange gpusinglepartitioning(), ENSURE_REQUIREMENTS, [id=#201]<br /> +- GpuLocalLimit 10<br /> +- GpuFileGpuScan parquet tpcds.customer[c_customer_sk#0,c_customer_id#1,c_current_cdemo_sk#2,c_current_hdemo_sk#3,c_current_addr_sk#4] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/home/xxx/data/tpcds_100G_parquet/customer], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c_customer_sk:int,c_customer_id:string,c_current_cdemo_sk:int,c_current_hdemo_sk:int,c_cur...</pre>
<div style="text-align: left;">Here we are trying to access the column inside the typed JVM object at compile time, so the Filter can not run on GPU. </div><div style="text-align: left;">Above Filter is actually an opaque Lamda function in Catalyst plan. <br /></div><div style="text-align: left;">But other operators like FileScan is running on GPU.<br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;">If we set <b><i>spark.rapids.sql.explain</i></b>=NOT_ON_GPU we can see the reasons:<br /></div><pre class="brush:text; toolbar: false; auto-links: false">!Exec <FilterExec> cannot run on GPU because not all expressions can be replaced<br /> !NOT_FOUND <Invoke> $line18.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$Lambda$3146/0x000000084137e840@f053608.apply cannot run on GPU because no GPU enabled version of expression class org.apache.spark.sql.catalyst.expressions.objects.Invoke could be found<br /> !Expression <Literal> $line18.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$Lambda$3146/0x000000084137e840@f053608 cannot run on GPU because expression Literal $line18.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$Lambda$3146/0x000000084137e840@f053608 produces an unsupported type ObjectType(interface scala.Function1)<br /> !NOT_FOUND <NewInstance> newInstance(class $line15.$read$$iw$$iw$customer) cannot run on GPU because no GPU enabled version of expression class org.apache.spark.sql.catalyst.expressions.objects.NewInstance could be found<br /> !NOT_FOUND <AssertNotNull> assertnotnull(c_customer_sk#0) cannot run on GPU because no GPU enabled version of expression class org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull could be found<br /> @Expression <AttributeReference> c_customer_sk#0 could run on GPU<br /> !NOT_FOUND <Invoke> c_customer_id#1.toString cannot run on GPU because no GPU enabled version of expression class org.apache.spark.sql.catalyst.expressions.objects.Invoke could be found<br /> @Expression <AttributeReference> c_customer_id#1 could run on GPU<br /> !NOT_FOUND <AssertNotNull> assertnotnull(c_current_cdemo_sk#2) cannot run on GPU because no GPU enabled version of expression class org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull could be found<br /> @Expression <AttributeReference> c_current_cdemo_sk#2 could run on GPU<br /> !NOT_FOUND <AssertNotNull> assertnotnull(c_current_hdemo_sk#3) cannot run on GPU because no GPU enabled version of expression class org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull could be found<br /> @Expression <AttributeReference> c_current_hdemo_sk#3 could run on GPU<br /> !NOT_FOUND <AssertNotNull> assertnotnull(c_current_addr_sk#4) cannot run on GPU because no GPU enabled version of expression class org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull could be found<br /> @Expression <AttributeReference> c_current_addr_sk#4 could run on GPU</pre><h1 style="text-align: left;"><br /></h1>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-77093069713622073552021-03-17T22:49:00.005-07:002021-03-18T21:33:58.407-07:00Spark Tuning -- Adaptive Query Execution(2): Dynamically switching join strategies<h1 style="text-align: left;">Goal:</h1><p>This article explains Adaptive Query Execution (AQE)'s "Dynamically switching join strategies" feature introduced in Spark 3.0. 
</p><p>This is a follow up article for <a href="http://www.openkb.info/2021/03/spark-tuning-adaptive-query-execution1.html" target="_blank">Spark Tuning -- Adaptive Query Execution(1): Dynamically coalescing shuffle partitions</a>.<br /></p><span><a name='more'></a></span><h1 style="text-align: left;">Env:</h1><p>Spark 3.0.2 <br /></p><h1 style="text-align: left;">Concept: <br /></h1><p>This article focuses on 2nd feature "Dynamically switching join strategies" in AQE.</p><p>As <a href="https://spark.apache.org/docs/latest/sql-performance-tuning.html" rel="nofollow" target="_blank">Spark Performance Tuning guide</a> described: <br /></p><p>AQE converts sort-merge join to broadcast hash join when the runtime statistics of any join side is smaller than the broadcast hash join threshold. </p><p>This is not as efficient as planning a broadcast hash join in the first place, but it’s better than keep doing the sort-merge join, as we can save the sorting of both the join sides, and read shuffle files locally to save network traffic(if <b><i>spark.sql.adaptive.localShuffleReader.enabled</i></b> is true) <br /></p><h1 style="text-align: left;">Solution:</h1><p>As per databricks blog "<a href="https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html" rel="nofollow" target="_blank">Adaptive Query Execution: Speeding Up Spark SQL at Runtime</a>", it has a pretty good <a href="https://docs.databricks.com/_static/notebooks/aqe-demo.html?_ga=2.133851022.1405204434.1615827502-183867879.1614812672" rel="nofollow" target="_blank">demo notebook</a> which I will use for the following tests. <br /></p><h3 style="text-align: left;">1. AQE off (default)<br /></h3>
<pre class="brush:sql; toolbar: false; auto-links: false">EXPLAIN cost<br />SELECT s_date, sum(s_quantity * i_price) AS total_sales<br />FROM sales<br />JOIN items ON s_item_id = i_item_id<br />WHERE i_price < 10<br />GROUP BY s_date<br />ORDER BY total_sales DESC;</pre>
<p>The explain plan:<br /></p>
<pre class="brush:sql; toolbar: false; auto-links: false;highlight: [8,18]">== Optimized Logical Plan ==<br />Sort [total_sales#10L DESC NULLS LAST], true, Statistics(sizeInBytes=368.1 PiB)<br />+- Aggregate [s_date#18], [s_date#18, sum(cast((s_quantity#17 * i_price#20) as bigint)) AS total_sales#10L], Statistics(sizeInBytes=368.1 PiB)<br /> +- Project [s_quantity#17, s_date#18, i_price#20], Statistics(sizeInBytes=368.1 PiB)<br /> +- Join Inner, (cast(s_item_id#16 as bigint) = i_item_id#19L), Statistics(sizeInBytes=589.0 PiB)<br /> :- Filter isnotnull(s_item_id#16), Statistics(sizeInBytes=3.7 GiB)<br /> : +- Relation[s_item_id#16,s_quantity#17,s_date#18] parquet, Statistics(sizeInBytes=3.7 GiB)<br /> +- Filter ((isnotnull(i_price#20) AND (i_price#20 < 10)) AND isnotnull(i_item_id#19L)), Statistics(sizeInBytes=157.6 MiB)<br /> +- Relation[i_item_id#19L,i_price#20] parquet, Statistics(sizeInBytes=157.6 MiB)<br /><br />== Physical Plan ==<br />*(7) Sort [total_sales#10L DESC NULLS LAST], true, 0<br />+- Exchange rangepartitioning(total_sales#10L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#109]<br /> +- *(6) HashAggregate(keys=[s_date#18], functions=[sum(cast((s_quantity#17 * i_price#20) as bigint))], output=[s_date#18, total_sales#10L])<br /> +- Exchange hashpartitioning(s_date#18, 200), ENSURE_REQUIREMENTS, [id=#105]<br /> +- *(5) HashAggregate(keys=[s_date#18], functions=[partial_sum(cast((s_quantity#17 * i_price#20) as bigint))], output=[s_date#18, sum#24L])<br /> +- *(5) Project [s_quantity#17, s_date#18, i_price#20]<br /> +- *(5) SortMergeJoin [cast(s_item_id#16 as bigint)], [i_item_id#19L], Inner<br /> :- *(2) Sort [cast(s_item_id#16 as bigint) ASC NULLS FIRST], false, 0<br /> : +- Exchange hashpartitioning(cast(s_item_id#16 as bigint), 200), ENSURE_REQUIREMENTS, [id=#87]<br /> : +- *(1) Filter isnotnull(s_item_id#16)<br /> : +- *(1) ColumnarToRow<br /> : +- FileScan parquet aqe_demo_db.sales[s_item_id#16,s_quantity#17,s_date#18] Batched: true, DataFilters: [isnotnull(s_item_id#16)], Format: Parquet, Location: InMemoryFileIndex[file:/home/xxx/data/warehouse/aqe_demo_db.db/sales], PartitionFilters: [], PushedFilters: [IsNotNull(s_item_id)], ReadSchema: struct<s_item_id:int,s_quantity:int,s_date:date><br /> +- *(4) Sort [i_item_id#19L ASC NULLS FIRST], false, 0<br /> +- Exchange hashpartitioning(i_item_id#19L, 200), ENSURE_REQUIREMENTS, [id=#96]<br /> +- *(3) Filter ((isnotnull(i_price#20) AND (i_price#20 < 10)) AND isnotnull(i_item_id#19L))<br /> +- *(3) ColumnarToRow<br /> +- FileScan parquet aqe_demo_db.items[i_item_id#19L,i_price#20] Batched: true, DataFilters: [isnotnull(i_price#20), (i_price#20 < 10), isnotnull(i_item_id#19L)], Format: Parquet, Location: InMemoryFileIndex[file:/home/xxx/data/warehouse/aqe_demo_db.db/items], PartitionFilters: [], PushedFilters: [IsNotNull(i_price), LessThan(i_price,10), IsNotNull(i_item_id)], ReadSchema: struct<i_item_id:bigint,i_price:int></pre>
<p> From the "Optimized Logical Plan", the estimated size of smaller side table "items" after filter "i_price<10" is 157.6MB which is larger than the default <b><i>spark.sql.autoBroadcastJoinThreshold </i></b>(10MB). As a result, a Sort Merge Join(SMJ) is chosen.</p><p>When we check the Spark UI after the query finishes, we found out that the actual size of the "smaller" join side is only 6.9MB which means the estimation is not very accurate:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjn13E3OtSthhHswMTfJCJdUyuYNylyRw2T4rR529si5E7T_pbst9QfLFjNvlj49O8M1ELGKS1ytWTeRxDpbHGZsPLwFWoBwQwPy9y42yLKjAYSKs77QrNn9VMJlWCCekAKNv16JS3yz1M/s869/Screen+Shot+2021-03-17+at+10.08.30+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="819" data-original-width="869" height="604" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjn13E3OtSthhHswMTfJCJdUyuYNylyRw2T4rR529si5E7T_pbst9QfLFjNvlj49O8M1ELGKS1ytWTeRxDpbHGZsPLwFWoBwQwPy9y42yLKjAYSKs77QrNn9VMJlWCCekAKNv16JS3yz1M/w640-h604/Screen+Shot+2021-03-17+at+10.08.30+PM.png" width="640" /></a></div>As we know, normally the best performant join type is Broadcast Hash Join(BHJ) if one side is small enough to be broadcasted. <p></p><p>In this case, how can we let Spark be smart enough to change the plan to BHJ from SMJ at runtime? AQE is here to help us.<br /></p><h3 style="text-align: left;">2. AQE on<br /></h3>
<pre class="brush:sql; toolbar: false; auto-links: false">set spark.sql.adaptive.enabled=true;</pre>
<p>After AQE is turned on, the explain plan does not change much, except for the new "AdaptiveSparkPlan" node:<br /></p>
<pre class="brush:sql; toolbar: false; auto-links: false;highlight: [8,12,19]">== Optimized Logical Plan ==<br />Sort [total_sales#35L DESC NULLS LAST], true, Statistics(sizeInBytes=368.1 PiB)<br />+- Aggregate [s_date#18], [s_date#18, sum(cast((s_quantity#17 * i_price#20) as bigint)) AS total_sales#35L], Statistics(sizeInBytes=368.1 PiB)<br /> +- Project [s_quantity#17, s_date#18, i_price#20], Statistics(sizeInBytes=368.1 PiB)<br /> +- Join Inner, (cast(s_item_id#16 as bigint) = i_item_id#19L), Statistics(sizeInBytes=589.0 PiB)<br /> :- Filter isnotnull(s_item_id#16), Statistics(sizeInBytes=3.7 GiB)<br /> : +- Relation[s_item_id#16,s_quantity#17,s_date#18] parquet, Statistics(sizeInBytes=3.7 GiB)<br /> +- Filter ((isnotnull(i_price#20) AND (i_price#20 < 10)) AND isnotnull(i_item_id#19L)), Statistics(sizeInBytes=157.6 MiB)<br /> +- Relation[i_item_id#19L,i_price#20] parquet, Statistics(sizeInBytes=157.6 MiB)<br /><br />== Physical Plan ==<br />AdaptiveSparkPlan isFinalPlan=false<br />+- Sort [total_sales#35L DESC NULLS LAST], true, 0<br /> +- Exchange rangepartitioning(total_sales#35L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#177]<br /> +- HashAggregate(keys=[s_date#18], functions=[sum(cast((s_quantity#17 * i_price#20) as bigint))], output=[s_date#18, total_sales#35L])<br /> +- Exchange hashpartitioning(s_date#18, 200), ENSURE_REQUIREMENTS, [id=#174]<br /> +- HashAggregate(keys=[s_date#18], functions=[partial_sum(cast((s_quantity#17 * i_price#20) as bigint))], output=[s_date#18, sum#44L])<br /> +- Project [s_quantity#17, s_date#18, i_price#20]<br /> +- SortMergeJoin [cast(s_item_id#16 as bigint)], [i_item_id#19L], Inner<br /> :- Sort [cast(s_item_id#16 as bigint) ASC NULLS FIRST], false, 0<br /> : +- Exchange hashpartitioning(cast(s_item_id#16 as bigint), 200), ENSURE_REQUIREMENTS, [id=#166]<br /> : +- Filter isnotnull(s_item_id#16)<br /> : +- FileScan parquet aqe_demo_db.sales[s_item_id#16,s_quantity#17,s_date#18] Batched: true, DataFilters: [isnotnull(s_item_id#16)], Format: Parquet, Location: InMemoryFileIndex[file:/home/xxx/data/warehouse/aqe_demo_db.db/sales], PartitionFilters: [], PushedFilters: [IsNotNull(s_item_id)], ReadSchema: struct<s_item_id:int,s_quantity:int,s_date:date><br /> +- Sort [i_item_id#19L ASC NULLS FIRST], false, 0<br /> +- Exchange hashpartitioning(i_item_id#19L, 200), ENSURE_REQUIREMENTS, [id=#167]<br /> +- Filter ((isnotnull(i_price#20) AND (i_price#20 < 10)) AND isnotnull(i_item_id#19L))<br /> +- FileScan parquet aqe_demo_db.items[i_item_id#19L,i_price#20] Batched: true, DataFilters: [isnotnull(i_price#20), (i_price#20 < 10), isnotnull(i_item_id#19L)], Format: Parquet, Location: InMemoryFileIndex[file:/home/xxx/data/warehouse/aqe_demo_db.db/items], PartitionFilters: [], PushedFilters: [IsNotNull(i_price), LessThan(i_price,10), IsNotNull(i_item_id)], ReadSchema: struct<i_item_id:bigint,i_price:int></pre>
<p>When the query is running, if we check the UI, it initially still shows SMJ.</p><p>As I mentioned in this post <a href="http://www.openkb.info/2021/02/spark-tuning-explaining-spark-sql-join.html" target="_blank">Spark Tuning -- explaining Spark SQL Join Types</a>, SMJ actually has 3 steps -- shuffle, sort and merge. </p><p>So after the shuffle is done, Spark realizes that the smaller side of the join is actually 6.9MB, which is smaller than the default <b><i>spark.sql.autoBroadcastJoinThreshold </i></b>(10MB). As a result, AQE tells Spark to change the plan from SMJ to BHJ at runtime. </p><p>Since the shuffle is already done (otherwise, Spark won't know the real size of the smaller side), this is why the tuning guide says "This is not as efficient as planning a broadcast hash join in the first place".</p><p>But anyway, it avoids the remaining steps of SMJ -- sort and merge, so it should still be better than a complete SMJ. </p><p>Since the shuffle writes are already done and the remaining work is just a BHJ, Spark is smart enough to fetch the data from those shuffle files using a "local mode", because <b><i>spark.sql.adaptive.localShuffleReader.enabled</i></b> is true by default.</p><p>So from the UI, you will find extra "<span style="color: red;">CustomShuffleReader</span>"s in local mode, which avoids network traffic:<br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDMkvZHrxwdsFgqWdZIgAOKmnJq-fWnH2lClMMhUJHVYePcm7fB8CDt7yB13BepwJ0j-9gFRvfYx_lFQn6jUT_1e6gxClaoaz6N07EF6zNJDYa170im8AyF1Ee4KvXSD5-mtA0ourehlM/s756/Screen+Shot+2021-03-17+at+10.20.16+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="668" data-original-width="756" height="566" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDMkvZHrxwdsFgqWdZIgAOKmnJq-fWnH2lClMMhUJHVYePcm7fB8CDt7yB13BepwJ0j-9gFRvfYx_lFQn6jUT_1e6gxClaoaz6N07EF6zNJDYa170im8AyF1Ee4KvXSD5-mtA0ourehlM/w640-h566/Screen+Shot+2021-03-17+at+10.20.16+PM.png" width="640" /></a></div><br /><p></p><p>The graph below is from <a href="https://dev.to/yaooqinn/how-to-use-spark-adaptive-query-execution-aqe-in-kyuubi-2ek2" rel="nofollow" target="_blank">this blog</a> and explains this local shuffle:<br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh722rCbUWZGIhOsr_XW-IOoQg0jfryKH-_GtR8FH_NybBs-Jif3C8erKh-Vvhyphenhyphena0b96wVUJtdIY5bamikoUjUEJtDCvWK9WCsDBPHzO3wKQ_wE_mG8x95Ya6_aRrMQs80suwYNLfuAOKY/s880/a4lh5rl2xtbi6q55ba1o.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="591" data-original-width="880" height="430" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh722rCbUWZGIhOsr_XW-IOoQg0jfryKH-_GtR8FH_NybBs-Jif3C8erKh-Vvhyphenhyphena0b96wVUJtdIY5bamikoUjUEJtDCvWK9WCsDBPHzO3wKQ_wE_mG8x95Ya6_aRrMQs80suwYNLfuAOKY/w640-h430/a4lh5rl2xtbi6q55ba1o.png" width="640" /></a></div>Note also that the # of partitions from the local shuffle reads = the # of upstream map tasks. <p></p><p style="text-align: left;">In our case, it is 30 and 4. (I will compare these numbers with the next test.)<br /></p><h3 style="text-align: left;">3. AQE on but local shuffle reader is disabled </h3>
<pre class="brush:sql; toolbar: false; auto-links: false">set spark.sql.adaptive.enabled=true;<br />set spark.sql.adaptive.localShuffleReader.enabled=false;</pre>
<p style="text-align: left;"></p><p>This is for testing purpose, and we should not disable local shuffle reader as always.<br /></p><p>The reason why I disable it is to show the shuffle reader statistics differences comparing to #2:<br /></p><p> </p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfXTDvm-sQHhizJ5N1FyX7JAWc47i6-SwCpL7THbcs5TjOTSS96Kc8-QR_3HTPwOExNVFnsgP3KQqJQsysNHGHJ50IIq17OQb3uEgJ81sv5G3kNxL8rF8s5EnuMvrjvI-w7UY1zEST8Zs/s804/Screen+Shot+2021-03-17+at+10.28.36+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="643" data-original-width="804" height="512" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfXTDvm-sQHhizJ5N1FyX7JAWc47i6-SwCpL7THbcs5TjOTSS96Kc8-QR_3HTPwOExNVFnsgP3KQqJQsysNHGHJ50IIq17OQb3uEgJ81sv5G3kNxL8rF8s5EnuMvrjvI-w7UY1zEST8Zs/w640-h512/Screen+Shot+2021-03-17+at+10.28.36+PM.png" width="640" /></a></div>Now it is shown as "<span style="color: red;">CustomShuffleReader coalesced</span>".<p></p><p>And also the # of partition changed to 52 and 5 from 30 and 4.<br /></p><h3 style="text-align: left;">4. GPU Mode with AQE on <br /></h3><p>Now let's try the same minimum query using <a href="https://nvidia.github.io/spark-rapids/" rel="nofollow" target="_blank">Rapids for Spark Accelerator</a>(current release 0.3) + Spark to see what is the query plan under GPU.</p><p>Explain plan output looks as CPU plan, but do not worry, the actual plan is GPU plan:<br /></p>
<pre class="brush:sql; toolbar: false; auto-links: false;highlight: [8,12,19]">== Optimized Logical Plan ==<br />Sort [total_sales#20L DESC NULLS LAST], true, Statistics(sizeInBytes=368.1 PiB)<br />+- Aggregate [s_date#28], [s_date#28, sum(cast((s_quantity#27 * i_price#30) as bigint)) AS total_sales#20L], Statistics(sizeInBytes=368.1 PiB)<br /> +- Project [s_quantity#27, s_date#28, i_price#30], Statistics(sizeInBytes=368.1 PiB)<br /> +- Join Inner, (cast(s_item_id#26 as bigint) = i_item_id#29L), Statistics(sizeInBytes=589.0 PiB)<br /> :- Filter isnotnull(s_item_id#26), Statistics(sizeInBytes=3.7 GiB)<br /> : +- Relation[s_item_id#26,s_quantity#27,s_date#28] parquet, Statistics(sizeInBytes=3.7 GiB)<br /> +- Filter ((isnotnull(i_price#30) AND (i_price#30 < 10)) AND isnotnull(i_item_id#29L)), Statistics(sizeInBytes=157.6 MiB)<br /> +- Relation[i_item_id#29L,i_price#30] parquet, Statistics(sizeInBytes=157.6 MiB)<br /><br />== Physical Plan ==<br />AdaptiveSparkPlan isFinalPlan=false<br />+- Sort [total_sales#20L DESC NULLS LAST], true, 0<br /> +- Exchange rangepartitioning(total_sales#20L DESC NULLS LAST, 2), ENSURE_REQUIREMENTS, [id=#73]<br /> +- HashAggregate(keys=[s_date#28], functions=[sum(cast((s_quantity#27 * i_price#30) as bigint))], output=[s_date#28, total_sales#20L])<br /> +- Exchange hashpartitioning(s_date#28, 2), ENSURE_REQUIREMENTS, [id=#70]<br /> +- HashAggregate(keys=[s_date#28], functions=[partial_sum(cast((s_quantity#27 * i_price#30) as bigint))], output=[s_date#28, sum#34L])<br /> +- Project [s_quantity#27, s_date#28, i_price#30]<br /> +- SortMergeJoin [cast(s_item_id#26 as bigint)], [i_item_id#29L], Inner<br /> :- Sort [cast(s_item_id#26 as bigint) ASC NULLS FIRST], false, 0<br /> : +- Exchange hashpartitioning(cast(s_item_id#26 as bigint), 2), ENSURE_REQUIREMENTS, [id=#62]<br /> : +- Filter isnotnull(s_item_id#26)<br /> : +- FileScan parquet aqe_demo_db.sales[s_item_id#26,s_quantity#27,s_date#28] Batched: true, DataFilters: [isnotnull(s_item_id#26)], Format: Parquet, Location: InMemoryFileIndex[file:/home/xxx/data/warehouse/aqe_demo_db.db/sales], PartitionFilters: [], PushedFilters: [IsNotNull(s_item_id)], ReadSchema: struct<s_item_id:int,s_quantity:int,s_date:date><br /> +- Sort [i_item_id#29L ASC NULLS FIRST], false, 0<br /> +- Exchange hashpartitioning(i_item_id#29L, 2), ENSURE_REQUIREMENTS, [id=#63]<br /> +- Filter ((isnotnull(i_price#30) AND (i_price#30 < 10)) AND isnotnull(i_item_id#29L))<br /> +- FileScan parquet aqe_demo_db.items[i_item_id#29L,i_price#30] Batched: true, DataFilters: [isnotnull(i_price#30), (i_price#30 < 10), isnotnull(i_item_id#29L)], Format: Parquet, Location: InMemoryFileIndex[file:/home/xxx/data/warehouse/aqe_demo_db.db/items], PartitionFilters: [], PushedFilters: [IsNotNull(i_price), LessThan(i_price,10), IsNotNull(i_item_id)], ReadSchema: struct<i_item_id:bigint,i_price:int></pre>
<p>If we actually run this query, here is the actual final plan shown in UI:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTW9yyzPeDpgGUWyIiSjN8lbIGyAoIQcbHe6JMfeW4xNYf47GlIqz7rTflbioiMxpXhQHSvgvjLXu0f1dBDDF80lRsy5pXItERJ_tZkNxv-KkiOG-IULqUGqolhivdjRmyvW_H1g6Dm3Y/s1048/Screen+Shot+2021-03-17+at+10.35.52+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1048" data-original-width="797" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTW9yyzPeDpgGUWyIiSjN8lbIGyAoIQcbHe6JMfeW4xNYf47GlIqz7rTflbioiMxpXhQHSvgvjLXu0f1dBDDF80lRsy5pXItERJ_tZkNxv-KkiOG-IULqUGqolhivdjRmyvW_H1g6Dm3Y/w486-h640/Screen+Shot+2021-03-17+at+10.35.52+PM.png" width="486" /></a></div>The key things to look at here is the "<span style="color: red;">GpuCustomShuffleReader local</span>" and also the # of local shuffle partitions = 30 and 4 which matches the # of upstream map tasks. <p>Note that in GPU mode, all the data size are smaller than CPU mode.</p><p>For example, now the smaller side of join in GPU mode is only 3.4MB now:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhjJhCm0luPZPiDIUCmirTWakyWDiBSodi0YfRx5BtbImiNdCPBhZfNxWCnxkHLlkRD-sK4uQbHzp3oNXpkG4_DZd_Ba2uyxEdOptBp6-buetlWwdNaUr9rlQXija1y6JM5TLslHPOmbWI/s390/Screen+Shot+2021-03-17+at+10.44.52+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="381" data-original-width="390" height="626" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhjJhCm0luPZPiDIUCmirTWakyWDiBSodi0YfRx5BtbImiNdCPBhZfNxWCnxkHLlkRD-sK4uQbHzp3oNXpkG4_DZd_Ba2uyxEdOptBp6-buetlWwdNaUr9rlQXija1y6JM5TLslHPOmbWI/w640-h626/Screen+Shot+2021-03-17+at+10.44.52+PM.png" width="640" /></a></div><br /><p>It means, we can even set <b><i>spark.sql.autoBroadcastJoinThreshold</i></b>=4194304(4MB), it will still be converted to a BHJ under AQE.</p><p>And the shuffle writes/reads size are also smaller than CPU mode. 
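<p>One more note: besides the UI, you can also confirm from spark-shell whether AQE switched the join strategy. The snippet below is only a sketch of my experience with Spark 3.0.x (it assumes the demo tables are in the current database): with AQE on, explain() before execution still shows isFinalPlan=false with a SortMergeJoin, while explain() on the same Dataset after it has actually run prints the re-optimized final plan:</p>
<pre class="brush:scala; toolbar: false; auto-links: false">// Sketch: compare the initial adaptive plan with the final plan after execution.
val q = spark.sql("""
  SELECT s_date, sum(s_quantity * i_price) AS total_sales
  FROM sales JOIN items ON s_item_id = i_item_id
  WHERE i_price < 10
  GROUP BY s_date
  ORDER BY total_sales DESC""")

q.explain()   // AdaptiveSparkPlan isFinalPlan=false ... SortMergeJoin
q.collect()   // run the query so AQE can re-plan using runtime statistics
q.explain()   // AdaptiveSparkPlan isFinalPlan=true ... should now show BroadcastHashJoin</pre>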
<br /></p><h1 style="text-align: left;">Reference:</h1><ul><li><a href="https://docs.databricks.com/_static/notebooks/aqe-demo.html?_ga=2.133851022.1405204434.1615827502-183867879.1614812672" rel="nofollow" target="_blank">https://docs.databricks.com/_static/notebooks/aqe-demo.html?_ga=2.133851022.1405204434.1615827502-183867879.1614812672</a> </li><li><a href="https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html" rel="nofollow" target="_blank">https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html</a> </li><li><a href="https://spark.apache.org/docs/latest/sql-performance-tuning.html" rel="nofollow" target="_blank">https://spark.apache.org/docs/latest/sql-performance-tuning.html</a> </li><li><a href="https://dev.to/yaooqinn/how-to-use-spark-adaptive-query-execution-aqe-in-kyuubi-2ek2" rel="nofollow" target="_blank">https://dev.to/yaooqinn/how-to-use-spark-adaptive-query-execution-aqe-in-kyuubi-2ek2 </a><br /></li></ul><p> <br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-83918258805313539292021-03-16T15:15:00.003-07:002021-03-18T22:53:01.227-07:00Spark Tuning -- Adaptive Query Execution(1): Dynamically coalescing shuffle partitions<h1 style="text-align: left;">Goal:</h1><p>This article explains Adaptive Query Execution (AQE)'s "Dynamically coalescing shuffle partitions" feature introduced in Spark 3.0.</p><p><span></span></p><a name='more'></a><p></p><h1 style="text-align: left;">Env:</h1><p> Spark 3.0.2<br /></p><h1 style="text-align: left;">Concept:</h1><p>Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default.Spark SQL can use the umbrella configuration of <b><i>spark.sql.adaptive.enabled</i></b> to control whether turn it on/off. </p><p>In AQE on Spark 3.0, there are 3 features as below:<br /></p><ul style="text-align: left;"><li>Dynamically coalescing shuffle partitions</li><li><a href="http://www.openkb.info/2021/03/spark-tuning-adaptive-query-execution2.html" target="_blank">Dynamically switching join strategies</a></li><li><a href="http://www.openkb.info/2021/03/spark-tuning-adaptive-query-execution1_18.html" target="_blank">Dynamically optimizing skew joins </a><br /></li></ul><p>This article focuses on 1st feature "Dynamically coalescing shuffle partitions". <br /></p><p>This feature coalesces the post shuffle partitions based on the map output statistics when both <b><i>spark.sql.adaptive.enabled</i></b> and <b><i>spark.sql.adaptive.coalescePartitions.enabled</i></b> configurations are true. </p><p>In below test, we will change <b><i>spark.sql.adaptive.coalescePartitions.minPartitionNum</i></b> to 1 which controls the minimum number of shuffle partitions after coalescing. 
If we do not decrease it, its default value is the same as <b><i>spark.sql.shuffle.partitions</i></b> (which is 200 by default).</p><p>Another important setting is <b><i>spark.sql.adaptive.advisoryPartitionSizeInBytes</i></b> (default 64MB) which controls the advisory size in bytes of the shuffle partition during adaptive optimization.</p><p>Please refer to <a href="https://spark.apache.org/docs/latest/sql-performance-tuning.html" rel="nofollow" target="_blank">Spark Performance Tuning guide</a> for details on all other related parameters.<br /></p><h1 style="text-align: left;">Solution:</h1><p>The databricks blog "<a href="https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html" rel="nofollow" target="_blank">Adaptive Query Execution: Speeding Up Spark SQL at Runtime</a>" has a pretty good <a href="https://docs.databricks.com/_static/notebooks/aqe-demo.html?_ga=2.133851022.1405204434.1615827502-183867879.1614812672" rel="nofollow" target="_blank">demo notebook</a> which I will use for the following tests.</p><p>I will run the simple group-by query below, based on the tables created per the demo instructions above, in different modes:</p>
<pre class="brush:sql; toolbar: false; auto-links: false">use aqe_demo_db;<br /><br />SELECT s_date, sum(s_quantity) AS q<br />FROM sales<br />GROUP BY s_date<br />ORDER BY q DESC;<br /></pre>
<h3 style="text-align: left;">1. Default settings without AQE<br /></h3><p>Explain plan:</p>
<pre class="brush:sql; toolbar: false; auto-links: false">*(3) Sort [q#10L DESC NULLS LAST], true, 0<br />+- Exchange rangepartitioning(q#10L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#79]<br /> +- *(2) HashAggregate(keys=[s_date#3], functions=[sum(cast(s_quantity#2 as bigint))], output=[s_date#3, q#10L])<br /> +- Exchange hashpartitioning(s_date#3, 200), ENSURE_REQUIREMENTS, [id=#75]<br /> +- *(1) HashAggregate(keys=[s_date#3], functions=[partial_sum(cast(s_quantity#2 as bigint))], output=[s_date#3, sum#19L])<br /> +- *(1) ColumnarToRow<br /> +- FileScan parquet aqe_demo_db.sales[s_quantity#2,s_date#3] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/home/xxx/data/warehouse/aqe_demo_db.db/sales], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<s_quantity:int,s_date:date></pre>Let's focus on the 1st pair of HashAggregate and Exchange in which we can examine the shuffle read and shuffle write size for each task. <br /><p>As per UI:<br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCd-99Mj2a9q9snDbF6iFoqgIpDxMGg4DuDG0zPNnlTtpVUEkPJvLXrsdtnuxIzCAzMqqy2qrjDYudgtE7Wo7hA54BiJ2x-jtImgq8ORQE1lTDEp8YBLp6A6PeDMSeK58EgUjJBRDK2ro/s794/Screen+Shot+2021-03-16+at+1.14.27+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="644" data-original-width="794" height="520" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCd-99Mj2a9q9snDbF6iFoqgIpDxMGg4DuDG0zPNnlTtpVUEkPJvLXrsdtnuxIzCAzMqqy2qrjDYudgtE7Wo7hA54BiJ2x-jtImgq8ORQE1lTDEp8YBLp6A6PeDMSeK58EgUjJBRDK2ro/w640-h520/Screen+Shot+2021-03-16+at+1.14.27+PM.png" width="640" /></a></div><p></p><p>The shuffle writes per task is around 13KB which is too small for each task to process after that. 
<br /></p><p>Let's look at stage level metrics for stage 0 and stage 1 as per above UI.<br /></p><p>Stage 0's Shuffle Write Size: Avg 12.9KB , 30 tasks<br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtSD-tXDHa1pjGq5i4bVuAVPmg1RuDehkALXoWqaY1BSEbggZ6VQdy54oD7m_LwyiQv-1Xu_QLVFPFlbdSrCN-KtOlIHIBL6oKJbtnkeLUDs5CFNMkzxGLRSPeczHy7Toov_L7oIADCjs/s3502/Screen+Shot+2021-03-16+at+1.17.04+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="422" data-original-width="3502" height="77" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtSD-tXDHa1pjGq5i4bVuAVPmg1RuDehkALXoWqaY1BSEbggZ6VQdy54oD7m_LwyiQv-1Xu_QLVFPFlbdSrCN-KtOlIHIBL6oKJbtnkeLUDs5CFNMkzxGLRSPeczHy7Toov_L7oIADCjs/w640-h77/Screen+Shot+2021-03-16+at+1.17.04+PM.png" width="640" /></a></div><br /> Stage 1's Shuffle Read Size: Avg 2.3KB, 200 tasks<br /><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhc_jnICauTrckIudRIXuOo31SwVvQe8fqQCdJPbs85b2HXU1HdWazUQdCCkCjLEXldWpflY3El8pUyOaZeu6p0uCle5jpngpkQNaSep1UswoTr_OAKTvrNJ1ytJtv7buBFJV9H7Zy6uhI/s3486/Screen+Shot+2021-03-16+at+1.19.50+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="358" data-original-width="3486" height="66" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhc_jnICauTrckIudRIXuOo31SwVvQe8fqQCdJPbs85b2HXU1HdWazUQdCCkCjLEXldWpflY3El8pUyOaZeu6p0uCle5jpngpkQNaSep1UswoTr_OAKTvrNJ1ytJtv7buBFJV9H7Zy6uhI/w640-h66/Screen+Shot+2021-03-16+at+1.19.50+PM.png" width="640" /></a></div><div style="text-align: left;">Here is the final plan from UI(for comparison later):</div><pre class="brush:sql; toolbar: false; auto-links: false">== Physical Plan ==
* Sort (7)
+- Exchange (6)
+- * HashAggregate (5)
+- Exchange (4)
+- * HashAggregate (3)
+- * ColumnarToRow (2)
+- Scan parquet aqe_demo_db.sales (1)</pre><h3 style="text-align: left;">2. Default settings with AQE on<br /></h3>
<pre class="brush:sql; toolbar: false; auto-links: false">set spark.sql.adaptive.enabled = true;<br />set spark.sql.adaptive.coalescePartitions.minPartitionNum = 1;</pre>
<p>Explain plan:</p>
<pre class="brush:sql; toolbar: false; auto-links: false;highlight: 2">== Physical Plan ==<br />AdaptiveSparkPlan isFinalPlan=false<br />+- Sort [q#34L DESC NULLS LAST], true, 0<br /> +- Exchange rangepartitioning(q#34L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#119]<br /> +- HashAggregate(keys=[s_date#23], functions=[sum(cast(s_quantity#22 as bigint))], output=[s_date#23, q#34L])<br /> +- Exchange hashpartitioning(s_date#23, 200), ENSURE_REQUIREMENTS, [id=#116]<br /> +- HashAggregate(keys=[s_date#23], functions=[partial_sum(cast(s_quantity#22 as bigint))], output=[s_date#23, sum#43L])<br /> +- FileScan parquet aqe_demo_db.sales[s_quantity#22,s_date#23] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/home/xxx/data/warehouse/aqe_demo_db.db/sales], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<s_quantity:int,s_date:date></pre>
<p>Notice the keyword "<span style="color: red;">AdaptiveSparkPlan</span>"; but as it indicates, this is not the final plan yet.<br /></p><p>Let's focus on the 1st pair of HashAggregate and Exchange in which we
can examine the shuffle read and shuffle write size for each task. </p><p>As per UI:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhuY_fxFLfR3zJY7MXUTR7cfRtpX2Igu-rqmPIR5_0l9l3p2jobgYUhhYLZ7tFqwSFhZWbLTV8ZQiToAmqvsB04FBRVxf3zTJVdLjvAbGT8qI9g4iRs1aOFQDXfrroRaDJXoAkCSJQ5xc8/s956/Screen+Shot+2021-03-16+at+2.38.52+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="956" data-original-width="930" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhuY_fxFLfR3zJY7MXUTR7cfRtpX2Igu-rqmPIR5_0l9l3p2jobgYUhhYLZ7tFqwSFhZWbLTV8ZQiToAmqvsB04FBRVxf3zTJVdLjvAbGT8qI9g4iRs1aOFQDXfrroRaDJXoAkCSJQ5xc8/w622-h640/Screen+Shot+2021-03-16+at+2.38.52+PM.png" width="622" /></a></div>Now there is an extra "<span style="color: red;">CustomShuffleReader</span>" operator which coalesces the partitions to only 1 because the total partition data size is only 400KB.<p></p><p>Let's look at stage level metrics for stage 0 and stage 2 as per above UI.</p><p>Stage 0's Shuffle Write Size: Avg 12.9KB , 30 tasks(no change)</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigE3qKX2ttH6cUbUX9cS4OQcDS5QtO45NrrqC5O9Eqh8NtID5LvVS3bbbUYyqgLZ1FnQfZOC1loVH7y8v-PA9mWPBmIDv24_n_RIWAaudYjyB9fLXPnw0kKj8zVm0LcnQIRkXfo1_mXCE/s3486/Screen+Shot+2021-03-16+at+2.43.04+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="424" data-original-width="3486" height="78" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigE3qKX2ttH6cUbUX9cS4OQcDS5QtO45NrrqC5O9Eqh8NtID5LvVS3bbbUYyqgLZ1FnQfZOC1loVH7y8v-PA9mWPBmIDv24_n_RIWAaudYjyB9fLXPnw0kKj8zVm0LcnQIRkXfo1_mXCE/w640-h78/Screen+Shot+2021-03-16+at+2.43.04+PM.png" width="640" /></a></div> Stage 2's Shuffle Read Size: 386.6KB, 1 task<br /> <div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjw8PZFxTLwiaa6z7i2tcBl3pFVOu7xdaRWzj4eOfaec7lV26NPQRbi_k8Cd4bFTBdtsAJXFlF27-bxN5xYsAM2j4kzOx-EWk8EBadG-Xzl8itaT_4wx2mloo-j8ZfLFB-jDDjlCBRJPHs/s3490/Screen+Shot+2021-03-16+at+2.44.35+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="366" data-original-width="3490" height="68" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjw8PZFxTLwiaa6z7i2tcBl3pFVOu7xdaRWzj4eOfaec7lV26NPQRbi_k8Cd4bFTBdtsAJXFlF27-bxN5xYsAM2j4kzOx-EWk8EBadG-Xzl8itaT_4wx2mloo-j8ZfLFB-jDDjlCBRJPHs/w640-h68/Screen+Shot+2021-03-16+at+2.44.35+PM.png" width="640" /></a></div><p>So basically AQE combines all of the 200 partitions into 1.</p><p>Here is the final plan from UI which shows as below which you can find "<span style="color: red;">CustomShuffleReader</span>" keywords.<br /></p><pre class="brush:sql; toolbar: false; auto-links: false;highlight: [2,5,9]">== Physical Plan ==<br />AdaptiveSparkPlan (12)<br />+- == Final Plan ==<br /> * Sort (11)<br /> +- CustomShuffleReader (10)<br /> +- ShuffleQueryStage (9)<br /> +- Exchange (8)<br /> +- * HashAggregate (7)<br /> +- CustomShuffleReader (6)<br /> +- ShuffleQueryStage (5)<br /> +- Exchange (4)<br /> +- * HashAggregate (3)<br /> +- * ColumnarToRow (2)<br /> +- Scan parquet aqe_demo_db.sales (1)</pre><p></p><h3 style="text-align: left;">3. Modified settings with AQE on</h3>
<pre class="brush:sql; toolbar: false; auto-links: false;highlight: 3">set spark.sql.adaptive.enabled = true;<br />set spark.sql.adaptive.coalescePartitions.minPartitionNum = 1;<br />set spark.sql.adaptive.advisoryPartitionSizeInBytes = 65536;</pre>
<p>Here we just changed <b><i>spark.sql.adaptive.advisoryPartitionSizeInBytes</i></b> from default 64MB to 64KB, so that we can tune the target # of partitions.<br /></p><p>The explain plan is the same as #2. </p><p>The only difference is the # of partitions becomes 7 in Stage 2 now:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieAP3B2bl76XxZO8p4Sox4uua5kgPG-KsVsugUmNwZByZahnBs6c3GgIzuW8KaEKN2Z4Q2L2s53lkE-pmN-Td9GV_eLNZj4r5YFLg-z2eAVOhhsiIVcwVtpeNA7VpHH2FZzQp4kOT3MVE/s3464/Screen+Shot+2021-03-16+at+2.52.58+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="364" data-original-width="3464" height="68" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieAP3B2bl76XxZO8p4Sox4uua5kgPG-KsVsugUmNwZByZahnBs6c3GgIzuW8KaEKN2Z4Q2L2s53lkE-pmN-Td9GV_eLNZj4r5YFLg-z2eAVOhhsiIVcwVtpeNA7VpHH2FZzQp4kOT3MVE/w640-h68/Screen+Shot+2021-03-16+at+2.52.58+PM.png" width="640" /></a></div><h3 style="text-align: left;">4. GPU Mode with AQE on(default settings)<br /></h3><p>Now let's try the same minimum query using <a href="https://nvidia.github.io/spark-rapids/" rel="nofollow" target="_blank">Rapids for Spark Accelerator</a>(current release 0.3) + Spark to see what is the query plan under GPU:</p><p>The explain plan may look as normal CPU plan because AQE is on, but actually if you run it, it will show you the correct final plan.</p><p>Explain plan:</p>
<pre class="brush:sql; toolbar: false; auto-links: false">== Physical Plan ==<br />AdaptiveSparkPlan isFinalPlan=false<br />+- Sort [q#20L DESC NULLS LAST], true, 0<br /> +- Exchange rangepartitioning(q#20L DESC NULLS LAST, 2), ENSURE_REQUIREMENTS, [id=#39]<br /> +- HashAggregate(keys=[s_date#28], functions=[sum(cast(s_quantity#27 as bigint))], output=[s_date#28, q#20L])<br /> +- Exchange hashpartitioning(s_date#28, 2), ENSURE_REQUIREMENTS, [id=#36]<br /> +- HashAggregate(keys=[s_date#28], functions=[partial_sum(cast(s_quantity#27 as bigint))], output=[s_date#28, sum#32L])<br /> +- FileScan parquet aqe_demo_db.sales[s_quantity#27,s_date#28] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/home/xxx/data/warehouse/aqe_demo_db.db/sales], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<s_quantity:int,s_date:date></pre>
<p>Final Plan from UI:<br /></p>
<pre class="brush:sql; toolbar: false; auto-links: false;highlight: [2,8,13]">== Physical Plan ==
AdaptiveSparkPlan (15)
+- == Final Plan ==
GpuColumnarToRow (14)
+- GpuSort (13)
+- GpuCoalesceBatches (12)
+- GpuShuffleCoalesce (11)
+- GpuCustomShuffleReader (10)
+- ShuffleQueryStage (9)
+- GpuColumnarExchange (8)
+- GpuHashAggregate (7)
+- GpuShuffleCoalesce (6)
+- GpuCustomShuffleReader (5)
+- ShuffleQueryStage (4)
+- GpuColumnarExchange (3)
+- GpuHashAggregate (2)
+- GpuScan parquet aqe_demo_db.sales (1)</pre>
<p>Stage 0's Shuffle Write Size: Avg 3.2KB , 30 tasks(huge decrease due to columnar storage processing)</p><p> </p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-cMzqJpecJb7gW4UyGeBGn3HK6HmBrQaLYL3hXbULBEpfc7N2s4nwUO9VN1hfCyBpnPHjDyX01cibrkAbxeYHnVkBhUKZTJraHllavMbShWBw2aPkNSBpfbrPchMS0NWquGd8lYZE9oA/s3510/Screen+Shot+2021-03-16+at+3.08.43+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="422" data-original-width="3510" height="76" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-cMzqJpecJb7gW4UyGeBGn3HK6HmBrQaLYL3hXbULBEpfc7N2s4nwUO9VN1hfCyBpnPHjDyX01cibrkAbxeYHnVkBhUKZTJraHllavMbShWBw2aPkNSBpfbrPchMS0NWquGd8lYZE9oA/w640-h76/Screen+Shot+2021-03-16+at+3.08.43+PM.png" width="640" /></a></div> Stage 2's Shuffle Read Size: 97.5KB, 1 task<p></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrnnJCcF6MN3o9gR5tzi6J_pXhgB95hRa47C5p4rlMOjDHXlasEbd4a2vDpV9mmDjUJOPkEOsE5xGLK9s14Gips1Ojm4cXTiWQLmchCDOcBC6pbX_iTfnxUTh-OsiubBGrT7X4OgYuHy4/s3508/Screen+Shot+2021-03-16+at+3.10.12+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="370" data-original-width="3508" height="68" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrnnJCcF6MN3o9gR5tzi6J_pXhgB95hRa47C5p4rlMOjDHXlasEbd4a2vDpV9mmDjUJOPkEOsE5xGLK9s14Gips1Ojm4cXTiWQLmchCDOcBC6pbX_iTfnxUTh-OsiubBGrT7X4OgYuHy4/w640-h68/Screen+Shot+2021-03-16+at+3.10.12+PM.png" width="640" /></a></div>Basically GPU mode can produce much less shuffle files which result in much less shuffle writes and reads.<br /><p></p><h1 style="text-align: left;">References:</h1><ul style="text-align: left;"><li><a href="https://docs.databricks.com/_static/notebooks/aqe-demo.html?_ga=2.133851022.1405204434.1615827502-183867879.1614812672" rel="nofollow" target="_blank">https://docs.databricks.com/_static/notebooks/aqe-demo.html?_ga=2.133851022.1405204434.1615827502-183867879.1614812672</a> </li><li><a href="https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html" rel="nofollow" target="_blank">https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html</a> </li><li><a href="https://spark.apache.org/docs/latest/sql-performance-tuning.html" rel="nofollow" target="_blank">https://spark.apache.org/docs/latest/sql-performance-tuning.html</a><br /></li></ul><p> </p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-61697148615293038112021-03-15T17:36:00.009-07:002021-03-15T22:11:34.866-07:00Spark Tuning -- Dynamic Partition Pruning<h1 style="text-align: left;">Goal:</h1><p>This article explains Dynamic Partition Pruning (DPP) feature introduced in Spark 3.0.<span></span></p><a name='more'></a><p></p><h1 style="text-align: left;">Env:</h1><p>Spark 3.0.2<br /></p><h1 style="text-align: left;">Concept:</h1><p>Dynamic Partition Pruning feature is introduced by <a href="https://issues.apache.org/jira/browse/SPARK-11150" rel="nofollow" target="_blank">SPARK-11150</a> .</p><p>This JIRA also provides a minimal query and its design for example:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a 
href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0NAhY-hZry4cJ4KdMX55Ema09OR1h3YXba7IV0cjMubKsqF-1jz9Huh3ZkuatyUe0mvDtrB-cOwmU3SweS6YXM4xn4jjAKdKhkk_g88-xHbhSZxf1mGhwZC9YpOJhYQSe4wd48Og7TLI/s1080/Screen+Shot+2021-03-15+at+3.47.58+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="498" data-original-width="1080" height="296" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0NAhY-hZry4cJ4KdMX55Ema09OR1h3YXba7IV0cjMubKsqF-1jz9Huh3ZkuatyUe0mvDtrB-cOwmU3SweS6YXM4xn4jjAKdKhkk_g88-xHbhSZxf1mGhwZC9YpOJhYQSe4wd48Og7TLI/w640-h296/Screen+Shot+2021-03-15+at+3.47.58+PM.png" width="640" /></a></div>Here let's assume: "t1" is a very large fact table with partition key column "pKey", and "t2" is a small dimension table. <p></p><p>Since there is a filter on "t2" -- "t2.id < 2", internally DPP can create a subquery: <br /></p>
<pre class="brush:sql; toolbar: false; auto-links: false">SELECT t2.pKey FROM t2 WHERE t2.id < 2;</pre>
<p>and then broadcast this sub-query result, so that we can use it to prune partitions of "t1". </p><p>In the meantime, the sub-query result is re-used. See the graph below from these <a href="https://www.slideshare.net/databricks/dynamic-partition-pruning-in-apache-spark" rel="nofollow" target="_blank">slides from Databricks</a>:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhptcchW8meTpL0UcN_9LFm6ymnz5WaS5KsZZYcO-ZRM5fv7BRk7mH3Lo_UMuD6o3kH3RptxDgcrbeaI7jPa2rm4GwQmcicKLPxaI2BfC1cCagek5eb8GUO6i84cZNFxKa7y_9n6zQOmNs/s2048/Screen+Shot+2021-03-15+at+4.09.04+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1442" data-original-width="2048" height="450" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhptcchW8meTpL0UcN_9LFm6ymnz5WaS5KsZZYcO-ZRM5fv7BRk7mH3Lo_UMuD6o3kH3RptxDgcrbeaI7jPa2rm4GwQmcicKLPxaI2BfC1cCagek5eb8GUO6i84cZNFxKa7y_9n6zQOmNs/w640-h450/Screen+Shot+2021-03-15+at+4.09.04+PM.png" width="640" /></a></div><p>As a result, we can avoid a lot of table scanning on the fact table side, which brings a huge performance gain.</p><p>The parameter to enable or disable DPP is:</p><ul style="text-align: left;"><li><b><i>spark.sql.optimizer.dynamicPartitionPruning.enabled </i></b>(true by default)<br /></li></ul><p>Spark is not the only product using DPP; some other query engines such as <a href="https://docs.cloudera.com/runtime/7.2.7/impala-reference/topics/impala-partition-pruning.html" rel="nofollow" target="_blank">Impala</a> and <a href="https://issues.apache.org/jira/browse/HIVE-7826" rel="nofollow" target="_blank">Hive on Tez</a> also have this feature.</p><h1 style="text-align: left;">Solution:</h1><h3 style="text-align: left;">1. CPU mode <br /></h3><p>Here is a simple example (run in spark-shell) which can help us check whether DPP is used or not:<br /></p>
<pre class="brush:sql; toolbar: false; auto-links: false">spark.range(1000).select(col("id"), col("id").as("k")).write.partitionBy("k").format("parquet").mode("overwrite").save("/tmp/myfact")<br />spark.range(100).select(col("id"), col("id").as("k")).write.format("parquet").mode("overwrite").save("/tmp/mydim")<br />spark.read.parquet("/tmp/myfact").createOrReplaceTempView("fact")<br />spark.read.parquet("/tmp/mydim").createOrReplaceTempView("dim")<br />sql("SELECT fact.id, fact.k FROM fact JOIN dim ON fact.k = dim.k AND dim.id < 2").explain</pre>
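<p>Besides eyeballing the plan output, a quick programmatic sanity check (just a sketch based on the same query) is to search the executed plan text for the DPP expression:</p><pre class="brush:scala; toolbar: false; auto-links: false">// Sketch: the executed plan text should contain "dynamicpruningexpression" when DPP kicks in<br />val planText = sql("SELECT fact.id, fact.k FROM fact JOIN dim ON fact.k = dim.k AND dim.id < 2").queryExecution.executedPlan.toString<br />println(planText.contains("dynamicpruningexpression"))  // expected: true with DPP enabled</pre>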
<p>The physical plan is:<br /></p><pre class="brush:sql; toolbar: false; auto-links: false;highlight: [6,7,8]">scala> sql("SELECT fact.id, fact.k FROM fact JOIN dim ON fact.k = dim.k AND dim.id < 2").explain<br />== Physical Plan ==<br />*(2) Project [id#14L, k#15]<br />+- *(2) BroadcastHashJoin [cast(k#15 as bigint)], [k#19L], Inner, BuildRight<br /> :- *(2) ColumnarToRow<br /> : +- FileScan parquet [id#14L,k#15] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[hdfs://nm:port/tmp/myfact], PartitionFilters: [isnotnull(k#15), dynamicpruningexpression(cast(k#15 as bigint) IN dynamicpruning#24)], PushedFilters: [], ReadSchema: struct<id:bigint><br /> : +- SubqueryBroadcast dynamicpruning#24, 0, [k#19L], [id=#118]<br /> : +- ReusedExchange [k#19L], BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true])), [id=#96]<br /> +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true])), [id=#96]<br /> +- *(1) Project [k#19L]<br /> +- *(1) Filter ((isnotnull(id#18L) AND (id#18L < 2)) AND isnotnull(k#19L))<br /> +- *(1) ColumnarToRow<br /> +- FileScan parquet [id#18L,k#19L] Batched: true, DataFilters: [isnotnull(id#18L), (id#18L < 2), isnotnull(k#19L)], Format: Parquet, Location: InMemoryFileIndex[hdfs://nm:port/tmp/mydim], PartitionFilters: [], PushedFilters: [IsNotNull(id), LessThan(id,2), IsNotNull(k)], ReadSchema: struct<id:bigint,k:bigint></pre>
<p> Let's compare it to a plan with DPP disabled:<br /></p>
<pre class="brush:sql; toolbar: false; auto-links: false">scala> sql("set spark.sql.optimizer.dynamicPartitionPruning.enabled=false")<br />res14: org.apache.spark.sql.DataFrame = [key: string, value: string]<br /><br />scala> sql("SELECT fact.id, fact.k FROM fact JOIN dim ON fact.k = dim.k AND dim.id < 2").explain<br />== Physical Plan ==<br />*(2) Project [id#35L, k#36]<br />+- *(2) BroadcastHashJoin [cast(k#36 as bigint)], [k#40L], Inner, BuildRight<br /> :- *(2) ColumnarToRow<br /> : +- FileScan parquet [id#35L,k#36] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[hdfs://nm:port/tmp/myfact], PartitionFilters: [isnotnull(k#36)], PushedFilters: [], ReadSchema: struct<id:bigint><br /> +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true])), [id=#288]<br /> +- *(1) Project [k#40L]<br /> +- *(1) Filter ((isnotnull(id#39L) AND (id#39L < 2)) AND isnotnull(k#40L))<br /> +- *(1) ColumnarToRow<br /> +- FileScan parquet [id#39L,k#40L] Batched: true, DataFilters: [isnotnull(id#39L), (id#39L < 2), isnotnull(k#40L)], Format: Parquet, Location: InMemoryFileIndex[hdfs://nm:port/tmp/mydim], PartitionFilters: [], PushedFilters: [IsNotNull(id), LessThan(id,2), IsNotNull(k)], ReadSchema: struct<id:bigint,k:bigint></pre>
<p>As you can see, when DPP is enabled, we can see the keywords "<span style="color: red;">ReusedExchange</span>" and "<span style="color: red;">SubqueryBroadcast</span>" before scanning the fact table. </p><p>In the fact table scan phase, there is the keyword "<span style="color: red;">dynamicpruningexpression</span>".<br /></p><p>If we let the query run with DPP enabled, then we can check the runtime query plan from the UI:<br /></p><p> </p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1HaQNhtFyaTz5cT8KYQuveWWnDj-IGurqS645q2uuHjuZv0C11XahE4ocjT9kCk0btYC2o92yH-4AbrEy3vjvsDG6HxaT-qZmz4aYA5pyKGsjm5XkpFdAI8s2Ior3PtjJJsJn52lzWdA/s1636/Screen+Shot+2021-03-15+at+4.28.51+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1636" data-original-width="702" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1HaQNhtFyaTz5cT8KYQuveWWnDj-IGurqS645q2uuHjuZv0C11XahE4ocjT9kCk0btYC2o92yH-4AbrEy3vjvsDG6HxaT-qZmz4aYA5pyKGsjm5XkpFdAI8s2Ior3PtjJJsJn52lzWdA/w274-h640/Screen+Shot+2021-03-15+at+4.28.51+PM.png" width="274" /></a></div><br /><p></p><p>Here you should notice the "<span style="color: red;">dynamic partition pruning time: 41 ms</span>" and also the "<span style="color: red;">number of partitions read: 2</span>", which means DPP is taking effect.<br /></p><p>Now let's take a look at a more complex example, q98 in TPCDS:<br /></p>
<pre class="brush:sql; toolbar: false; auto-links: false">select i_item_desc, i_category, i_class, i_current_price<br /> ,sum(ss_ext_sales_price) as itemrevenue<br /> ,sum(ss_ext_sales_price)*100/sum(sum(ss_ext_sales_price)) over<br /> (partition by i_class) as revenueratio<br />from<br /> store_sales, item, date_dim<br />where<br /> ss_item_sk = i_item_sk<br /> and i_category in ('Sports', 'Books', 'Home')<br /> and ss_sold_date_sk = d_date_sk<br /> and d_date between cast('1999-02-22' as date)<br /> and (cast('1999-02-22' as date) + interval '30' day)<br />group by<br /> i_item_id, i_item_desc, i_category, i_class, i_current_price<br />order by<br /> i_category, i_class, i_item_id, i_item_desc, revenueratio;</pre>
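<p>Before digging into the plan, one quick way to confirm the partition key of the fact table (a sketch, assuming "store_sales" is registered as a partitioned table in the "tpcds" database as shown in the plan below) is:</p><pre class="brush:scala; toolbar: false; auto-links: false">// Sketch: look for the "# Partition Information" section in the output<br />sql("DESCRIBE EXTENDED tpcds.store_sales").show(100, false)<br />// For a partitioned table this lists partition values such as ss_sold_date_sk=...<br />sql("SHOW PARTITIONS tpcds.store_sales").show(5, false)</pre>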
<p>We just need to focus on the fact table "store_sales" joining the dimension table "date_dim" on the join key "ss_sold_date_sk = d_date_sk". <br /></p><p>The column "ss_sold_date_sk" is also the partition key for "store_sales".</p><p>"date_dim" has a filter on column "d_date" to fetch only 30 days' worth of data.</p><p>Now the query plan is:</p>
<pre class="brush:sql; toolbar: false; auto-links: false;highlight: [19,20,21]">== Physical Plan ==<br />*(7) Project [i_item_desc#97, i_category#105, i_class#103, i_current_price#98, itemrevenue#159, revenueratio#160]<br />+- *(7) Sort [i_category#105 ASC NULLS FIRST, i_class#103 ASC NULLS FIRST, i_item_id#94 ASC NULLS FIRST, i_item_desc#97 ASC NULLS FIRST, revenueratio#160 ASC NULLS FIRST], true, 0<br /> +- Exchange rangepartitioning(i_category#105 ASC NULLS FIRST, i_class#103 ASC NULLS FIRST, i_item_id#94 ASC NULLS FIRST, i_item_desc#97 ASC NULLS FIRST, revenueratio#160 ASC NULLS FIRST, 20), true, [id=#490]<br /> +- *(6) Project [i_item_desc#97, i_category#105, i_class#103, i_current_price#98, itemrevenue#159, ((_w0#170 * 100.0) / _we0#172) AS revenueratio#160, i_item_id#94]<br /> +- Window [sum(_w1#171) windowspecdefinition(i_class#103, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS _we0#172], [i_class#103]<br /> +- *(5) Sort [i_class#103 ASC NULLS FIRST], false, 0<br /> +- Exchange hashpartitioning(i_class#103, 20), true, [id=#482]<br /> +- *(4) HashAggregate(keys=[i_item_id#94, i_item_desc#97, i_category#105, i_class#103, i_current_price#98], functions=[sum(ss_ext_sales_price#84)], output=[i_item_desc#97, i_category#105, i_class#103, i_current_price#98, itemrevenue#159, _w0#170, _w1#171, i_item_id#94])<br /> +- Exchange hashpartitioning(i_item_id#94, i_item_desc#97, i_category#105, i_class#103, i_current_price#98, 20), true, [id=#478]<br /> +- *(3) HashAggregate(keys=[i_item_id#94, i_item_desc#97, i_category#105, i_class#103, knownfloatingpointnormalized(normalizenanandzero(i_current_price#98)) AS i_current_price#98], functions=[partial_sum(ss_ext_sales_price#84)], output=[i_item_id#94, i_item_desc#97, i_category#105, i_class#103, i_current_price#98, sum#175])<br /> +- *(3) Project [ss_ext_sales_price#84, i_item_id#94, i_item_desc#97, i_current_price#98, i_class#103, i_category#105]<br /> +- *(3) BroadcastHashJoin [ss_sold_date_sk#92], [d_date_sk#115], Inner, BuildRight<br /> :- *(3) Project [ss_ext_sales_price#84, ss_sold_date_sk#92, i_item_id#94, i_item_desc#97, i_current_price#98, i_class#103, i_category#105]<br /> : +- *(3) BroadcastHashJoin [ss_item_sk#71], [i_item_sk#93], Inner, BuildRight<br /> : :- *(3) Project [ss_item_sk#71, ss_ext_sales_price#84, ss_sold_date_sk#92]<br /> : : +- *(3) Filter isnotnull(ss_item_sk#71)<br /> : : +- *(3) ColumnarToRow<br /> : : +- FileScan parquet tpcds.store_sales[ss_item_sk#71,ss_ext_sales_price#84,ss_sold_date_sk#92] Batched: true, DataFilters: [isnotnull(ss_item_sk#71)], Format: Parquet, Location: InMemoryFileIndex[hdfs://nm:port/data/tpcds_100G_parquet/store_sales/ss_sold_date_sk=24..., PartitionFilters: [isnotnull(ss_sold_date_sk#92), dynamicpruningexpression(ss_sold_date_sk#92 IN dynamicpruning#173)], PushedFilters: [IsNotNull(ss_item_sk)], ReadSchema: struct<ss_item_sk:int,ss_ext_sales_price:double><br /> : : +- SubqueryBroadcast dynamicpruning#173, 0, [d_date_sk#115], [id=#466]<br /> : : +- ReusedExchange [d_date_sk#115], BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#426]<br /> : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#417]<br /> : +- *(1) Project [i_item_sk#93, i_item_id#94, i_item_desc#97, i_current_price#98, i_class#103, i_category#105]<br /> : +- *(1) Filter (i_category#105 IN (Sports,Books,Home) AND isnotnull(i_item_sk#93))<br /> : +- *(1) ColumnarToRow<br /> : +- FileScan parquet 
tpcds.item[i_item_sk#93,i_item_id#94,i_item_desc#97,i_current_price#98,i_class#103,i_category#105] Batched: true, DataFilters: [i_category#105 IN (Sports,Books,Home), isnotnull(i_item_sk#93)], Format: Parquet, Location: InMemoryFileIndex[hdfs://nm:port/data/tpcds_100G_parquet/item], PartitionFilters: [], PushedFilters: [In(i_category, [Sports,Books,Home]), IsNotNull(i_item_sk)], ReadSchema: struct<i_item_sk:int,i_item_id:string,i_item_desc:string,i_current_price:double,i_class:string,i_...<br /> +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#426]<br /> +- *(2) Project [d_date_sk#115]<br /> +- *(2) Filter (((isnotnull(d_date#117) AND (d_date#117 >= 10644)) AND (d_date#117 <= 10674)) AND isnotnull(d_date_sk#115))<br /> +- *(2) ColumnarToRow<br /> +- FileScan parquet tpcds.date_dim[d_date_sk#115,d_date#117] Batched: true, DataFilters: [isnotnull(d_date#117), (d_date#117 >= 10644), (d_date#117 <= 10674), isnotnull(d_date_sk#115)], Format: Parquet, Location: InMemoryFileIndex[hdfs://nm:port/data/tpcds_100G_parquet/date_dim], PartitionFilters: [], PushedFilters: [IsNotNull(d_date), GreaterThanOrEqual(d_date,1999-02-22), LessThanOrEqual(d_date,1999-03-24), Is..., ReadSchema: struct<d_date_sk:int,d_date:date></pre>
<p>The key point is:<br /></p>
<pre class="brush:sql; toolbar: false; auto-links: false">: : +- FileScan parquet tpcds.store_sales[ss_item_sk#71,ss_ext_sales_price#84,ss_sold_date_sk#92] Batched: true, DataFilters: [isnotnull(ss_item_sk#71)], Format: Parquet, Location: InMemoryFileIndex[hdfs://nm:port/data/tpcds_100G_parquet/store_sales/ss_sold_date_sk=24..., PartitionFilters: [isnotnull(ss_sold_date_sk#92), dynamicpruningexpression(ss_sold_date_sk#92 IN dynamicpruning#173)], PushedFilters: [IsNotNull(ss_item_sk)], ReadSchema: struct<ss_item_sk:int,ss_ext_sales_price:double><br />: : +- SubqueryBroadcast dynamicpruning#173, 0, [d_date_sk#115], [id=#466]<br />: : +- ReusedExchange [d_date_sk#115], BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#426]</pre>
<p>The fact table scan has DPP enabled in "PartitionFilters: [isnotnull(ss_sold_date_sk#92), dynamicpruningexpression(ss_sold_date_sk#92 IN <span style="color: red;">dynamicpruning#173</span>)]".</p><p>"dynamicpruning#173" basically comes from the broadcasted sub-query.<br /></p><h3 style="text-align: left;">2. GPU mode <br /></h3><p>Now let's try the same minimal query using the <a href="https://nvidia.github.io/spark-rapids/" rel="nofollow" target="_blank">RAPIDS Accelerator for Apache Spark</a> (current release 0.3) + Spark to see what the query plan looks like under GPU:</p>
<pre class="brush:sql; toolbar: false; auto-links: false;highlight: [6,7]">scala> sql("SELECT fact.id, fact.k FROM fact JOIN dim ON fact.k = dim.k AND dim.id < 2").explain<br />== Physical Plan ==<br />GpuColumnarToRow false<br />+- GpuProject [id#0L, k#1]<br /> +- GpuBroadcastHashJoin [cast(k#1 as bigint)], [k#5L], Inner, GpuBuildRight<br /> :- GpuFileGpuScan parquet [id#0L,k#1] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[hdfs://nm:port/tmp/myfact], PartitionFilters: [isnotnull(k#1), dynamicpruningexpression(cast(k#1 as bigint) IN dynamicpruning#10)], PushedFilters: [], ReadSchema: struct<id:bigint><br /> : +- SubqueryBroadcast dynamicpruning#10, 0, [k#5L], [id=#51]<br /> : +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true])), [id=#50]<br /> : +- GpuColumnarToRow false<br /> : +- GpuProject [k#5L]<br /> : +- GpuCoalesceBatches TargetSize(2147483647)<br /> : +- GpuFilter ((gpuisnotnull(id#4L) AND (id#4L < 2)) AND gpuisnotnull(k#5L))<br /> : +- GpuFileGpuScan parquet [id#4L,k#5L] Batched: true, DataFilters: [isnotnull(id#4L), (id#4L < 2), isnotnull(k#5L)], Format: Parquet, Location: InMemoryFileIndex[hdfs://nm:port/tmp/mydim], PartitionFilters: [], PushedFilters: [IsNotNull(id), LessThan(id,2), IsNotNull(k)], ReadSchema: struct<id:bigint,k:bigint><br /> +- GpuBroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true])), [id=#70]<br /> +- GpuProject [k#5L]<br /> +- GpuCoalesceBatches TargetSize(2147483647)<br /> +- GpuFilter ((gpuisnotnull(id#4L) AND (id#4L < 2)) AND gpuisnotnull(k#5L))<br /> +- GpuFileGpuScan parquet [id#4L,k#5L] Batched: true, DataFilters: [isnotnull(id#4L), (id#4L < 2), isnotnull(k#5L)], Format: Parquet, Location: InMemoryFileIndex[hdfs://nm:port/tmp/mydim], PartitionFilters: [], PushedFilters: [IsNotNull(id), LessThan(id,2), IsNotNull(k)], ReadSchema: struct<id:bigint,k:bigint></pre>
<p>As you can see, DPP is also happening, because when scanning the fact table we get:<br /></p><p>PartitionFilters: [isnotnull(k#1), <span style="color: red;">dynamicpruningexpression</span>(cast(k#1 as bigint) IN dynamicpruning#10)]<br /></p><p>However, here we see that the sub-query on the dimension table is executed twice. </p><p>This performance overhead should be minimal since normally the "broadcast side" sub-query is very lightweight. </p><p>The ongoing improvement for DPP is tracked under <a href="https://github.com/NVIDIA/spark-rapids/issues/386" rel="nofollow" target="_blank">this issue</a>.<br /></p><p>This is why it is also mentioned in the current version of the <a href="https://nvidia.github.io/spark-rapids/docs/FAQ.html" rel="nofollow" target="_blank">FAQ</a>:<br /></p><p> "Is Dynamic Partition Pruning (DPP) Supported?<br />Yes, DPP still works. It might not be as efficient as it could be, and we are working to improve it."</p><h1 style="text-align: left;">Key Takeaways:</h1><p>DPP is a good feature for star-schema queries.</p><p>It uses <a href="http://www.openkb.info/2021/02/spark-tuning-use-partition-discovery.html" target="_blank">partition pruning</a> and <a href="http://www.openkb.info/2021/02/spark-tuning-explaining-spark-sql-join.html" rel="nofollow" target="_blank">broadcast hash join</a> together. </p><p>It currently only supports equi-joins.</p><p>The table to prune (the fact table) should be partitioned by the join key.<br /></p><h1 style="text-align: left;">References:</h1><ul style="text-align: left;"><li><a href="https://dzone.com/articles/dynamic-partition-pruning-in-spark-30" rel="nofollow" target="_blank">https://dzone.com/articles/dynamic-partition-pruning-in-spark-30</a> </li><li><a href="https://medium.com/@prabhakaran.electric/spark-3-0-feature-dynamic-partition-pruning-dpp-to-avoid-scanning-irrelevant-data-1a7bbd006a89" rel="nofollow" target="_blank">https://medium.com/@prabhakaran.electric/spark-3-0-feature-dynamic-partition-pruning-dpp-to-avoid-scanning-irrelevant-data-1a7bbd006a89</a> </li><li><a href="https://databricks.com/session_eu19/dynamic-partition-pruning-in-apache-spark" rel="nofollow" target="_blank">https://databricks.com/session_eu19/dynamic-partition-pruning-in-apache-spark</a> </li><li><a href="https://www.slideshare.net/databricks/dynamic-partition-pruning-in-apache-spark" rel="nofollow" target="_blank">https://www.slideshare.net/databricks/dynamic-partition-pruning-in-apache-spark</a> </li><li><a href="https://www.waitingforcode.com/apache-spark-sql/whats-new-apache-spark-3-dynamic-partition-pruning/read#configuration " rel="nofollow" target="_blank">https://www.waitingforcode.com/apache-spark-sql/whats-new-apache-spark-3-dynamic-partition-pruning/read#configuration </a><br /></li></ul><p> </p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-43410482598540328572021-03-11T14:33:00.004-08:002021-03-11T14:37:34.798-08:00How to use NVIDIA GPUs in docker container<h1 style="text-align: left;">Goal:</h1><p>This is a quick note on how to use NVIDIA GPUs in a docker container.<br /></p><h1 style="text-align: left;">Env:</h1><p>Ubuntu 18.04</p><p><span>Docker 20.10.5<br /></span></p><a name='more'></a> <p></p><h1 style="text-align: left;">Solution:</h1><p>The key is to install the <b><i>NVIDIA Container Toolkit</i></b>, which is why this note is quick:) <br /></p><h3 style="text-align: left;">1. 
Install Docker on the host machine where the NVIDIA driver is already installed.<br /></h3><p><a href="https://docs.docker.com/engine/install/ubuntu/">https://docs.docker.com/engine/install/ubuntu/</a></p><div style="text-align: left;">Note: Refer to this post on <a href="http://www.openkb.info/2021/03/how-to-intall-cuda-toolkit-and-nvidia.html" rel="" target="_blank">how to install CUDA Toolkit and NVIDIA Driver on ubuntu</a>. <br /></div><h3 style="text-align: left;">2. Install <i>NVIDIA Container Toolkit</i> on the host machine<br /></h3><p><a href="https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker " rel="nofollow" target="_blank">https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker </a><br /></p><h3 style="text-align: left;">3. Test<br /></h3><pre class="brush:bash; toolbar: false; auto-links: false">sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi</pre><p>Or only expose the first GPU (with device=0) instead of all GPUs to the docker container: <br /></p><pre class="brush:bash; toolbar: false; auto-links: false">docker run --rm --gpus device=0 nvidia/cuda:11.0-base nvidia-smi</pre>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-1093777288831671792021-03-10T12:21:00.003-08:002021-03-10T12:21:24.241-08:00Understanding RAPIDS Accelerator For Apache Spark parameter -- spark.rapids.memory.gpu.allocFraction and GPU pool related ones.<h1 style="text-align: left;">Goal:</h1><p>This article explains the RAPIDS Accelerator For Apache Spark parameter -- <b><i>spark.rapids.memory.gpu.allocFraction</i></b> and other GPU memory pool related ones: <b><i>spark.rapids.memory.gpu.maxAllocFraction</i></b>, <b><i>spark.rapids.memory.gpu.reserve</i></b>, <b><i>spark.rapids.memory.gpu.debug</i></b> and <i><b>spark.rapids.memory.gpu.pool</b></i>. <span></span></p><a name='more'></a><p></p><h1 style="text-align: left;">Env:</h1><p>Spark 3.1.1</p><p>RAPIDS Accelerator For Apache Spark 0.4</p><p>Quadro RTX 6000 with 24G memory<br /></p><h1 style="text-align: left;">Solution:</h1><h3 style="text-align: left;">1. Concept <br /></h3><p>As per the <a href="https://github.com/NVIDIA/spark-rapids/blob/main/docs/configs.md" rel="nofollow" target="_blank">configuration guide</a>, <b><i>spark.rapids.memory.gpu.pooling.enabled</i></b> is DEPRECATED and we should use <b><i>spark.rapids.memory.gpu.pool</i></b> to switch on or off the GPU memory pooling feature, and also to choose which RMM (RAPIDS Memory Manager) pooling allocator to use. </p><ul style="text-align: left;"><li>ARENA: rmm::mr::arena_memory_resource</li><li>DEFAULT: rmm::mr::pool_memory_resource</li><li>NONE: Turn off pooling, and RMM just passes through to CUDA memory allocation directly<br /></li></ul><p>Even though the value "DEFAULT" could be confusing, as of now we would recommend "ARENA". 
</p><p>To learn more about RMM, the blog post "<a href="https://developer.nvidia.com/blog/fast-flexible-allocation-for-cuda-with-rapids-memory-manager/" rel="nofollow" target="_blank">Fast, Flexible Allocation for NVIDIA CUDA with RAPIDS Memory Manager</a>" is a helpful read.</p><p>If you want to dig into the source code of RMM, here it is: <a href="https://github.com/rapidsai/rmm" rel="nofollow" target="_blank">https://github.com/rapidsai/rmm</a>.<br /></p><p>In this article, I will use ARENA for all the tests below.<br /></p><p>After GPU memory pooling is enabled, the 3 parameters below control how much memory will be pooled:</p><ul style="text-align: left;"><li><b><i>spark.rapids.memory.gpu.allocFraction</i></b>: The fraction of total GPU memory that should be initially allocated for pooled memory. Default 0.9.</li><li><b><i>spark.rapids.memory.gpu.maxAllocFraction</i></b>: The fraction of total GPU memory that limits the maximum size of the RMM pool. Default 1.0.<br /></li><li><b><i>spark.rapids.memory.gpu.reserve</i></b>: The amount of GPU memory that should remain unallocated by RMM and left for system use such as memory needed for kernels, kernel launches or JIT compilation. Default 1g.<br /></li></ul><p>Simply put, the default setting means 90% of the GPU memory will be pooled, but the maximum cannot exceed 100% - 1g.</p><p>Finally, there is another parameter <b><i>spark.rapids.memory.gpu.debug</i></b> which can be used to enable debug logging to STDOUT or STDERR. Default is NONE.<br /></p><h3 style="text-align: left;">2. Test<br /></h3><p>In the tests below, I keep <b><i>spark.rapids.memory.gpu.maxAllocFraction</i></b> at the default 1, change <b><i>spark.rapids.memory.gpu.allocFraction</i></b> and <b><i>spark.rapids.memory.gpu.reserve</i></b>, and in the meantime monitor the logs and <a href="http://www.openkb.info/2021/03/how-to-monitor-nvidia-gpu-performance.html" rel="nofollow" target="_blank">nvidia-smi</a> output after "spark-shell" is launched with only 1 executor on a single node.<br /></p><h4 style="text-align: left;"><b>a. Default </b></h4>
<pre class="brush:sql; toolbar: false; auto-links: false">spark.rapids.memory.gpu.allocFraction 0.9 (default)<br />spark.rapids.memory.gpu.reserve 1073741824 (default)</pre>
<p>GPU memory utilization:</p><pre class="brush:bash; toolbar: false; auto-links: false">utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]<br />0 %, 0 %, 24220 MiB, 24209 MiB, 11 MiB<br />4 %, 0 %, 24220 MiB, 23801 MiB, 419 MiB<br />3 %, 0 %, 24220 MiB, 1719 MiB, 22501 MiB<br />0 %, 0 %, 24220 MiB, 1719 MiB, 22501 MiB<br />0 %, 0 %, 24220 MiB, 1693 MiB, 22527 MiB</pre><p>Executor Log: <br /></p><pre class="brush:bash; toolbar: false; auto-links: false">21/03/10 10:42:25 INFO RapidsExecutorPlugin: Initializing memory from Executor Plugin<br />21/03/10 10:42:30 INFO GpuDeviceManager: Initializing RMM ARENA initial size = 21798.28125 MB, max size = 23196.3125 MB on gpuId 0</pre>
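<p>A quick back-of-the-envelope check of the numbers above (just a sketch, assuming the initial pool is total * allocFraction and the maximum is total - reserve, which matches what the log reports):</p><pre class="brush:scala; toolbar: false; auto-links: false">// Sketch: rough sanity check of the pool sizes reported in the executor log above<br />val totalMiB   = 24220.0           // memory.total reported by nvidia-smi<br />val allocFrac  = 0.9               // spark.rapids.memory.gpu.allocFraction<br />val reserveMiB = 1024.0            // spark.rapids.memory.gpu.reserve (1g)<br />println(totalMiB * allocFrac)      // ~21798, matches "initial size = 21798.28125 MB"<br />println(totalMiB - reserveMiB)     // 23196, matches "max size = 23196.3125 MB"</pre>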
<p style="text-align: left;"></p><h4 style="text-align: left;">b. Increased spark.rapids.memory.gpu.allocFraction from 0.9 to 0.99</h4>
<pre class="brush:bash; toolbar: false; auto-links: false">spark.rapids.memory.gpu.allocFraction=0.99<br />spark.rapids.memory.gpu.reserve 1073741824 (default)</pre>
<p>GPU memory utilization:</p><pre class="brush:bash; toolbar: false; auto-links: false">utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]<br />0 %, 0 %, 24220 MiB, 24209 MiB, 11 MiB<br />0 %, 0 %, 24220 MiB, 24161 MiB, 59 MiB<br />3 %, 0 %, 24220 MiB, 23723 MiB, 497 MiB<br />0 %, 0 %, 24220 MiB, 321 MiB, 23899 MiB<br />0 %, 0 %, 24220 MiB, 321 MiB, 23899 MiB<br />0 %, 0 %, 24220 MiB, 321 MiB, 23899 MiB<br />0 %, 0 %, 24220 MiB, 297 MiB, 23923 MiB</pre><p>Executor Log: <br /></p><pre class="brush:bash; toolbar: false; auto-links: false">21/03/10 10:46:54 INFO RapidsExecutorPlugin: Initializing memory from Executor Plugin<br />21/03/10 10:46:59 WARN GpuDeviceManager: Initial RMM allocation (23978.109375 MB) is larger than free memory (23519.3125 MB)<br />21/03/10 10:46:59 WARN GpuDeviceManager: Initial RMM allocation (23978.109375 MB) is larger than the adjusted maximum allocation (23196.3125 MB), lowering initial allocation to the adjusted maximum allocation.<br />21/03/10 10:46:59 INFO GpuDeviceManager: Initializing RMM ARENA initial size = 23196.3125 MB, max size = 23196.3125 MB on gpuId 0</pre>
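<p>The warnings make sense with the same back-of-the-envelope math (again a sketch, not the exact internal calculation): the requested initial pool now exceeds the adjusted maximum, so RMM lowers the initial allocation to that maximum:</p><pre class="brush:scala; toolbar: false; auto-links: false">// Sketch: with allocFraction=0.99 the requested initial pool is larger than the adjusted maximum (total - 1g reserve)<br />val totalMiB    = 24220.0<br />val requested   = totalMiB * 0.99     // ~23978, matches "Initial RMM allocation (23978.109375 MB)"<br />val adjustedMax = totalMiB - 1024.0   // 23196, matches "adjusted maximum allocation (23196.3125 MB)"<br />println(requested > adjustedMax)      // true -> initial allocation lowered to the adjusted maximum</pre>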
<p style="text-align: left;"></p><h4 style="text-align: left;">c. Increased spark.rapids.memory.gpu.allocFraction from 0.9 to 0.99 and also spark.rapids.memory.gpu.reserve from 1g to 2g</h4>
<pre class="brush:bash; toolbar: false; auto-links: false">spark.rapids.memory.gpu.allocFraction=0.99<br />spark.rapids.memory.gpu.reserve 2147483648</pre>
<p>GPU memory utilization:</p><pre class="brush:bash; toolbar: false; auto-links: false">utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]<br />0 %, 0 %, 24220 MiB, 24209 MiB, 11 MiB<br />4 %, 0 %, 24220 MiB, 24041 MiB, 179 MiB<br />5 %, 0 %, 24220 MiB, 23711 MiB, 509 MiB<br />0 %, 0 %, 24220 MiB, 1345 MiB, 22875 MiB<br />0 %, 0 %, 24220 MiB, 1345 MiB, 22875 MiB<br />0 %, 0 %, 24220 MiB, 1345 MiB, 22875 MiB<br />0 %, 0 %, 24220 MiB, 1321 MiB, 22899 MiB </pre><p>Executor Log: <br /></p><pre class="brush:bash; toolbar: false; auto-links: false">21/03/10 10:49:49 INFO RapidsExecutorPlugin: Initializing memory from Executor Plugin<br />21/03/10 10:49:54 WARN GpuDeviceManager: Initial RMM allocation (23978.109375 MB) is larger than free memory (23519.3125 MB)<br />21/03/10 10:49:54 WARN GpuDeviceManager: Initial RMM allocation (23978.109375 MB) is larger than the adjusted maximum allocation (22172.3125 MB), lowering initial allocation to the adjusted maximum allocation.<br />21/03/10 10:49:54 INFO GpuDeviceManager: Initializing RMM ARENA initial size = 22172.3125 MB, max size = 22172.3125 MB on gpuId 0</pre>
<p style="text-align: left;"></p><h4 style="text-align: left;"> d. Disable GPU memory pool</h4><pre class="brush:bash; toolbar: false; auto-links: false">spark.rapids.memory.gpu.pool NONE</pre><p>GPU memory utilization:</p><pre class="brush:bash; toolbar: false; auto-links: false">utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]<br />0 %, 0 %, 24220 MiB, 24209 MiB, 11 MiB<br />3 %, 0 %, 24220 MiB, 23891 MiB, 329 MiB<br />5 %, 0 %, 24220 MiB, 23567 MiB, 653 MiB<br />0 %, 0 %, 24220 MiB, 23519 MiB, 701 MiB<br />0 %, 0 %, 24220 MiB, 23519 MiB, 701 MiB<br />1 %, 0 %, 24220 MiB, 23495 MiB, 725 MiB</pre>
<p>Executor Log: <br /></p><pre class="brush:bash; toolbar: false; auto-links: false">21/03/10 12:03:07 INFO RapidsExecutorPlugin: Initializing memory from Executor Plugin<br />21/03/10 12:03:12 INFO GpuDeviceManager: Initializing RMM initial size = 21798.28125 MB, max size = 0.0 MB on gpuId 0</pre>
<h4 style="text-align: left;">e. Enable DEBUG <br /></h4><pre class="brush:bash; toolbar: false; auto-links: false">spark.rapids.memory.gpu.debug STDOUT</pre>
<p>stdout:<br /></p><pre class="brush:bash; toolbar: false; auto-links: false">$ tail -100f stdout<br />Thread,Time,Action,Pointer,Size,Stream<br />15129,11:04:56:292725,allocate,0x7f7192600000,18480,0x0<br />15129,11:04:56:293529,allocate,0x7f7140000000,50686648,0x0<br />15129,11:04:56:317040,allocate,0x7f7143200000,14174424,0x0<br />15129,11:04:56:319691,allocate,0x7f7192800000,13951936,0x0<br />15129,11:04:56:321843,allocate,0x7f713e000000,13936328,0x0<br />15129,11:04:56:323874,allocate,0x7f713ee00000,13929272,0x0<br />15129,11:04:56:325937,allocate,0x7f7192604a00,26432,0x0<br />15129,11:04:56:326309,allocate,0x7f7134000000,139910792,0x0<br />15129,11:04:56:326346,allocate,0x7f719260b200,13216,0x0<br />15129,11:04:56:326371,allocate,0x7f719260e600,6608,0x0<br />15129,11:04:56:370310,free,0x7f719260e600,6608,0x0<br />15129,11:04:56:370327,free,0x7f719260b200,13216,0x0<br />15129,11:04:56:370335,free,0x7f7140000000,50686648,0x0<br />15129,11:04:56:370490,free,0x7f7143200000,14174424,0x0<br />15129,11:04:56:371885,free,0x7f7192800000,13951936,0x0</pre><h3 style="text-align: left;">3. Key takeaways</h3><p>Allocating memory on a GPU can be an expensive operation, so it is recommended to use the GPU memory pool feature. </p><p>The DEBUG log is useful because it shows each allocate/free action.<br /></p><p><br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0