tag:blogger.com,1999:blog-9292704105155687022024-03-13T23:15:20.423-07:00Open Knowledge BaseOpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.comBlogger318125tag:blogger.com,1999:blog-929270410515568702.post-44817172426702988342022-07-19T10:25:00.004-07:002022-07-19T10:25:35.988-07:00Spark writing to S3 failed: java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument<h2 style="text-align: left;">Symptom:</h2><p>When using Spark to write to S3, the insert query failed:</p>
<pre class="brush:text; toolbar: false; auto-links: false">Caused by: java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;Ljava/lang/Object;)V<br /> at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:893)<br /> at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:869)<br /> at org.apache.hadoop.fs.s3a.S3AUtils.getEncryptionAlgorithm(S3AUtils.java:1580)<br /> at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:341)<br /> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)<br /> at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)<br /> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)<br /> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)<br /> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)<br /> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)<br /> at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:53)<br /> at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)<br /> at org.apache.spark.sql.execution.datasources.FindDataSourceTable.$anonfun$readDataSourceTable$1(DataSourceStrategy.scala:252)<br /> at org.sparkproject.guava.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4792)<br /> at org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)<br /> at org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)<br /> at org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)<br /> at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)</pre>
<h2 style="text-align: left;">Env:</h2><p>spark-3.2.1-bin-hadoop3.2<br />hadoop-aws-3.2.3.jar<br />aws-java-sdk-bundle-1.11.375.jar<br />guava-14.0.1.jar<span></span></p><a name='more'></a><p></p><h2 style="text-align: left;">Solution:</h2><p>Remove guava-14.0.1.jar from Spark and use the Hive's newer guava-27.0-jre.jar.</p>
<pre class="brush:bash; toolbar: false; auto-links: false">$ ls -altr $SPARK_HOME/jars/guava*.jar<br />lrwxrwxrwx 1 xxx xxx 46 Jul 19 09:39 /home/xxx/spark/myspark/jars/guava-27.0-jre.jar -> /home/xxx/hive/myhive/lib/guava-27.0-jre.jar</pre>
<p> <br /></p><p> <br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com2tag:blogger.com,1999:blog-929270410515568702.post-41992559953258545392022-07-19T10:14:00.003-07:002022-07-19T10:14:56.519-07:00Spark writing to S3 failed: java.lang.NoSuchMethodError: org.apache.hadoop.util.SemaphoredDelegatingExecutor.<h2 style="text-align: left;">Symptom:</h2><p>When using Spark to write to S3, the insert query failed:</p>
<pre class="brush:text; toolbar: false; auto-links: false">java.lang.NoSuchMethodError: org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V<br /> at org.apache.hadoop.fs.s3a.impl.StoreContext.createThrottledExecutor(StoreContext.java:292)<br /> at org.apache.hadoop.fs.s3a.impl.DeleteOperation.<init>(DeleteOperation.java:206)<br /> at org.apache.hadoop.fs.s3a.S3AFileSystem.delete(S3AFileSystem.java:2468)<br /> at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.cleanupJob(FileOutputCommitter.java:532)<br /> at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.abortJob(FileOutputCommitter.java:551)<br /> at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:242)<br /> at org.apache.spark.sql.rapids.GpuFileFormatWriter$.write(GpuFileFormatWriter.scala:262)</pre><h2 style="text-align: left;">Env:</h2><p>spark-3.2.1-bin-hadoop3.2</p><p>hadoop-aws-3.2.1.jar<br /></p><p>aws-java-sdk-bundle-1.11.375.jar<span></span></p><a name='more'></a><p></p><h2 style="text-align: left;">Solution:</h2><p>After upgrading hadoop-aws-3.2.1.jar to hadoop-aws-3.2.3.jar, it works fine.</p>
<pre class="brush:bash; toolbar: false; auto-links: false">$ ls -altr $SPARK_HOME/jars|grep -i aws<br />-rw-rw-r-- 1 xxx xxx 98732349 Jul 26 2018 aws-java-sdk-bundle-1.11.375.jar<br />-rw-rw-r-- 1 xxx xxx 506819 Jul 19 10:03 hadoop-aws-3.2.3.jar</pre>
<p><br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-87431426142094331232021-09-23T17:02:00.000-07:002021-09-23T17:02:17.679-07:00How to access Azure Open Dataset from Spark<h1 style="text-align: left;">Goal:</h1><p>This article explains how to access Azure Open Dataset from Spark. <br /></p><h1 style="text-align: left;">Env:</h1><p>spark-3.1.1-bin-hadoop2.7</p><span><a name='more'></a></span><h1 style="text-align: left;">Solution:</h1><p>Microsoft <a href="https://docs.microsoft.com/en-us/azure/open-datasets/" rel="nofollow" target="_blank">Azure Open Dataset</a> is curated and cleansed data - including weather, census, and holidays -
that you can use with minimal preparation to enrich ML models.</p><p>If we want to access it from a local Spark environment, we need 2 jars:</p><ul style="text-align: left;"><li>azure-storage-<version>.jar</li><li>hadoop-azure-<version>.jar <br /></li></ul><p>My Spark is built on Hadoop 2.7, so I have to use a relatively older hadoop-azure jar. </p><p>In this example, I downloaded the two jars below:</p><ul style="text-align: left;"><li><a href="https://repo1.maven.org/maven2/com/microsoft/azure/azure-storage/8.6.6/azure-storage-8.6.6.jar" rel="nofollow" target="_blank">azure-storage-8.6.6.jar</a></li><li><a href="https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure/2.8.5/hadoop-azure-2.8.5.jar" rel="nofollow" target="_blank">hadoop-azure-2.8.5.jar</a><br /></li></ul><h2 style="text-align: left;">1. Add the above 2 jars into the Spark classpath.</h2>
<pre class="brush:bash; toolbar: false; auto-links: false">spark.executor.extraClassPath<br />spark.driver.extraClassPath</pre>
<h2 style="text-align: left;">2. Add Azure Blob Storage related Hadoop configs</h2><p>For example, I choose to add them directly into Jupyter notebook(or you can add them into core-site.xml):<br /></p>
<pre class="brush:python; toolbar: false; auto-links: false">sc._jsc.hadoopConfiguration().set("fs.azure","org.apache.hadoop.fs.azure.NativeAzureFileSystem")<br />sc._jsc.hadoopConfiguration().set("spark.hadoop.fs.wasbs.impl","org.apache.hadoop.fs.azure.NativeAzureFileSystem")<br />sc._jsc.hadoopConfiguration().set("fs.wasbs.impl","org.apache.hadoop.fs.azure.NativeAzureFileSystem")<br />sc._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.wasbs.impl", "org.apache.hadoop.fs.azure.Wasbs")<br />sc._jsc.hadoopConfiguration().set("spark.hadoop.fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")</pre>
<h2 style="text-align: left;">3. Follow PySpark commands to access Azure Open Dataset</h2><p>For example, the <a href="https://docs.microsoft.com/en-us/azure/open-datasets/dataset-taxi-yellow?tabs=pyspark" rel="nofollow" target="_blank">PySpark commands</a> are here for accessing "NYC Taxi - Yellow" Azure Open Dataset.</p><p><br /></p><p><br /></p><p><br /></p><p><br /></p><p><br /></p><p><br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com2tag:blogger.com,1999:blog-929270410515568702.post-35953563454828033782021-05-03T15:20:00.007-07:002021-05-07T09:54:22.806-07:00Understand Decimal precision and scale calculation in Spark using GPU or CPU mode<h1 style="text-align: left;">Goal:</h1><p>This article research on how Spark calculates the Decimal precision and scale using GPU or CPU mode. </p><p>Basically we will test Addition/Subtraction/Multiplication/Division/Modulo/Union in this post.<span></span></p><a name='more'></a><p></p><h1 style="text-align: left;">Env:</h1><p>Spark 3.1.1</p><p>Rapids accelerator 0.5 snapshot with cuDF 0.19 snapshot jar<br /></p><h1 style="text-align: left;">Concept:</h1><p>Spark's logic to calculates the Decimal precision and scale is inside <a href="https://github.com/apache/spark/blob/v3.1.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala">DecimalPrecision.scala</a>.<br /></p>
<pre class="brush:text; toolbar: false; auto-links: false"> * In particular, if we have expressions e1 and e2 with precision/scale p1/s1 and p2/s2<br /> * respectively, then the following operations have the following precision / scale:<br /> *<br /> * Operation Result Precision Result Scale<br /> * ------------------------------------------------------------------------<br /> * e1 + e2 max(s1, s2) + max(p1-s1, p2-s2) + 1 max(s1, s2)<br /> * e1 - e2 max(s1, s2) + max(p1-s1, p2-s2) + 1 max(s1, s2)<br /> * e1 * e2 p1 + p2 + 1 s1 + s2<br /> * e1 / e2 p1 - s1 + s2 + max(6, s1 + p2 + 1) max(6, s1 + p2 + 1)<br /> * e1 % e2 min(p1-s1, p2-s2) + max(s1, s2) max(s1, s2)<br /> * e1 union e2 max(s1, s2) + max(p1-s1, p2-s2) max(s1, s2)</pre>
<p>This matches Hive's rule in the <a href="https://cwiki.apache.org/confluence/download/attachments/27362075/Hive_Decimal_Precision_Scale_Support.pdf" rel="nofollow" target="_blank">Hive Decimal Precision/Scale Support</a> document.<br /></p><p>In addition, Spark has a parameter <b><i>spark.sql.decimalOperations.allowPrecisionLoss</i></b> (default true): if the needed precision/scale is out of the range of representable values, the scale is reduced (down to a minimum of 6) in order to prevent truncation of the integer part of the decimal.<br /></p><p> </p><p>Now let's look at the limit in GPU mode (with the Rapids accelerator): </p><p>Currently in the Rapids accelerator 0.4.1/0.5 snapshot releases, decimals are limited to 18 digits (64-bit) as per <a href="https://nvidia.github.io/spark-rapids/docs/supported_ops.html" rel="nofollow" target="_blank">this doc</a>.<br /></p><p>So if the precision is > 18, it will fall back to CPU mode.</p><p>Below let's run some tests to confirm that the theory matches practice.<br /></p><h1 style="text-align: left;">Solution:</h1><h2 style="text-align: left;">1. Prepare an example Dataframe with different types of decimal <br /></h2>
<pre class="brush:java; toolbar: false; auto-links: false">import org.apache.spark.sql.functions._<br />import spark.implicits._<br />import org.apache.spark.sql.types._<br />spark.conf.set("spark.rapids.sql.enabled", true)<br />spark.conf.set("spark.rapids.sql.decimalType.enabled", true)<br /><br />val df = spark.sparkContext.parallelize(Seq(1)).toDF()<br />val df2=df.withColumn("value82", (lit("123456.78").cast(DecimalType(8,2)))).<br /> withColumn("value63", (lit("123.456").cast(DecimalType(6,3)))).<br /> withColumn("value1510", (lit("12345.0123456789").cast(DecimalType(15,10)))).<br /> withColumn("value2510", (lit("123456789012345.0123456789").cast(DecimalType(25,10))))<br /><br />df2.write.parquet("/tmp/df2.parquet")<br />val newdf2=spark.read.parquet("/tmp/df2.parquet")<br />newdf2.createOrReplaceTempView("df2")</pre>
newdf2's schema: <br /> <pre class="brush:java; toolbar: false; auto-links: false">scala> newdf2.printSchema<br />root<br /> |-- value: integer (nullable = false)<br /> |-- value82: decimal(8,2) (nullable = true)<br /> |-- value63: decimal(6,3) (nullable = true)<br /> |-- value1510: decimal(15,10) (nullable = true)<br /> |-- value2510: decimal(25,10) (nullable = true)</pre><h2 style="text-align: left;">2. GPU Mode (Result Decimal within GPU's limit : <=18 digits)</h2><p>The tests below make sure every result decimal's precision is within the GPU's limit, which is 18 digits in this Rapids accelerator version.</p><p>So we only use 2 fields of df2 -- value82: decimal(8,2) and value63: decimal(6,3). <br /></p><p>This is to confirm whether the theory holds in GPU mode.<br /></p><p>To calculate the expected result precision and scale from the rules above, let's define the inputs once so the math is easy to follow:</p>
<pre class="brush:java; toolbar: false; auto-links: false">import scala.math.{max, min}<br />val (p1,s1)=(8,2)<br />val (p2,s2)=(6,3)</pre>
<h3 style="text-align: left;">2.1 Addition</h3>
<pre class="brush:java; toolbar: false; auto-links: false">val df_plus=spark.sql("SELECT value82+value63 FROM df2")<br />df_plus.printSchema<br />df_plus.explain<br />df_plus.collect</pre>
<p>Output:</p>
<pre class="brush:java; toolbar: false; auto-links: false">scala> val df_plus=spark.sql("SELECT value82+value63 FROM df2")<br />df_plus: org.apache.spark.sql.DataFrame = [(CAST(value82 AS DECIMAL(10,3)) + CAST(value63 AS DECIMAL(10,3))): decimal(10,3)]<br /><br />scala> df_plus.printSchema<br />root<br /> |-- (CAST(value82 AS DECIMAL(10,3)) + CAST(value63 AS DECIMAL(10,3))): decimal(10,3) (nullable = true)<br /><br /><br />scala> df_plus.explain<br />== Physical Plan ==<br />GpuColumnarToRow false<br />+- GpuProject [gpucheckoverflow((gpupromoteprecision(cast(value82#58 as decimal(10,3))) + gpupromoteprecision(cast(value63#59 as decimal(10,3)))), DecimalType(10,3), true) AS (CAST(value82 AS DECIMAL(10,3)) + CAST(value63 AS DECIMAL(10,3)))#88]<br /> +- GpuFileGpuScan parquet [value82#58,value63#59] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/df2.parquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value82:decimal(8,2),value63:decimal(6,3)><br /><br /><br /><br />scala> df_plus.collect<br />res21: Array[org.apache.spark.sql.Row] = Array([123580.236])</pre>
<p>The result Decimal is (10,3), which matches the theory, and it also runs on GPU as shown in the explain output.<br /></p><pre class="brush:java; toolbar: false; auto-links: false">scala> max(s1, s2) + max(p1-s1, p2-s2) + 1<br />res7: Int = 10<br /><br />scala> max(s1, s2)<br />res8: Int = 3</pre>
<p>Note: In the following tests, I will just show the result instead of printing all the output, to keep this post short. But feel free to do the math yourself.</p><h3 style="text-align: left;">2.2 Subtraction</h3>
<pre class="brush:java; toolbar: false; auto-links: false"># Result Decimal (10,3)<br />val df_minus=spark.sql("SELECT value82-value63 FROM df2")<br />df_minus.printSchema<br />df_minus.explain<br />df_minus.collect</pre>
<h3 style="text-align: left;">2.3 Multiplication</h3><pre class="brush:java; toolbar: false; auto-links: false"># Result Decimal (15,5) <br />val df_multi=spark.sql("SELECT value82*value63 FROM df2")<br />df_multi.printSchema<br />df_multi.explain<br />df_multi.collect</pre>
<div style="text-align: left;">Output:</div>
<pre class="brush:java; toolbar: false; auto-links: false;highlight: [14,23]">scala> val df_multi=spark.sql("SELECT value82*value63 FROM df2")<br />df_multi: org.apache.spark.sql.DataFrame = [(CAST(value82 AS DECIMAL(9,3)) * CAST(value63 AS DECIMAL(9,3))): decimal(15,5)]<br /><br />scala> df_multi.printSchema<br />root<br /> |-- (CAST(value82 AS DECIMAL(9,3)) * CAST(value63 AS DECIMAL(9,3))): decimal(15,5) (nullable = true)<br /><br /><br />scala> df_multi.explain<br />21/05/04 18:02:21 WARN GpuOverrides:<br />!Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced<br /> @Expression <Alias> CheckOverflow((promote_precision(cast(value82#58 as decimal(9,3))) * promote_precision(cast(value63#59 as decimal(9,3)))), DecimalType(15,5), true) AS (CAST(value82 AS DECIMAL(9,3)) * CAST(value63 AS DECIMAL(9,3)))#96 could run on GPU<br /> @Expression <CheckOverflow> CheckOverflow((promote_precision(cast(value82#58 as decimal(9,3))) * promote_precision(cast(value63#59 as decimal(9,3)))), DecimalType(15,5), true) could run on GPU<br /> !Expression <Multiply> (promote_precision(cast(value82#58 as decimal(9,3))) * promote_precision(cast(value63#59 as decimal(9,3)))) cannot run on GPU because The actual output precision of the multiply is too large to fit on the GPU DecimalType(19,6)<br /> @Expression <PromotePrecision> promote_precision(cast(value82#58 as decimal(9,3))) could run on GPU<br /> @Expression <Cast> cast(value82#58 as decimal(9,3)) could run on GPU<br /> @Expression <AttributeReference> value82#58 could run on GPU<br /> @Expression <PromotePrecision> promote_precision(cast(value63#59 as decimal(9,3))) could run on GPU<br /> @Expression <Cast> cast(value63#59 as decimal(9,3)) could run on GPU<br /> @Expression <AttributeReference> value63#59 could run on GPU<br /><br />== Physical Plan ==<br />*(1) Project [CheckOverflow((promote_precision(cast(value82#58 as decimal(9,3))) * promote_precision(cast(value63#59 as decimal(9,3)))), DecimalType(15,5), true) AS (CAST(value82 AS DECIMAL(9,3)) * CAST(value63 AS DECIMAL(9,3)))#96]<br />+- GpuColumnarToRow false<br /> +- GpuFileGpuScan parquet [value82#58,value63#59] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/df2.parquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value82:decimal(8,2),value63:decimal(6,3)><br /><br /><br /><br />scala> df_multi.collect<br />res27: Array[org.apache.spark.sql.Row] = Array([15241480.23168])</pre>
<div style="text-align: left;">Here even though the result Decimal is just (15,5) but it still falls back on CPU.</div><div style="text-align: left;">This is because Spark inserts "PromotePrecision" to CAST both sides to the same type -- Decimal(9,3).</div><div style="text-align: left;">Currently GPU has to be very cautious on this PromotePrecision, so it thought the result is Decimal (19,6) instead of (15,5).<br /></div><h3 style="text-align: left;">2.4 Division</h3><pre class="brush:java; toolbar: false; auto-links: false"># Result Decimal (18,9) -- Fallback on CPU<br />val df_div=spark.sql("SELECT value82/value63 FROM df2")<br />df_div.printSchema<br />df_div.explain<br />df_div.collect</pre>
<h3 style="text-align: left;">2.5 Modulo</h3><pre class="brush:java; toolbar: false; auto-links: false"># Result Decimal (6,3) -- Fallback on CPU<br />val df_mod=spark.sql("SELECT value82 % value63 FROM df2")<br />df_mod.printSchema<br />df_mod.explain<br />df_mod.collect</pre>
<div style="text-align: left;"><b>Note: this is because Modulo is not supported for Decimal on GPU as per this <a href="https://github.com/NVIDIA/spark-rapids/blob/branch-0.5/docs/supported_ops.md" rel="nofollow" target="_blank">supported_ops.md</a>. </b><br /></div><h3 style="text-align: left;">2.6 Union</h3><pre class="brush:java; toolbar: false; auto-links: false"># Result Decimal (9,3) <br />val df_union=spark.sql("SELECT value82 from df2 union SELECT value63 from df2")<br />df_union.printSchema<br />df_union.explain<br />df_union.collect<br /></pre>
<h2 style="text-align: left;">3. GPU Mode fallback to CPU (19 ~ 38 digits)</h2><p>Below tests may fall back to CPU if result decimal's precision is above GPU's
limit. </p><p>So we only use 2 fields -- value82: decimal(8,2) and value1510: decimal(15,10) of df2. <br /></p><h3 style="text-align: left;">3.1 Addition</h3>
<pre class="brush:java; toolbar: false; auto-links: false"># Result Decimal (17,10) -- within GPU limit<br />val df_plus=spark.sql("SELECT value82+value1510 FROM df2")<br />df_plus.printSchema<br />df_plus.explain<br />df_plus.collect</pre>
<h3 style="text-align: left;">3.2 Subtraction <br /></h3>
<pre class="brush:java; toolbar: false; auto-links: false"># Result Decimal (17,10) -- within GPU limit<br />val df_minus=spark.sql("SELECT value82-value1510 FROM df2")<br />df_minus.printSchema<br />df_minus.explain<br />df_minus.collect</pre>
<div style="text-align: left;"><h3 style="text-align: left;">3.3 Multiplication</h3></div>
<pre class="brush:java; toolbar: false; auto-links: false"># Result Decimal (24,12) -- outside of GPU limit<br />val df_multi=spark.sql("SELECT value82*value1510 FROM df2")<br />df_multi.printSchema<br />df_multi.explain<br /></pre>
<div style="text-align: left;">Output:</div>
<pre class="brush:java; toolbar: false; auto-links: false;highlight: [12,23]">scala> val df_multi=spark.sql("SELECT value82*value1510 FROM df2")<br />df_multi: org.apache.spark.sql.DataFrame = [(CAST(value82 AS DECIMAL(16,10)) * CAST(value1510 AS DECIMAL(16,10))): decimal(24,12)]<br /><br />scala> df_multi.printSchema<br />root<br /> |-- (CAST(value82 AS DECIMAL(16,10)) * CAST(value1510 AS DECIMAL(16,10))): decimal(24,12) (nullable = true)<br /><br /><br />scala> df_multi.explain<br />21/05/04 18:44:46 WARN GpuOverrides:<br />!Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced; unsupported data types in output: DecimalType(24,12)<br /> !Expression <Alias> CheckOverflow((promote_precision(cast(value82#58 as decimal(16,10))) * promote_precision(cast(value1510#60 as decimal(16,10)))), DecimalType(24,12), true) AS (CAST(value82 AS DECIMAL(16,10)) * CAST(value1510 AS DECIMAL(16,10)))#132 cannot run on GPU because expression Alias CheckOverflow((promote_precision(cast(value82#58 as decimal(16,10))) * promote_precision(cast(value1510#60 as decimal(16,10)))), DecimalType(24,12), true) AS (CAST(value82 AS DECIMAL(16,10)) * CAST(value1510 AS DECIMAL(16,10)))#132 produces an unsupported type DecimalType(24,12); expression CheckOverflow CheckOverflow((promote_precision(cast(value82#58 as decimal(16,10))) * promote_precision(cast(value1510#60 as decimal(16,10)))), DecimalType(24,12), true) produces an unsupported type DecimalType(24,12)<br /> !Expression <CheckOverflow> CheckOverflow((promote_precision(cast(value82#58 as decimal(16,10))) * promote_precision(cast(value1510#60 as decimal(16,10)))), DecimalType(24,12), true) cannot run on GPU because expression CheckOverflow CheckOverflow((promote_precision(cast(value82#58 as decimal(16,10))) * promote_precision(cast(value1510#60 as decimal(16,10)))), DecimalType(24,12), true) produces an unsupported type DecimalType(24,12)<br /> !Expression <Multiply> (promote_precision(cast(value82#58 as decimal(16,10))) * promote_precision(cast(value1510#60 as decimal(16,10)))) cannot run on GPU because The actual output precision of the multiply is too large to fit on the GPU DecimalType(33,20)<br /> @Expression <PromotePrecision> promote_precision(cast(value82#58 as decimal(16,10))) could run on GPU<br /> @Expression <Cast> cast(value82#58 as decimal(16,10)) could run on GPU<br /> @Expression <AttributeReference> value82#58 could run on GPU<br /> @Expression <PromotePrecision> promote_precision(cast(value1510#60 as decimal(16,10))) could run on GPU<br /> @Expression <Cast> cast(value1510#60 as decimal(16,10)) could run on GPU<br /> @Expression <AttributeReference> value1510#60 could run on GPU<br /><br />== Physical Plan ==<br />*(1) Project [CheckOverflow((promote_precision(cast(value82#58 as decimal(16,10))) * promote_precision(cast(value1510#60 as decimal(16,10)))), DecimalType(24,12), true) AS (CAST(value82 AS DECIMAL(16,10)) * CAST(value1510 AS DECIMAL(16,10)))#132]<br />+- GpuColumnarToRow false<br /> +- GpuFileGpuScan parquet [value82#58,value1510#60] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/df2.parquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value82:decimal(8,2),value1510:decimal(15,10)><br /><br /><br /><br />scala> df_multi.collect<br />res51: Array[org.apache.spark.sql.Row] = Array([1524075473.257763907942])</pre>
<div style="text-align: left;"><h3 style="text-align: left;">3.4 Division</h3></div><pre class="brush:java; toolbar: false; auto-links: false"># Result Decimal (34,18) -- outside of GPU limit<br />val df_div=spark.sql("SELECT value82/value1510 FROM df2")<br />df_div.printSchema<br />df_div.explain<br />df_div.collect</pre>
<div style="text-align: left;"><h3 style="text-align: left;">3.5 Modulo</h3></div><pre class="brush:java; toolbar: false; auto-links: false"># Result Decimal(15,10) -- within GPU limit, but fallback on CPU<br />val df_mod=spark.sql("SELECT value82 % value1510 FROM df2")<br />df_mod.printSchema<br />df_mod.explain<br />df_mod.collect</pre>
<div style="text-align: left;"><div style="text-align: left;"><b>Note: this is because Modulo is not supported for Decimal on GPU as per this <a href="https://github.com/NVIDIA/spark-rapids/blob/branch-0.5/docs/supported_ops.md" rel="nofollow" target="_blank">supported_ops.md</a>. </b><br /></div><h3 style="text-align: left;">3.6 Union</h3></div>
<pre class="brush:java; toolbar: false; auto-links: false"># Result Decimal (16,10) -- within GPU limit<br />val df_union=spark.sql("SELECT value82 from df2 union SELECT value1510 from df2")<br />df_union.printSchema<br />df_union.explain<br />df_union.collect</pre>
<div style="text-align: left;"><h2 style="text-align: left;">4. Above decimal max range (> 38 digits) </h2></div><div style="text-align: left;">If the result decimal is above 38 digits, <b><i>spark.sql.decimalOperations.allowPrecisionLoss</i></b> can be used to control the behavior.<br /></div><div style="text-align: left;">So we only use 2 fields -- value1510: decimal(15,10) and value2510: decimal(25,10) of df2. <br /></div>
<pre class="brush:java; toolbar: false; auto-links: false"># Result Decimal (38,17)<br />val df_multi=spark.sql("SELECT value1510*value2510 FROM df2")<br />df_multi.printSchema<br />df_multi.explain<br />df_multi.collect</pre>
<div style="text-align: left;">As per the theory, the result decimal should be (41,20): <br /></div>
<pre class="brush:java; toolbar: false; auto-links: false">scala> val (p1,s1)=(15,10)<br />p1: Int = 15<br />s1: Int = 10<br /><br />scala> val (p2,s2)=(25,10)<br />p2: Int = 25<br />s2: Int = 10<br /><br />scala> p1 + p2 + 1<br />res31: Int = 41<br /><br />scala> s1 + s2<br />res32: Int = 20</pre>
<div style="text-align: left;">However since 41>38, so another function <b><i>adjustPrecisionScale</i></b> inside <a href="https://github.com/apache/spark/blob/v3.1.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala" rel="nofollow" target="_blank">DecimalType.scala</a> is called to adjust the precision and scale. </div><div style="text-align: left;">For this specific example, below code logic is applied:</div>
<pre class="brush:java; toolbar: false; auto-links: false"> } else {<br /> // Precision/scale exceed maximum precision. Result must be adjusted to MAX_PRECISION.<br /> val intDigits = precision - scale<br /> // If original scale is less than MINIMUM_ADJUSTED_SCALE, use original scale value; otherwise<br /> // preserve at least MINIMUM_ADJUSTED_SCALE fractional digits<br /> val minScaleValue = Math.min(scale, MINIMUM_ADJUSTED_SCALE)<br /> // The resulting scale is the maximum between what is available without causing a loss of<br /> // digits for the integer part of the decimal and the minimum guaranteed scale, which is<br /> // computed above<br /> val adjustedScale = Math.max(MAX_PRECISION - intDigits, minScaleValue)<br /><br /> DecimalType(MAX_PRECISION, adjustedScale)<br /> }</pre>
<div style="text-align: left;">So intDigits=41-20=21, minScaleValue=6, adjustedScale=max(38-21,6)=17.</div><div style="text-align: left;">That is why the result decimal is (38,17).<br /></div><div style="text-align: left;"> </div><div style="text-align: left;">Since above function is only called when <b><i>spark.sql.decimalOperations.allowPrecisionLoss</i></b>=true, so if we set it false, it will return null:</div>
<pre class="brush:java; toolbar: false; auto-links: false">scala> df_multi.collect<br />res67: Array[org.apache.spark.sql.Row] = Array([null])</pre>
<div style="text-align: left;"><h1 style="text-align: left;">References:</h1></div><p><a href="https://cwiki.apache.org/confluence/download/attachments/27362075/Hive_Decimal_Precision_Scale_Support.pdf" rel="nofollow" target="_blank"><span class="pl-c">https://cwiki.apache.org/confluence/download/attachments/27362075/Hive_Decimal_Precision_Scale_Support.pdf</span></a></p><p><span class="pl-c"> </span> <br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com10tag:blogger.com,1999:blog-929270410515568702.post-10121474420173567682021-04-30T11:19:00.005-07:002021-04-30T11:19:57.551-07:00kubelet failed to start after rebooting<h1 style="text-align: left;">Symptom:</h1><p>kubelet failed to start after rebooting. </p><span><a name='more'></a></span><h1 style="text-align: left;">Env:</h1><p>Ubuntu 18.04</p><p>Kubernetes 1.19 <br /></p><h1 style="text-align: left;">Root Cause:</h1><p>From "<b>journalctl -xefu kubelet</b>", we can find out the root cause:<br /></p>
<pre class="brush:text; toolbar: false; auto-links: false;highlight: 1">kubelet[11111]: F0430 xx:xx:xx.123456 11111 server.go:265] failed to run Kubelet: running with swap on is not supported, please disable swap! or set --fail-swap-on flag to false. /proc/swaps contained: </pre>
<p>Basically it means that after rebooting, swap was turned back on somehow.<br /></p><h1 style="text-align: left;">Solution: <br /></h1><p>As mentioned in another blog "How to install a Kubernetes Cluster on CentOS 7", follow step 1.2 Disable Swap.</p><pre class="brush:bash; toolbar: false; auto-links: false">swapoff -a</pre>
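<p>For the fstab change described next, a possible one-liner (this assumes standard swap entries in /etc/fstab, so review the file before running it):</p>
<pre class="brush:bash; toolbar: false; auto-links: false"># Back up fstab, then comment out any line that mounts swap<br />cp /etc/fstab /etc/fstab.bak<br />sed -i '/\sswap\s/s/^/#/' /etc/fstab</pre>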
<p>And then comment out the swap entries in <b>/etc/fstab</b>.<br /></p><p>After that, "<b>systemctl status kubelet</b>" should show kubelet is active (running). <br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com1tag:blogger.com,1999:blog-929270410515568702.post-64328195661542683972021-04-29T17:08:00.010-07:002021-06-14T14:52:39.995-07:00How to use Spark Operator to run Spark job with Rapids Accelerator<h1 style="text-align: left;">Goal:</h1><p>This article shares the steps on how to run Spark job with Rapids Accelerator using <a href="https://github.com/GoogleCloudPlatform/spark-on-k8s-operator" rel="nofollow" target="_blank">Spark Operator</a> in a Kubernetes Cluster.<span></span></p><a name='more'></a><p></p><h1 style="text-align: left;">Env:</h1><p>Spark 3.1.1</p><p>Rapids Accelerator 0.4.1 with cuDF 0.18.1</p><p>Kubernetes Cluster 1.19</p><p>Spark Operator<br /></p><h1 style="text-align: left;">Solution: <br /></h1><p>As per <a href="https://issues.apache.org/jira/browse/SPARK-33005" rel="nofollow" target="_blank">SPARK-33005</a>, Spark on Kubernetes is GA in Spark 3.1.1. <br /></p><p>In the Rapids Accelerator official Doc: <a href="https://github.com/NVIDIA/spark-rapids/blob/branch-21.06/docs/get-started/getting-started-kubernetes.md" rel="nofollow" target="_blank">Getting Started with RAPIDS and Kubernetes</a>, it shares the steps on how to use spark-submit/spark-shell to directly submit Spark jobs into a Kubernetes Cluster.</p><p>This article will mainly focus on how to use <a href="https://github.com/GoogleCloudPlatform/spark-on-k8s-operator" rel="nofollow" target="_blank">Spark Operator</a> to do the same thing.</p><p>Here we assume you already have a working Kubernetes Cluster with NVIDIA GPU support, and also built your own Spark docker image by following the above <a href="https://github.com/NVIDIA/spark-rapids/blob/branch-21.06/docs/get-started/getting-started-kubernetes.md" rel="nofollow" target="_blank">Getting Started with RAPIDS and Kubernetes</a>. </p><h2 style="text-align: left;">1. Copy your application into the docker image</h2><p>When following above <a href="https://github.com/NVIDIA/spark-rapids/blob/branch-21.06/docs/get-started/getting-started-kubernetes.md" rel="nofollow" target="_blank">Getting Started with RAPIDS and Kubernetes</a>, make sure you modify the Dockerfile to copy your application(such as jars, python files) into the docker image. </p><p>This is because, as of today, as per the <a href="https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md" rel="nofollow" target="_blank">Spark Operator user guide</a> : "A SparkApplication should set .spec.deployMode to <b><span style="color: red;">cluster</span></b>, as <b>client is not currently implemented</b>. The driver pod will then run spark-submit in client mode internally to run the driver program. "</p><p>Here we created a below test.py and copy it into docker image under directory "/opt/sparkRapidsPlugin": <br /></p>
<pre class="brush:python; toolbar: false; auto-links: false">from pyspark.sql import SQLContext<br />from pyspark import SparkConf<br />from pyspark import SparkContext<br />conf = SparkConf()<br />sc = SparkContext.getOrCreate()<br />sqlContext = SQLContext(sc)<br />df=sqlContext.createDataFrame([1,2,3], "int").toDF("value")<br />df.createOrReplaceTempView("df")<br />sqlContext.sql("SELECT * FROM df WHERE value<>1").explain()<br />sqlContext.sql("SELECT * FROM df WHERE value<>1").show()<br />sc.stop()</pre>
<p>Modify Dockerfile to add below:<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">COPY test.py /opt/sparkRapidsPlugin</pre>
<h2 style="text-align: left;">2. Create spark-operator in a namespace named "spark-operator" using helm chart.<br /></h2><p>Here we just follow the <a href="https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/quick-start-guide.md" rel="nofollow" target="_blank">Spark Operator quick start guide</a>. <br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator<br />helm install my-release spark-operator/spark-operator --namespace spark-operator --create-namespace</pre>
<p>In the end, if you want to delete this chart, use below command:</p>
<pre class="brush:bash; toolbar: false; auto-links: false">helm uninstall my-release --namespace spark-operator</pre>
<h2 style="text-align: left;">3. Check what objects are created in Kubernetes Cluster <br /></h2>
<pre class="brush:bash; toolbar: false; auto-links: false">$ kubectl get pods -n spark-operator<br />NAME READY STATUS RESTARTS AGE<br />my-release-spark-operator-599f575d4-cjlmz 1/1 Running 0 62s<br /><br />$ kubectl get deployment -n spark-operator<br />NAME READY UP-TO-DATE AVAILABLE AGE<br />my-release-spark-operator 1/1 1 1 101s<br /><br />$ kubectl get clusterrolebinding |grep spark-operator<br />my-release-spark-operator ClusterRole/my-release-spark-operator 5m28s<br /><br />$ kubectl describe clusterrolebinding my-release-spark-operator<br />Name: my-release-spark-operator<br />Labels: app.kubernetes.io/instance=my-release<br /> app.kubernetes.io/managed-by=Helm<br /> app.kubernetes.io/name=spark-operator<br /> app.kubernetes.io/version=v1beta2-1.2.3-3.1.1<br /> helm.sh/chart=spark-operator-1.1.0<br />Annotations: meta.helm.sh/release-name: my-release<br /> meta.helm.sh/release-namespace: spark-operator<br />Role:<br /> Kind: ClusterRole<br /> Name: my-release-spark-operator<br />Subjects:<br /> Kind Name Namespace<br /> ---- ---- ---------<br /> ServiceAccount my-release-spark-operator spark-operator<br /><br /><br />$ kubectl get role -n spark-operator<br />NAME CREATED AT<br />spark-role 2021-04-29T16:16:32Z</pre>
<h2 style="text-align: left;">4. Check the status of spark-operator <br /></h2>
<pre class="brush:bash; toolbar: false; auto-links: false">$ helm status --namespace spark-operator my-release<br />NAME: my-release<br />LAST DEPLOYED: Thu Apr 29 09:20:14 2021<br />NAMESPACE: spark-operator<br />STATUS: deployed<br />REVISION: 1<br />TEST SUITE: None</pre>
<h2 style="text-align: left;">5. Run a Spark Pi job without using Rapids Accelerator <br /></h2><p>This is just to make sure Spark Operator itself is working fine without adding complexity of troubleshooting. <br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">git clone https://github.com/GoogleCloudPlatform/spark-on-k8s-operator.git<br />cd spark-on-k8s-operator<br />kubectl apply -f examples/spark-pi.yaml</pre>
<p>Note: the Driver Pod will use the "spark" service account by default, so make sure you have either granted enough privileges to "spark" or modified the yaml file as needed (see the sketch after the output below for one way to create and grant the account). <br /></p><p>It should complete successfully: <br /></p><pre class="brush:bash; toolbar: false; auto-links: false">$ kubectl get pods<br />NAME READY STATUS RESTARTS AGE<br />spark-pi-driver 0/1 Completed 0 48s</pre>
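<p>If the "spark" service account does not exist yet, one way to create it and grant it permissions (shown here with the broad edit ClusterRole for simplicity; tighten this in a real cluster) is:</p>
<pre class="brush:bash; toolbar: false; auto-links: false"># Create the service account used by the driver pod and bind it to the edit role<br />kubectl create serviceaccount spark -n default<br />kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark</pre>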
<p>You can also check the status of sparkapplications (custom resource definition aka CRD) using kubectl:<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">$ kubectl get sparkapplications spark-pi -o=yaml<br />...<br />status:<br /> applicationState:<br /> state: COMPLETED<br />...</pre>
<p>Or describe it to get the events: <br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">$ kubectl describe sparkapplication spark-pi<br />...<br />Events:<br /> Type Reason Age From Message<br /> ---- ------ ---- ---- -------<br /> Normal SparkApplicationAdded 7m22s spark-operator SparkApplication spark-pi was added, enqueuing it for submission<br /> Normal SparkApplicationSubmitted 7m20s spark-operator SparkApplication spark-pi was submitted successfully<br /> Normal SparkDriverRunning 7m9s spark-operator Driver spark-pi-driver is running<br /> Normal SparkExecutorPending 7m4s spark-operator Executor spark-pi-d25689791e785e41-exec-1 is pending<br /> Normal SparkExecutorRunning 7m1s spark-operator Executor spark-pi-d25689791e785e41-exec-1 is running<br /> Normal SparkExecutorCompleted 6m58s (x2 over 6m58s) spark-operator Executor spark-pi-d25689791e785e41-exec-1 completed<br /> Normal SparkDriverCompleted 6m58s (x2 over 6m58s) spark-operator Driver spark-pi-driver completed<br /> Normal SparkApplicationCompleted 6m58s spark-operator SparkApplication spark-pi completed<br />...</pre>
<h2 style="text-align: left;">6. Build sparkctl</h2><p>sparkctl has more functionality to support Spark on K8s. It is shipped inside the downloaded Spark Operator repo.<br /></p><p>Let's build it and use it instead of kubectl.</p><h3 style="text-align: left;">6.1 Install Golang<br /></h3><p>Follow <a href="https://golang.org/doc/install" rel="nofollow" target="_blank">https://golang.org/doc/install</a> to install Golang on Mac.</p><p>After that, set the PATH in .bash_profile:<br /></p><pre class="brush:bash; toolbar: false; auto-links: false">export PATH=$PATH:/usr/local/go/bin</pre><h3 style="text-align: left;">6.2 Build sparkctl<br /></h3>
<pre class="brush:bash; toolbar: false; auto-links: false">cd sparkctl<br />go build -o sparkctl<br /></pre>
<p>After that, set PATH for this sparkctl as well.</p><h2 style="text-align: left;">7. Run a Spark job with Rapids Accelerator</h2><h3 style="text-align: left;">7.1 Create a yaml file named testpython-rapids.yaml<br /></h3>
<pre class="brush:text; toolbar: false; auto-links: false">apiVersion: "sparkoperator.k8s.io/v1beta2"<br />kind: SparkApplication<br />metadata:<br /> name: testpython-rapids<br /> namespace: default<br />spec:<br /> sparkConf:<br /> "spark.ui.port": "4045"<br /> "spark.rapids.sql.concurrentGpuTasks": "1"<br /> "spark.executor.resource.gpu.amount": "1"<br /> "spark.task.resource.gpu.amount": "1"<br /> "spark.executor.memory": "1g"<br /> "spark.rapids.memory.pinnedPool.size": "2g"<br /> "spark.executor.memoryOverhead": "3g"<br /> "spark.locality.wait": "0s"<br /> "spark.sql.files.maxPartitionBytes": "512m"<br /> "spark.sql.shuffle.partitions": "10"<br /> "spark.plugins": "com.nvidia.spark.SQLPlugin"<br /> "spark.executor.resource.gpu.discoveryScript": "/opt/sparkRapidsPlugin/getGpusResources.sh"<br /> "spark.executor.resource.gpu.vendor": "nvidia.com"<br /> "spark.executor.extraClassPath": "/opt/sparkRapidsPlugin/rapids-4-spark.jar:/opt/sparkRapidsPlugin/cudf.jar"<br /> "spark.driver.extraClassPath": "/opt/sparkRapidsPlugin/rapids-4-spark.jar:/opt/sparkRapidsPlugin/cudf.jar"<br /> type: Python<br /> pythonVersion: 3<br /> mode: cluster<br /> image: "<image>"<br /> imagePullPolicy: Always<br /> mainApplicationFile: "local:///opt/sparkRapidsPlugin/test.py"<br /> sparkVersion: "3.1.1"<br /> restartPolicy:<br /> type: Never<br /> volumes:<br /> - name: "test-volume"<br /> hostPath:<br /> path: "/tmp"<br /> type: Directory<br /> driver:<br /> cores: 1<br /> coreLimit: "1200m"<br /> memory: "1024m"<br /> labels:<br /> version: 3.1.1<br /> serviceAccount: spark<br /> volumeMounts:<br /> - name: "test-volume"<br /> mountPath: "/tmp"<br /> executor:<br /> cores: 1<br /> instances: 1<br /> memory: "5000m"<br /> gpu:<br /> name: "nvidia.com/gpu"<br /> quantity: 1<br /> labels:<br /> version: 3.1.1<br /> volumeMounts:<br /> - name: "test-volume"<br /> mountPath: "/tmp"</pre>
<h3 style="text-align: left;">7.2 Submit testpython-rapids<br /></h3>
<pre class="brush:text; toolbar: false; auto-links: false">sparkctl create testpython-rapids.yaml</pre>
<h3 style="text-align: left;">7.3 Check status of testpython-rapids<br /></h3>
<pre class="brush:text; toolbar: false; auto-links: false">sparkctl status testpython-rapids</pre>
<h3 style="text-align: left;">7.4 Check driver log<br /></h3><pre class="brush:text; toolbar: false; auto-links: false">sparkctl log testpython-rapids</pre><p>It should show GPU related query plan and the job results.</p>
<pre class="brush:text; toolbar: false; auto-links: false">== Physical Plan ==<br />GpuColumnarToRow false<br />+- GpuFilter (gpuisnotnull(value#0) AND NOT (value#0 = 1))<br /> +- GpuRowToColumnar TargetSize(2147483647)<br /> +- *(1) Scan ExistingRDD[value#0]</pre>
<h3 style="text-align: left;">7.5 Check executor log (when it is running)<br /></h3>
<pre class="brush:bash; toolbar: false; auto-links: false">sparkctl log testpython-rapids -e 1</pre>
<h3 style="text-align: left;">7.6 Check the events<br /></h3><pre class="brush:bash; toolbar: false; auto-links: false">sparkctl event testpython-rapids</pre>
<h3 style="text-align: left;">7.7 port forwarding (when driver is running)<br /></h3><pre class="brush:bash; toolbar: false; auto-links: false">sparkctl forward testpython-rapids --local-port 1234 --remote-port 4045</pre><p>Then open localhost:1234 in browser. </p><p>Note: here the remote port 4045 is what we set for "spark.ui.port" in the testpython-rapids.yaml.<br /></p><h3 style="text-align: left;">7.8 Delete the spark job<br /></h3>
<pre class="brush:bash; toolbar: false; auto-links: false">sparkctl delete testpython-rapids</pre>
<p><br /></p><h1 style="text-align: left;">Reference:</h1><ul style="text-align: left;"><li><a href="https://github.com/NVIDIA/spark-rapids/blob/branch-21.06/docs/get-started/getting-started-kubernetes.md">https://github.com/NVIDIA/spark-rapids/blob/branch-21.06/docs/get-started/getting-started-kubernetes.md</a></li><li><a href="https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/quick-start-guide.md" rel="nofollow" target="_blank">https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/quick-start-guide.md</a><br /></li><li><a href="https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md" rel="nofollow" target="_blank">https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md</a><br /></li></ul><p><br /></p><p><br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-56078363623768855412021-04-27T20:18:00.001-07:002021-04-28T11:00:54.577-07:00Rapids Accelerator compatibility related to spark.sql.legacy.parquet.datetimeRebaseModeInWrite<h1 style="text-align: left;">Goal:</h1><p>This article talked about the compatibility of <a href="https://nvidia.github.io/spark-rapids/" rel="nofollow" target="_blank">Rapids Accelerator for Spark </a>regarding parquet writing related to parameters <b><i>spark.sql.legacy.parquet.datetimeRebaseModeInWrite</i></b> etc.<br /></p><span><a name='more'></a></span><h1 style="text-align: left;">Env:</h1><p>Spark 3.1.1</p><p>Rapids Accelerator for Spark 0.5 snapshot <br /></p><h1 style="text-align: left;">Solution:</h1><p>Spark 3.0 made the change to use Proleptic Gregorian calendar instead of hybrid Gregorian+Julian calendar. So it caused some trouble when reading/writing to/from old "legacy" format from Spark 2.x.</p><p>Here is a <a href="https://www.waitingforcode.com/apache-spark-sql/whats-new-apache-spark-3-proleptic-calendar-date-time-management/read#backward_compatibility" rel="nofollow" target="_blank">nice blog</a> to explain the change, and I would strongly recommend read it firstly.<br /></p><ul style="text-align: left;"><li><a href="https://issues.apache.org/jira/browse/SPARK-31405" rel="nofollow" target="_blank">SPARK-31405</a> (starting from 3.0) introduced parameter <b><i>spark.sql.legacy.parquet.datetimeRebaseModeInWrite</i></b> which influences on writes of the following parquet logical types:DATE, TIMESTAMP_MILLIS, TIMESTAMP_MICROS. </li><li><a href="https://issues.apache.org/jira/browse/SPARK-33210" rel="nofollow" target="_blank">SPARK-33210</a> (starting from 3.1) introduced another parameter <b><i>spark.sql.legacy.parquet.int96RebaseModeInWrite</i></b> for INT96 type(timestamp).<br /></li></ul><p>Here are 3 values:</p><ul style="text-align: left;"><li><b>EXCEPTION</b> (Default): Spark will fail the writing if it sees ancient dates/timestamps that are ambiguous between the two calendars.</li><li><b>LEGACY</b>: Spark will rebase dates/timestamps from Proleptic Gregorian calendar to the legacy hybrid (Julian + Gregorian) calendar when writing Parquet files</li><li><b>CORRECTED</b>: Spark will not do rebase and write the dates/timestamps as it is.</li></ul><p>In CPU mode, let's firstly look at the behaviors.</p><h2 style="text-align: left;">1. CPU Mode</h2><h3 style="text-align: left;">1.1 <b>EXCEPTION (Default)</b></h3>
<pre class="brush:bash; toolbar: false; auto-links: false">import java.sql.Date<br />spark.conf.set("spark.rapids.sql.enabled", false)<br />spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "EXCEPTION")<br />Seq(Date.valueOf("1500-12-25")).toDF("dt").write.format("parquet").mode("overwrite").save("/tmp/testparquet_exception")</pre>
<p>It will fail with: <br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">Caused by: org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: <br />writing dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z into Parquet files can be dangerous, <br />as the files may be read by Spark 2.x or legacy versions of Hive later, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. <br />See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during writing, to get maximum interoperability. <br />Or set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'CORRECTED' to write the datetime values as it is, <br />if you are 100% sure that the written files will only be read by Spark 3.0+ or other systems that use Proleptic Gregorian calendar.</pre>
<h3 style="text-align: left;">1.2 LEGACY<br /></h3>
<pre class="brush:bash; toolbar: false; auto-links: false">spark.conf.set("spark.rapids.sql.enabled", false)<br />spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")<br />Seq(Date.valueOf("1500-12-25")).toDF("dt").write.format("parquet").mode("overwrite").save("/tmp/testparquet_legacy")<br />spark.read.parquet("/tmp/testparquet_legacy").createOrReplaceTempView("date_legacy")<br />spark.sql("SELECT * FROM date_legacy").explain<br />spark.sql("SELECT * FROM date_legacy").show</pre>
<p>Output:<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">scala> spark.sql("SELECT * FROM date_legacy").explain<br />== Physical Plan ==<br />*(1) ColumnarToRow<br />+- FileScan parquet [dt#30] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/testparquet_legacy], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<dt:date><br /><br /><br />scala> spark.sql("SELECT * FROM date_legacy").show<br />+----------+<br />| dt|<br />+----------+<br />|1500-12-25|<br />+----------+</pre>
<h3 style="text-align: left;">1.3 CORRECTED</h3>
<pre class="brush:bash; toolbar: false; auto-links: false">spark.conf.set("spark.rapids.sql.enabled", false)<br />spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")<br />Seq(Date.valueOf("1500-12-25")).toDF("dt").write.format("parquet").mode("overwrite").save("/tmp/testparquet_corrected")<br />spark.read.parquet("/tmp/testparquet_corrected").createOrReplaceTempView("date_corrected")<br />spark.sql("SELECT * FROM date_corrected").explain<br />spark.sql("SELECT * FROM date_corrected").show<br /></pre>
<p>Output:</p>
<pre class="brush:bash; toolbar: false; auto-links: false">scala> spark.sql("SELECT * FROM date_corrected").explain<br />== Physical Plan ==<br />*(1) ColumnarToRow<br />+- FileScan parquet [dt#46] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/testparquet_corrected], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<dt:date><br /><br /><br />scala> spark.sql("SELECT * FROM date_corrected").show<br />+----------+<br />| dt|<br />+----------+<br />|1500-12-25|<br />+----------+</pre>
<h2 style="text-align: left;">2. GPU Mode</h2><h3 style="text-align: left;">2.1 <b>EXCEPTION (Default)</b></h3>
<pre class="brush:bash; toolbar: false; auto-links: false">import java.sql.Date<br />spark.conf.set("spark.rapids.sql.enabled", true)<br />spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "EXCEPTION")<br />Seq(Date.valueOf("1500-12-25")).toDF("dt").write.format("parquet").mode("overwrite").save("/tmp/testparquet_exception")</pre>
<p>It will fail with: <br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">Caused by: org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: <br />writing dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z into Parquet files can be dangerous, <br />as the files may be read by Spark 2.x or legacy versions of Hive later, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. <br />See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during writing, to get maximum interoperability. <br />Or set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'CORRECTED' to write the datetime values as it is, <br />if you are 100% sure that the written files will only be read by Spark 3.0+ or other systems that use Proleptic Gregorian calendar.</pre>
<h3 style="text-align: left;">2.2 LEGACY</h3>
<pre class="brush:bash; toolbar: false; auto-links: false">spark.conf.set("spark.rapids.sql.enabled", true)<br />spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")<br />Seq(Date.valueOf("1500-12-25")).toDF("dt").write.format("parquet").mode("overwrite").save("/tmp/testparquet_legacy")<br />spark.read.parquet("/tmp/testparquet_legacy").createOrReplaceTempView("date_legacy")<br />spark.sql("SELECT * FROM date_legacy").explain<br />spark.sql("SELECT * FROM date_legacy").show</pre>
<p>The data writing can finish successfully since we use the LEGACY value, but it is done by CPU instead of GPU (see the warning message "Output <InsertIntoHadoopFsRelationCommand> cannot run on GPU because LEGACY rebase mode for dates and timestamps is not supported"):<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">scala> Seq(Date.valueOf("1500-12-25")).toDF("dt").write.format("parquet").mode("overwrite").save("/tmp/testparquet_legacy")<br />21/04/28 01:29:27 WARN GpuOverrides:<br />!Exec <DataWritingCommandExec> cannot run on GPU because not all data writing commands can be replaced<br /> !Output <InsertIntoHadoopFsRelationCommand> cannot run on GPU because LEGACY rebase mode for dates and timestamps is not supported<br /> !NOT_FOUND <LocalTableScanExec> cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.LocalTableScanExec could be found<br /> @Expression <AttributeReference> dt#66 could run on GPU</pre>
<p>Spark UI can show the query plan which is on CPU as well:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhwENHmBnHitG-gnevWdrCJEPhe4Hcs7Smq7C-wxcFQ-e2t4J99ULsXKg2fbcSDZtSUudLMxNqFotKFK5c6oaAchLfa986JVQ0dSF8jTGzWBqxAwplzsB3o1qUdQHpvJ5-4K3h8AFizR1c/s692/Screen+Shot+2021-04-27+at+6.36.52+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="564" data-original-width="692" height="522" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhwENHmBnHitG-gnevWdrCJEPhe4Hcs7Smq7C-wxcFQ-e2t4J99ULsXKg2fbcSDZtSUudLMxNqFotKFK5c6oaAchLfa986JVQ0dSF8jTGzWBqxAwplzsB3o1qUdQHpvJ5-4K3h8AFizR1c/w640-h522/Screen+Shot+2021-04-27+at+6.36.52+PM.png" width="640" /></a></div>The data reading fails with below error message and suggest us to set <b><i>spark.sql.legacy.parquet.datetimeRebaseModeInRead</i></b> to CORRECTED.<br />
<pre class="brush:bash; toolbar: false; auto-links: false">scala> spark.sql("SELECT * FROM date_legacy").explain<br />== Physical Plan ==<br />GpuColumnarToRow false<br />+- GpuFileGpuScan parquet [dt#69] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/testparquet_legacy], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<dt:date><br /><br /><br />scala> spark.sql("SELECT * FROM date_legacy").show<br />21/04/28 01:29:28 WARN TaskSetManager: Lost task 0.0 in stage 13.0 (TID 19) (111.111.111.111 executor 0): org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. The RAPIDS Accelerator does not support reading these 'LEGACY' files. To do so you should disable Parquet support in the RAPIDS Accelerator or set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the datetime values as it is.</pre>
<p>Even after setting <b><i>spark.sql.legacy.parquet.datetimeRebaseModeInRead</i></b> to CORRECTED or LEGACY, it still fails with the same error.<br /></p><h3 style="text-align: left;">2.3 CORRECTED</h3>
<pre class="brush:bash; toolbar: false; auto-links: false">spark.conf.set("spark.rapids.sql.enabled", true)<br />spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")<br />Seq(Date.valueOf("1500-12-25")).toDF("dt").write.format("parquet").mode("overwrite").save("/tmp/testparquet_corrected")<br />spark.read.parquet("/tmp/testparquet_corrected").createOrReplaceTempView("date_corrected")<br />spark.sql("SELECT * FROM date_corrected").explain<br />spark.sql("SELECT * FROM date_corrected").show</pre>
<p>The data writing can finish successfully on GPU since we use the CORRECTED value:<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">scala> Seq(Date.valueOf("1500-12-25")).toDF("dt").write.format("parquet").mode("overwrite").save("/tmp/testparquet_corrected")<br />21/04/28 01:58:23 WARN GpuOverrides:<br /> !NOT_FOUND <LocalTableScanExec> cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.LocalTableScanExec could be found<br /> @Expression <AttributeReference> dt#140 could run on GPU</pre>
<p>Spark UI can show the query plan which is on <span style="color: red;"><b>GPU</b></span> as well:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjwG3unWXDBD_6nTLkr2DwkLIHPClr1LjLdO9KNRfkNuEhrIV02v1Pajn8MO2Wgdgrbvv01D9CCflMHICR2OG3hWUaJZBvQclrULmoGELcP2qasLQPwccA4gvwZm8GktMSigoKmGxs5__M/s1168/Screen+Shot+2021-04-27+at+7.01.05+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1168" data-original-width="814" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjwG3unWXDBD_6nTLkr2DwkLIHPClr1LjLdO9KNRfkNuEhrIV02v1Pajn8MO2Wgdgrbvv01D9CCflMHICR2OG3hWUaJZBvQclrULmoGELcP2qasLQPwccA4gvwZm8GktMSigoKmGxs5__M/w446-h640/Screen+Shot+2021-04-27+at+7.01.05+PM.png" width="446" /></a></div>The data reading also works fine on GPU:<br />
<pre class="brush:bash; toolbar: false; auto-links: false">scala> spark.sql("SELECT * FROM date_corrected").explain<br />== Physical Plan ==<br />GpuColumnarToRow false<br />+- GpuFileGpuScan parquet [dt#143] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/testparquet_corrected], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<dt:date><br /><br /><br />scala> spark.sql("SELECT * FROM date_corrected").show<br />+----------+<br />| dt|<br />+----------+<br />|1500-12-25|<br />+----------+</pre>
<h2 style="text-align: left;">3. Int96 timestamp tests<br /></h2><p>Of course, we can do similar tests for int96 timestamp type using below scripts. </p><p>Here I will let you try it out. <br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">spark.conf.set("spark.rapids.sql.enabled", true)<br /><br />spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "EXCEPTION")<br />Seq(java.sql.Timestamp.valueOf("1500-01-01 00:00:00")).toDF("ts").write.format("parquet").mode("overwrite").save("/tmp/testparquet_exception")<br />spark.read.parquet("/tmp/testparquet_exception").createOrReplaceTempView("ts_exception")<br />spark.sql("SELECT * FROM ts_exception").explain<br />spark.sql("SELECT * FROM ts_exception").show<br /><br />spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "LEGACY")<br />Seq(java.sql.Timestamp.valueOf("1500-01-01 00:00:00")).toDF("ts").write.format("parquet").mode("overwrite").save("/tmp/testparquet_legacy")<br />spark.read.parquet("/tmp/testparquet_legacy").createOrReplaceTempView("ts_legacy")<br />spark.sql("SELECT * FROM ts_legacy").explain<br />spark.sql("SELECT * FROM ts_legacy").show<br /><br />spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")<br />Seq(java.sql.Timestamp.valueOf("1500-01-01 00:00:00")).toDF("ts").write.format("parquet").mode("overwrite").save("/tmp/testparquet_corrected")<br />spark.read.parquet("/tmp/testparquet_corrected").createOrReplaceTempView("ts_corrected")<br />spark.sql("SELECT * FROM ts_corrected").explain<br />spark.sql("SELECT * FROM ts_corrected").show</pre>
<h2 style="text-align: left;">4. 1582-10-15 behaviors<br /></h2><p>As you remember, the error message shows that "reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous".</p><p>Here we focus on date which is 1582-10-15.</p><p>Let's use below sample test program on both CPU mode and GPU mode, and change the date "1582-10-15" to older dates in the following tests.<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")<br />Seq(Date.valueOf("1582-10-15")).toDF("dt").write.format("parquet").mode("overwrite").save("/tmp/testparquet_legacy")<br /><br />spark.read.parquet("/tmp/testparquet_legacy").createOrReplaceTempView("date_legacy")<br />spark.sql("SELECT * FROM date_legacy").explain<br />spark.sql("SELECT * FROM date_legacy").show</pre>
<h3 style="text-align: left;">4.1 1582-10-15</h3><p>Both CPU and GPU Modes can successfully read it as 1582-10-15: </p>
<pre class="brush:bash; toolbar: false; auto-links: false">scala> spark.sql("SELECT * FROM date_legacy").show<br />+----------+<br />| dt|<br />+----------+<br />|1582-10-15|<br />+----------+</pre>
<h3 style="text-align: left;">4.2 1582-10-14<br /></h3><p>Both CPU and GPU Modes started to show ambiguous result: 1582-10-<span style="color: red;">24</span> which is "original date"+10:<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">scala> spark.sql("SELECT * FROM date_legacy").show<br />+----------+<br />| dt|<br />+----------+<br />|1582-10-24|<br />+----------+</pre>
<div style="text-align: left;">This "original date"+10 behavior lasts until 1582-10-05.</div><div style="text-align: left;"><h3 style="text-align: left;">4.3 1582-10-04</h3></div><p>CPU Mode can successfully read it as 1582-10-04 going forward: </p>
<pre class="brush:bash; toolbar: false; auto-links: false">scala> spark.sql("SELECT * FROM date_legacy").show<br />+----------+<br />| dt|<br />+----------+<br />|1582-10-04|<br />+----------+</pre><p><b>However GPU Mode will fail since 1582-10-04:</b></p>
<pre class="brush:bash; toolbar: false; auto-links: false">Caused by: org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. The RAPIDS Accelerator does not support reading these 'LEGACY' files. To do so you should disable Parquet support in the RAPIDS Accelerator or set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the datetime values as it is.<br /> at org.apache.spark.sql.rapids.execution.TrampolineUtil$.makeSparkUpgradeException(TrampolineUtil.scala:78)<br /> at com.nvidia.spark.RebaseHelper$.newRebaseExceptionInRead(RebaseHelper.scala:83)<br /> at com.nvidia.spark.rapids.MultiFileParquetPartitionReader.$anonfun$readToTable$3(GpuParquetScan.scala:1162)<br /> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)<br /> at com.nvidia.spark.rapids.MultiFileParquetPartitionReader.$anonfun$readToTable$2(GpuParquetScan.scala:1160)<br /> at com.nvidia.spark.rapids.MultiFileParquetPartitionReader.$anonfun$readToTable$2$adapted(GpuParquetScan.scala:1158)<br /> at com.nvidia.spark.rapids.Arm.closeOnExcept(Arm.scala:76)<br /> at com.nvidia.spark.rapids.Arm.closeOnExcept$(Arm.scala:74)<br /> at com.nvidia.spark.rapids.FileParquetPartitionReaderBase.closeOnExcept(GpuParquetScan.scala:504)<br /> at com.nvidia.spark.rapids.MultiFileParquetPartitionReader.readToTable(GpuParquetScan.scala:1158)<br /> at com.nvidia.spark.rapids.MultiFileParquetPartitionReader.$anonfun$readBatch$1(GpuParquetScan.scala:1113)<br /> at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)<br /> at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)<br /> at com.nvidia.spark.rapids.FileParquetPartitionReaderBase.withResource(GpuParquetScan.scala:504)<br /> at com.nvidia.spark.rapids.MultiFileParquetPartitionReader.readBatch(GpuParquetScan.scala:1098)<br /> at com.nvidia.spark.rapids.MultiFileParquetPartitionReader.next(GpuParquetScan.scala:926)<br /> at com.nvidia.spark.rapids.PartitionIterator.hasNext(GpuDataSourceRDD.scala:59)<br /> at com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(GpuDataSourceRDD.scala:76)<br /> at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)<br /> at org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:385)<br /> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)<br /> at com.nvidia.spark.rapids.GpuBaseLimitExec$$anon$1.hasNext(limit.scala:62)<br /> at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$$anon$1.partNextBatch(GpuShuffleExchangeExec.scala:208)<br /> at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$$anon$1.hasNext(GpuShuffleExchangeExec.scala:225)<br /> at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)<br /> at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)<br /> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)<br /> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)<br /> at org.apache.spark.scheduler.Task.run(Task.scala:131)<br /> at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)<br /> 
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)<br /> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)<br /> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)<br /> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)<br /> at java.base/java.lang.Thread.run(Thread.java:834)</pre>
<p> </p><p><br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-21284232423239270362021-04-20T23:14:00.003-07:002021-04-20T23:44:03.624-07:00Spark Code -- Dig into SparkListenerEvent<h1 style="text-align: left;">Goal:</h1><p>This article digs into the different types of SparkListenerEvent in the Spark event log with some examples. </p><p>Understanding this can help us know how to parse the Spark event log.<span></span></p><a name='more'></a><p></p><h1 style="text-align: left;">Env:</h1><p><a href="https://github.com/apache/spark/tree/v3.1.1" rel="nofollow" target="_blank">Apache Spark 3.1.1 source code </a><br /></p><h1 style="text-align: left;">Solution:</h1><p><b>WARNING: this article walks through all of the SparkListenerEvent types below in the Spark event log with examples. It contains lots of Apache Spark source code analysis. If you do not like reading a bunch of source code, you can stop now.</b> </p><p>As we know, the Spark event log can be shown nicely in the Spark HistoryServer(SHS) UI. Then why would we parse the Spark event log manually? </p><p>The answer is that SHS only shows a small portion of the event log. There is lots of good stuff inside the Spark event log, such as task metrics, SQL plan node accumulables, etc. <br /></p><p>Basically the event log is a file of JSON lines, with each line produced from one of the Scala case classes which extend a trait (interface) called "SparkListenerEvent". Those definitions are inside <a href="https://github.com/apache/spark/blob/v3.1.1/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala" rel="nofollow" target="_blank">core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala</a>.</p><p>Spark has its own <a href="https://github.com/apache/spark/blob/v3.1.1/core/src/main/scala/org/apache/spark/deploy/history/EventLogFileReaders.scala" rel="nofollow" target="_blank">EventLogFileReaders</a> which are backward compatible, so we do not need to write a JSON parser ourselves. One reason is that our own JSON parser could become out of date if the event log format changes in future Spark versions.</p><p>So if our interest is to parse the event log, we can learn how SHS parses it. The logic is inside <a href="https://github.com/apache/spark/blob/v3.1.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala" rel="nofollow" target="_blank">FsHistoryProvider.scala</a>:<br /></p>
<pre class="brush:java; toolbar: false; auto-links: false">Utils.tryWithResource(EventLogFileReader.openEventLog(lastFile.getPath, fs))</pre>
<p>If we used "<a href="https://stedolan.github.io/jq/" rel="nofollow" target="_blank">jq</a>" to format the event log in a human readable format, you can find the details of each json object. <br /></p><p>Now let's look into each of below 21 types of SparkListenerEvent:</p><p>Some of them are very simple and straightforward, but some of them are very difficult to understand the logic: especially there are 6 different types of events handling SQL plan accumulables with each other, and AQE related events may override the query plan got from previous events.<br /></p><ol style="text-align: left;"><li>SparkListenerLogStart</li><li>SparkListenerResourceProfileAdded</li><li>SparkListenerBlockManagerAdded</li><li>SparkListenerBlockManagerRemoved</li><li>SparkListenerEnvironmentUpdate</li><li>SparkListenerTaskStart</li><li>SparkListenerApplicationStart</li><li>SparkListenerExecutorAdded</li><li>SparkListenerExecutorRemoved</li><li>SparkListenerSQLExecutionStart</li><li>SparkListenerSQLExecutionEnd</li><li>SparkListenerDriverAccumUpdates</li><li>SparkListenerJobStart</li><li>SparkListenerStageSubmitted</li><li>SparkListenerTaskEnd</li><li>SparkListenerStageCompleted</li><li>SparkListenerJobEnd</li><li>SparkListenerTaskGettingResult</li><li>SparkListenerApplicationEnd</li><li>SparkListenerSQLAdaptiveExecutionUpdate</li><li>SparkListenerSQLAdaptiveSQLMetricUpdates<br /></li></ol><h3 style="text-align: left;">1. SparkListenerLogStart <br /></h3><p>Sample json object:</p>
<pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "SparkListenerLogStart",<br /> "Spark Version": "3.1.1"<br />}</pre>
<p>Case class definition:</p>
<pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerLogStart(sparkVersion: String) extends SparkListenerEvent</pre><p>Very straightforward we can get spark version from it.</p><h3 style="text-align: left;">2. SparkListenerResourceProfileAdded<br /></h3><p>Sample json object:</p>
<pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "SparkListenerResourceProfileAdded",<br /> "Resource Profile Id": 0,<br /> "Executor Resource Requests": {<br /> "cores": {<br /> "Resource Name": "cores",<br /> "Amount": 16,<br /> "Discovery Script": "",<br /> "Vendor": ""<br /> },<br /> "memory": {<br /> "Resource Name": "memory",<br /> "Amount": 81920,<br /> "Discovery Script": "",<br /> "Vendor": ""<br /> },<br /> "offHeap": {<br /> "Resource Name": "offHeap",<br /> "Amount": 0,<br /> "Discovery Script": "",<br /> "Vendor": ""<br /> },<br /> "gpu": {<br /> "Resource Name": "gpu",<br /> "Amount": 1,<br /> "Discovery Script": "/xxx/xxx/xxx/xxx/examples/src/main/scripts/getGpusResources.sh",<br /> "Vendor": ""<br /> }<br /> },<br /> "Task Resource Requests": {<br /> "cpus": {<br /> "Resource Name": "cpus",<br /> "Amount": 1<br /> },<br /> "gpu": {<br /> "Resource Name": "gpu",<br /> "Amount": 0.25<br /> }<br /> }<br />}</pre>
<p>Case class definition:</p>
<pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerResourceProfileAdded(resourceProfile: ResourceProfile)<br /> extends SparkListenerEvent</pre>
<p>What is <a href="https://github.com/apache/spark/blob/v3.1.1/core/src/main/scala/org/apache/spark/resource/ResourceProfile.scala" rel="nofollow" target="_blank">ResourceProfile</a>?<br /></p>
<pre class="brush:java; toolbar: false; auto-links: false">class ResourceProfile(<br /> val executorResources: Map[String, ExecutorResourceRequest],<br /> val taskResources: Map[String, TaskResourceRequest])</pre>
<p>What are <a href="https://github.com/apache/spark/blob/v3.1.1/core/src/main/scala/org/apache/spark/resource/ExecutorResourceRequest.scala" rel="nofollow" target="_blank">ExecutorResourceRequest</a> and <a href="https://github.com/apache/spark/blob/v3.1.1/core/src/main/scala/org/apache/spark/resource/TaskResourceRequests.scala" rel="nofollow" target="_blank">TaskResourceRequest</a>?<br /></p>
<pre class="brush:java; toolbar: false; auto-links: false">class ExecutorResourceRequest(<br /> val resourceName: String,<br /> val amount: Long,<br /> val discoveryScript: String = "",<br /> val vendor: String = "") extends Serializable {<br /> ...<br /><br />class TaskResourceRequests() extends Serializable {<br /> private val _taskResources = new ConcurrentHashMap[String, TaskResourceRequest]()<br /> def requests: Map[String, TaskResourceRequest] = _taskResources.asScala.toMap<br /> def requestsJMap: JMap[String, TaskResourceRequest] = requests.asJava<br /> def cpus(amount: Int): this.type = {<br /> def resource(resourceName: String, amount: Double): this.type = {<br /> ...</pre>
<p>After some digging, we know SparkListenerResourceProfileAdded contains the executor and task resource requests such as CPU, memory, GPU, etc. </p><p>The GPU resource is a little harder to get, because we need to look it up in a Map (as in the sketch above) instead of reading it directly from a method or a field. <br /></p><h3 style="text-align: left;">3. SparkListenerBlockManagerAdded <br /></h3><p>Sample json object:</p>
<pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "SparkListenerBlockManagerAdded",<br /> "Block Manager ID": {<br /> "Executor ID": "driver",<br /> "Host": "myhostname",<br /> "Port": 44159<br /> },<br /> "Maximum Memory": 3032481792,<br /> "Timestamp": 1618341863606,<br /> "Maximum Onheap Memory": 3032481792,<br /> "Maximum Offheap Memory": 0<br />}</pre>
<p>Case class definition:</p><pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerBlockManagerAdded(<br /> time: Long,<br /> blockManagerId: BlockManagerId,<br /> maxMem: Long,<br /> maxOnHeapMem: Option[Long] = None,<br /> maxOffHeapMem: Option[Long] = None) extends SparkListenerEvent {<br />}</pre>
<p>What is <a href="https://github.com/apache/spark/blob/v3.1.1/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala" rel="nofollow" target="_blank">BlockManagerId.scala</a>? <br /></p>
<pre class="brush:java; toolbar: false; auto-links: false">class BlockManagerId private (<br /> private var executorId_ : String,<br /> private var host_ : String,<br /> private var port_ : Int,<br /> private var topologyInfo_ : Option[String])<br /> extends Externalizable {</pre>
<p>SparkListenerBlockManagerAdded contains the executor's resource information such as executorId, hostname, port, and max memory size.<br /></p><h3 style="text-align: left;">4. SparkListenerBlockManagerRemoved <br /></h3><p>Sample json object:</p>
<pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "SparkListenerBlockManagerRemoved",<br /> "Block Manager ID": {<br /> "Executor ID": "1",<br /> "Host": "myhostname",<br /> "Port": 12345<br /> },<br /> "Timestamp": 1111111111111<br />}</pre>
<p>Case class definition:</p>
<pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerBlockManagerRemoved(time: Long, blockManagerId: BlockManagerId)</pre>
<p>SparkListenerBlockManagerRemoved contains the timestamp when an executor gets removed.</p><p>Normally it means some executor failed with an error, and we may see it come together with SparkListenerExecutorRemoved. <br /></p><h3 style="text-align: left;">5. SparkListenerEnvironmentUpdate <br /></h3><p>Sample json object:</p>
<pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "SparkListenerEnvironmentUpdate",<br /> "JVM Information": {<br /> "Java Home": "/xxx/xxx/xxx/envs/xxx",<br /> "Java Version": "11.0.9.1-internal (Oracle Corporation)",<br /> "Scala Version": "version 2.12.10"<br /> },<br /> "Spark Properties": {<br /> "spark.rapids.sql.exec.CollectLimitExec": "true",<br /> "spark.executor.resource.gpu.amount": "1",<br /> "spark.rapids.sql.concurrentGpuTasks": "1",<br /> ...<br /> }<br /> "Hadoop Properties": {<br /> "yarn.resourcemanager.amlauncher.thread-count": "50",<br /> "dfs.namenode.resource.check.interval": "5000",<br /> ...<br /> }<br /> "System Properties": {<br /> "java.io.tmpdir": "/tmp",<br /> "line.separator": "\n", <br /> ... <br /> }<br /> "Classpath Entries": {<br /> "/home/xxx/spark/jars/curator-framework-2.7.1.jar": "System Classpath",<br /> "/home/xxx/spark/jars/parquet-encoding-1.10.1.jar": "System Classpath",<br /> "/home/xxx/spark/jars/commons-dbcp-1.4.jar": "System Classpath",<br /> ...<br /> }<br />}</pre>
<p>Case class definition:</p>
<pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerEnvironmentUpdate(environmentDetails: Map[String, Seq[(String, String)]])<br /> extends SparkListenerEvent</pre>
<p>SparkListenerEnvironmentUpdate carries a Map which contains the Spark/Hadoop/System/... properties.</p><p>It is useful for doing parameter checks, for example with a lookup like the sketch above. <br /></p><h3 style="text-align: left;">6. SparkListenerTaskStart <br /></h3><p>Sample json object:</p>
<pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "SparkListenerTaskStart",<br /> "Stage ID": 0,<br /> "Stage Attempt ID": 0,<br /> "Task Info": {<br /> "Task ID": 0,<br /> "Index": 0,<br /> "Attempt": 0,<br /> "Launch Time": 1618341870400,<br /> "Executor ID": "0",<br /> "Host": "111.111.111.111",<br /> "Locality": "PROCESS_LOCAL",<br /> "Speculative": false,<br /> "Getting Result Time": 0,<br /> "Finish Time": 0,<br /> "Failed": false,<br /> "Killed": false,<br /> "Accumulables": []<br /> }<br />}</pre>
<p>Case class definition:</p><pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerTaskStart(stageId: Int, stageAttemptId: Int, taskInfo: TaskInfo)</pre><p>What is <a href="https://github.com/apache/spark/blob/v3.1.1/core/src/main/scala/org/apache/spark/scheduler/TaskInfo.scala" rel="nofollow" target="_blank">TaskInfo</a>? <br /></p>
<pre class="brush:text; toolbar: false; auto-links: false">class TaskInfo(<br /> val taskId: Long,<br /> /**<br /> * The index of this task within its task set. Not necessarily the same as the ID of the RDD<br /> * partition that the task is computing.<br /> */<br /> val index: Int,<br /> val attemptNumber: Int,<br /> val launchTime: Long,<br /> val executorId: String,<br /> val host: String,<br /> val taskLocality: TaskLocality.TaskLocality,<br /> val speculative: Boolean) {</pre>
<p>SparkListenerTaskStart contains the task start time and related executor information.</p><p>Note: Normally the accumulables are empty in the beginning. <br /></p><h3 style="text-align: left;">7. SparkListenerApplicationStart <br /></h3><p>Sample json object:</p><pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "SparkListenerApplicationStart",<br /> "App Name": "Spark Pi",<br /> "App ID": "app-20210413122423-0000",<br /> "Timestamp": 1618341862473,<br /> "User": "xxxx"<br />}</pre>
<p>Case class definition:</p><pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerApplicationStart(<br /> appName: String,<br /> appId: Option[String],<br /> time: Long,<br /> sparkUser: String,<br /> appAttemptId: Option[String],<br /> driverLogs: Option[Map[String, String]] = None,<br /> driverAttributes: Option[Map[String, String]] = None) extends SparkListenerEvent</pre>
<p>SparkListenerApplicationStart contains the application start time, application name, application ID and user name.</p><p>Normally there is only one such event in each event log.<br /></p><h3 style="text-align: left;">8. SparkListenerExecutorAdded <br /></h3><p>Sample json object:</p><pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "SparkListenerExecutorAdded",<br /> "Timestamp": 1618341865601,<br /> "Executor ID": "0",<br /> "Executor Info": {<br /> "Host": "111.111.111.111",<br /> "Total Cores": 16,<br /> "Log Urls": {<br /> "stdout": "http://111.111.111.111:8081/logPage/?appId=app-20210413122423-0000&executorId=0&logType=stdout",<br /> "stderr": "http://111.111.111.111:8081/logPage/?appId=app-20210413122423-0000&executorId=0&logType=stderr"<br /> },<br /> "Attributes": {},<br /> "Resources": {<br /> "gpu": {<br /> "name": "gpu",<br /> "addresses": [<br /> "0"<br /> ]<br /> }<br /> },<br /> "Resource Profile Id": 0<br /> }<br />} </pre><p>Case class definition:</p><pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerExecutorAdded(time: Long, executorId: String, executorInfo: ExecutorInfo)</pre><p>What is <a href="https://github.com/apache/spark/blob/v3.1.1//core/src/main/scala/org/apache/spark/scheduler/cluster/ExecutorInfo.scala" rel="nofollow" target="_blank">ExecutorInfo</a>? <br /></p><pre class="brush:java; toolbar: false; auto-links: false">class ExecutorInfo(<br /> val executorHost: String,<br /> val totalCores: Int,<br /> val logUrlMap: Map[String, String],<br /> val attributes: Map[String, String],<br /> val resourcesInfo: Map[String, ResourceInformation],<br /> val resourceProfileId: Int) { </pre>
<p>SparkListenerExecutorAdded contains the timestamp and executor information. </p><p>Note that it is associated with a resource profile. <br /></p><h3 style="text-align: left;">9. SparkListenerExecutorRemoved <br /></h3><p>Sample json object:</p><pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "SparkListenerExecutorRemoved",<br /> "Timestamp": 1111111111111,<br /> "Executor ID": "1",<br /> "Removed Reason": "Container from a bad node: container_1111111111111_1111_11_111111 on host: abc.abc.abc.abc"<br />}</pre>
<p>Case class definition:</p><pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerExecutorRemoved(time: Long, executorId: String, reason: String)</pre><p>SparkListenerExecutorRemoved contains the timestamp and the reason why an executor gets removed.</p><p>Normally it means executor fails due to some reason such as OOM. <br /></p><h3 style="text-align: left;">10. SparkListenerSQLExecutionStart <br /></h3><p>Sample json object:</p>
<pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart",<br /> "executionId": 3,<br /> "description": "select count(*) from customer a, customer b where a.c_customer_id=b.c_customer_id+10",<br /> "details": "org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)\njava.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\njava.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\njava.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\njava.base/java.lang.reflect.Method.invoke(Method.java:566)\norg.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)\norg.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)\norg.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)\norg.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)\norg.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)\norg.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1030)\norg.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1039)\norg.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)",<br /> "physicalPlanDescription": "== Physical Plan ==\nGpuColumnarToRow (14)\n+- GpuHashAggregate (13)\n +- GpuShuffleCoalesce (12)\n +- GpuColumnarExchange (11)\n +- GpuHashAggregate (10)\n +- GpuProject (9)\n +- GpuBroadcastHashJoin (8)\n :- GpuCoalesceBatches (3)\n : +- GpuFilter (2)\n : +- GpuScan parquet tpcds.customer (1)\n +- GpuBroadcastExchange (7)\n +- GpuCoalesceBatches (6)\n +- GpuFilter (5)\n +- GpuScan parquet tpcds.customer (4)\n\n\n(1) GpuScan parquet tpcds.customer\nOutput [1]: [c_customer_id#2]\nBatched: true\nLocation: InMemoryFileIndex [file:/home/xxxxx/data/tpcds_100G_parquet/customer]\nPushedFilters: [IsNotNull(c_customer_id)]\nReadSchema: struct<c_customer_id:string>\n\n(2) GpuFilter\nInput [1]: [c_customer_id#2]\nArguments: gpuisnotnull(c_customer_id#2)\n\n(3) GpuCoalesceBatches\nInput [1]: [c_customer_id#2]\nArguments: TargetSize(2147483647)\n\n(4) GpuScan parquet tpcds.customer\nOutput [1]: [c_customer_id#27]\nBatched: true\nLocation: InMemoryFileIndex [file:/home/xxxxx/data/tpcds_100G_parquet/customer]\nPushedFilters: [IsNotNull(c_customer_id)]\nReadSchema: struct<c_customer_id:string>\n\n(5) GpuFilter\nInput [1]: [c_customer_id#27]\nArguments: gpuisnotnull(c_customer_id#27)\n\n(6) GpuCoalesceBatches\nInput [1]: [c_customer_id#27]\nArguments: TargetSize(2147483647)\n\n(7) GpuBroadcastExchange\nInput [1]: [c_customer_id#27]\nArguments: HashedRelationBroadcastMode(List(knownfloatingpointnormalized(normalizenanandzero((cast(input[0, string, false] as double) + 10.0)))),false), [id=#97]\n\n(8) GpuBroadcastHashJoin\nLeft output [1]: [c_customer_id#2]\nRight output [1]: [c_customer_id#27]\nArguments: [gpuknownfloatingpointnormalized(gpunormalizenanandzero(cast(c_customer_id#2 as double)))], [gpuknownfloatingpointnormalized(gpunormalizenanandzero((cast(c_customer_id#27 as double) + 10.0)))], Inner, GpuBuildRight\n\n(9) GpuProject\nInput [2]: [c_customer_id#2, c_customer_id#27]\n\n(10) GpuHashAggregate\nInput: []\nKeys: []\nFunctions [1]: [partial_gpucount(1)]\nAggregate Attributes [1]: [count#46L]\nResults [1]: [count#47L]\n\n(11) GpuColumnarExchange\nInput [1]: [count#47L]\nArguments: gpusinglepartitioning$(), ENSURE_REQUIREMENTS, [id=#101]\n\n(12) 
GpuShuffleCoalesce\nInput [1]: [count#47L]\nArguments: 2147483647\n\n(13) GpuHashAggregate\nInput [1]: [count#47L]\nKeys: []\nFunctions [1]: [gpucount(1)]\nAggregate Attributes [1]: [count(1)#25L]\nResults [1]: [count(1)#25L AS count(1)#44L]\n\n(14) GpuColumnarToRow\nInput [1]: [count(1)#44L]\nArguments: false\n\n",<br /> "sparkPlanInfo": {<br /> "nodeName": "GpuColumnarToRow",<br /> "simpleString": "GpuColumnarToRow false",<br /> "children": [<br /> {<br /> "nodeName": "GpuHashAggregate",<br /> "simpleString": "GpuHashAggregate(keys=[], functions=[gpucount(1)]), filters=List(None))",<br /> "children": [<br /> ...<br /> "children": [<br /> {<br /> "nodeName": "GpuScan parquet tpcds.customer",<br /> "simpleString": "GpuFileGpuScan parquet tpcds.customer[c_customer_id#2] Batched: true, DataFilters: [isnotnull(c_customer_id#2)], Format: Parquet, Location: InMemoryFileIndex[file:/home/xxxxx/data/tpcds_100G_parquet/customer], PartitionFilters: [], PushedFilters: [IsNotNull(c_customer_id)], ReadSchema: struct<c_customer_id:string>",<br /> "children": [],<br /> "metadata": {},<br /> "metrics": [<br /> {<br /> "name": "number of files read",<br /> "accumulatorId": 209,<br /> "metricType": "sum"<br /> },</pre>
<p>Case class definition:</p>
<pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerSQLExecutionStart(<br /> executionId: Long,<br /> description: String,<br /> details: String,<br /> physicalPlanDescription: String,<br /> sparkPlanInfo: SparkPlanInfo,<br /> time: Long)<br /> extends SparkListenerEvent</pre>What is <a href="https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanInfo.scala" rel="nofollow" target="_blank">SparkPlanInfo</a>? <br />
<pre class="brush:java; toolbar: false; auto-links: false">class SparkPlanInfo(<br /> val nodeName: String,<br /> val simpleString: String,<br /> val children: Seq[SparkPlanInfo],<br /> val metadata: Map[String, String],<br /> val metrics: Seq[SQLMetricInfo]) {</pre>
<p>What is <a href="https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetricInfo.scala" rel="nofollow" target="_blank">SQLMetricInfo</a>? <br /></p><pre class="brush:java; toolbar: false; auto-links: false">class SQLMetricInfo(<br /> val name: String,<br /> val accumulatorId: Long,<br /> val metricType: String) </pre><p> Now we are getting the complex part. </p><p>SparkListenerSQLExecutionStart contains the query plan, and its accumulables(metrics) definition.</p><p>Remember that here the query plan information may be overridden by upcoming AQE related events SparkListenerSQLAdaptiveExecutionUpdate;</p><p>And the accumulables(metrics) definition could be overriden by upcoming AQE related events SparkListenerSQLAdaptiveSQLMetricUpdates.</p><p>So none of them are final. Please remember they may change later when parsing this event.<br /></p><p>Note: The SQL plan accumulables are associated with its SQL Plan Node by nodeID!<br /></p><p>For example, when the final parsing is done, it should show the mapping relationship between SQL plan nodeID <=> accumulatorId:</p>
<pre class="brush:text; toolbar: false; auto-links: false">+-----+------+---------------------+-------------+-------------------------+------------+----------+<br />|sqlID|nodeID|nodeName |accumulatorId|name |max_value |metricType|<br />+-----+------+---------------------+-------------+-------------------------+------------+----------+<br />|11 |5 |Scan parquet |123 |number of output rows |11 |sum |<br />|11 |5 |Scan parquet |124 |number of files read |1 |sum |<br />|11 |5 |Scan parquet |125 |metadata time |1 |timing |<br />|11 |5 |Scan parquet |126 |size of files read |1111 |size |<br />|11 |5 |Scan parquet |127 |scan time |11 |timing |</pre>
<h3 style="text-align: left;">11. SparkListenerSQLExecutionEnd <br /></h3><p>Sample json object:</p>
<pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionEnd",<br /> "executionId": 0,<br /> "time": 1617729547596<br />} </pre>
<p>Case class definition:</p><pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerSQLExecutionEnd(executionId: Long, time: Long)</pre><p>Easy: it contains the SQL end timestamp. If we map the end timestamp to previous start time, we can get the SQL duration in ms.<br /></p><h3 style="text-align: left;">12. SparkListenerDriverAccumUpdates <br /></h3><p>Sample json object:</p><pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "org.apache.spark.sql.execution.ui.SparkListenerDriverAccumUpdates",<br /> "executionId": 2,<br /> "accumUpdates": [<br /> [<br /> 67,<br /> 1<br /> ],<br /> [<br /> 68,<br /> 2<br /> ],<br /> [<br /> 69,<br /> 106281839<br /> ]<br /> ]<br />}</pre><p>Case class definition:</p><pre class="brush:java; toolbar: false; auto-links: false"> * @param executionId The execution id for a query, so we can find the query plan.<br /> * @param accumUpdates Map from accumulator id to the metric value (metrics are always 64-bit ints).<br /> <br />case class SparkListenerDriverAccumUpdates(<br /> executionId: Long,<br /> @JsonDeserialize(contentConverter = classOf[LongLongTupleConverter])<br /> accumUpdates: Seq[(Long, Long)])</pre>
<p>SparkListenerDriverAccumUpdates mainly sends the accumulator id => accumulator value pairs.</p><p>To figure out what an accumulator means, we need to join against the SQLMetricInfo obtained earlier from SparkListenerSQLExecutionStart and possibly from the upcoming SparkListenerSQLAdaptiveSQLMetricUpdates.</p><p>So we need to wait until all of the SparkListenerSQLExecutionStart and SparkListenerSQLAdaptiveSQLMetricUpdates events have been processed, and then match the accumulator id to get the accumulator name and its associated query plan node.<br /></p><h3 style="text-align: left;">13. SparkListenerJobStart <br /></h3><p>Sample json object:</p>
<pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "SparkListenerJobStart",<br /> "Job ID": 0,<br /> "Submission Time": 1617729577252,<br /> "Stage Infos": [<br /> {<br /> "Stage ID": 0,<br /> "Stage Attempt ID": 0,<br /> "Stage Name": "executeColumnar at GpuShuffleCoalesceExec.scala:67",<br /> "Number of Tasks": 16,<br /> "RDD Info": [<br /> {<br /> "RDD ID": 3,<br /> "Name": "MapPartitionsRDD",<br /> "Scope": "{\"id\":\"7\",\"name\":\"GpuColumnarExchange\"}",<br /> "Callsite": "executeColumnar at GpuShuffleCoalesceExec.scala:67",<br /> "Parent IDs": [<br /> 2<br /> ],<br /> "Storage Level": {<br /> "Use Disk": false,<br /> "Use Memory": false,<br /> "Deserialized": false,<br /> "Replication": 1<br /> },<br /> "Barrier": false,<br /> "Number of Partitions": 16,<br /> "Number of Cached Partitions": 0,<br /> "Memory Size": 0,<br /> "Disk Size": 0<br /> },<br /> ...<br /> "Accumulables": [],<br /> "Resource Profile Id": 0<br /> }<br /> ],<br /> "Stage IDs": [<br /> 0,<br /> 1,<br /> 2<br /> ],<br /> "Properties": {<br /> "spark.rapids.sql.exec.CollectLimitExec": "true",<br /> "spark.executor.resource.gpu.amount": "1",<br /> "spark.rapids.sql.concurrentGpuTasks": "1",<br /> ...<br /> }<br />}</pre>
<p>Case class definition:</p>
<pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerJobStart(<br /> jobId: Int,<br /> time: Long,<br /> stageInfos: Seq[StageInfo],<br /> properties: Properties = null)<br /> extends SparkListenerEvent {<br /> // Note: this is here for backwards-compatibility with older versions of this event which<br /> // only stored stageIds and not StageInfos:<br /> val stageIds: Seq[Int] = stageInfos.map(_.stageId)<br />}</pre>
<p>What is <a href="https://github.com/apache/spark/blob/v3.1.1/core/src/main/scala/org/apache/spark/scheduler/StageInfo.scala" rel="nofollow" target="_blank">StageInfo</a>? <br /></p>
<pre class="brush:java; toolbar: false; auto-links: false">class StageInfo(<br /> val stageId: Int,<br /> private val attemptId: Int,<br /> val name: String,<br /> val numTasks: Int,<br /> val rddInfos: Seq[RDDInfo],<br /> val parentIds: Seq[Int],<br /> val details: String,<br /> val taskMetrics: TaskMetrics = null,<br /> private[spark] val taskLocalityPreferences: Seq[Seq[TaskLocation]] = Seq.empty,<br /> private[spark] val shuffleDepId: Option[Int] = None,<br /> val resourceProfileId: Int) {<br /> /** When this stage was submitted from the DAGScheduler to a TaskScheduler. */<br /> var submissionTime: Option[Long] = None<br /> /** Time when all tasks in the stage completed or when the stage was cancelled. */<br /> var completionTime: Option[Long] = None<br /> /** If the stage failed, the reason why. */<br /> var failureReason: Option[String] = None<br /><br /> /**<br /> * Terminal values of accumulables updated during this stage, including all the user-defined<br /> * accumulators.<br /> */<br /> val accumulables = HashMap[Long, AccumulableInfo]()</pre>
<p>SparkListenerJobStart has the StageInfo which contains RDD information.</p><p>When a job starts, it may also contain modified properties which can override the application-level properties obtained from SparkListenerEnvironmentUpdate. </p><p>It means that, within the same application (event log), Spark parameters can change, so do not assume the parameters are always static inside the same application.</p><h3 style="text-align: left;">14. SparkListenerStageSubmitted<br /></h3><p>Sample json object:</p><pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "SparkListenerStageSubmitted",<br /> "Stage Info": {<br /> "Stage ID": 1,<br /> "Stage Attempt ID": 0,<br /> "Stage Name": "executeColumnar at GpuShuffleCoalesceExec.scala:67",<br /> "Number of Tasks": 1000,<br /> "RDD Info": [<br /> {<br /> "RDD ID": 8,<br /> "Name": "MapPartitionsRDD",<br /> "Scope": "{\"id\":\"3\",\"name\":\"GpuColumnarExchange\"}",<br /> "Callsite": "executeColumnar at GpuShuffleCoalesceExec.scala:67",<br /> "Parent IDs": [<br /> 7<br /> ...<br /> "Submission Time": 1617729578789,<br /> "Accumulables": [],<br /> "Resource Profile Id": 0<br /> },<br /> "Properties": {<br /> "spark.rapids.sql.exec.CollectLimitExec": "true",<br /> "spark.executor.resource.gpu.amount": "1",<br /> ...</pre><p>Case class definition:</p>
<pre class="brush:text; toolbar: false; auto-links: false">case class SparkListenerStageSubmitted(stageInfo: StageInfo, properties: Properties = null)</pre>
<p>Similar to SparkListenerJobStart, the StageInfo is the key content here.</p><p>And again, parameters could change here.<br /></p><h3 style="text-align: left;">15. SparkListenerTaskEnd <br /></h3><p>Sample json object:</p>
<pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "SparkListenerTaskEnd",<br /> "Stage ID": 1,<br /> "Stage Attempt ID": 0,<br /> "Task Type": "ShuffleMapTask",<br /> "Task End Reason": {<br /> "Reason": "Success"<br /> },<br /> "Task Info": {<br /> "Task ID": 17,<br /> "Index": 1,<br /> "Attempt": 0,<br /> "Launch Time": 1617729578802,<br /> "Executor ID": "0",<br /> "Host": "192.192.192.2",<br /> "Locality": "PROCESS_LOCAL",<br /> "Speculative": false,<br /> "Getting Result Time": 0,<br /> "Finish Time": 1617729578977,<br /> "Failed": false,<br /> "Killed": false,<br /> "Accumulables": [<br /> {<br /> "ID": 21,<br /> "Name": "output rows",<br /> "Update": "10",<br /> "Value": "10",<br /> "Internal": true,<br /> "Count Failed Values": true,<br /> "Metadata": "sql"<br /> },<br /> ...<br /> },<br /> "Task Executor Metrics": {<br /> "JVMHeapMemory": 0,<br /> "JVMOffHeapMemory": 0,<br /> "OnHeapExecutionMemory": 0,<br /> "OffHeapExecutionMemory": 0,<br /> ...<br /> "Task Metrics": {<br /> "Executor Deserialize Time": 73,<br /> "Executor Deserialize CPU Time": 16058445,<br /> "Executor Run Time": 92,<br /> "Executor CPU Time": 59345832,<br /> "Peak Execution Memory": 0,<br /> "Result Size": 5303,<br /> "JVM GC Time": 0,<br /> "Result Serialization Time": 0,<br /> "Memory Bytes Spilled": 0,<br /> "Disk Bytes Spilled": 0,<br /> "Shuffle Read Metrics": {<br /> "Remote Blocks Fetched": 0,<br /> "Local Blocks Fetched": 1,<br /> "Fetch Wait Time": 0,<br /> "Remote Bytes Read": 0,<br /> "Remote Bytes Read To Disk": 0,<br /> "Local Bytes Read": 20652,<br /> "Total Records Read": 1<br /> },<br /> "Shuffle Write Metrics": {<br /> "Shuffle Bytes Written": 86,<br /> "Shuffle Write Time": 2697954,<br /> "Shuffle Records Written": 1<br /> },<br /> "Input Metrics": {<br /> "Bytes Read": 0,<br /> "Records Read": 0<br /> },<br /> "Output Metrics": {<br /> "Bytes Written": 0,<br /> "Records Written": 0<br /> },<br /> "Updated Blocks": []<br /> }<br />}</pre>
<p>Case class definition:</p>
<pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerTaskEnd(<br /> stageId: Int,<br /> stageAttemptId: Int,<br /> taskType: String,<br /> reason: TaskEndReason,<br /> taskInfo: TaskInfo,<br /> taskExecutorMetrics: ExecutorMetrics,<br /> // may be null if the task has failed<br /> @Nullable taskMetrics: TaskMetrics)<br /> extends SparkListenerEvent</pre>
<p>What is <a href="https://github.com/apache/spark/blob/v3.1.1/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala" rel="nofollow" target="_blank">TaskMetrics</a>? <br /></p>
<pre class="brush:java; toolbar: false; auto-links: false">class TaskMetrics private[spark] () extends Serializable {<br /> // Each metric is internally represented as an accumulator<br /> private val _executorDeserializeTime = new LongAccumulator<br /> private val _executorDeserializeCpuTime = new LongAccumulator<br /> private val _executorRunTime = new LongAccumulator<br /> private val _executorCpuTime = new LongAccumulator<br /> private val _resultSize = new LongAccumulator<br /> private val _jvmGCTime = new LongAccumulator<br /> private val _resultSerializationTime = new LongAccumulator<br /> private val _memoryBytesSpilled = new LongAccumulator<br /> private val _diskBytesSpilled = new LongAccumulator<br /> private val _peakExecutionMemory = new LongAccumulator<br /> private val _updatedBlockStatuses = new CollectionAccumulator[(BlockId, BlockStatus)]</pre>
<p>SparkListenerTaskEnd may be the most important event if we want to profile performance based on the event log.</p><p>Normally a Spark performance checking tool aggregates these TaskMetrics at the stage, job or SQL level (as in the sketch above).</p><p>From previous events we can find the job <-> stage and SQL <-> job mappings; together with the task <-> stage mapping obtained from this event, we can easily join them together and do the aggregation.</p><p>Note that this event also sends out lots of accumulables.</p><p>Now we know how many of the events are sending and dealing with accumulables.<br /></p><h3 style="text-align: left;">16. SparkListenerStageCompleted<br /></h3><p>Sample json object:</p><pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "SparkListenerStageCompleted",<br /> "Stage Info": {<br /> "Stage ID": 0,<br /> "Stage Attempt ID": 0,<br /> "Stage Name": "executeColumnar at GpuShuffleCoalesceExec.scala:67",<br /> "Number of Tasks": 16,<br /> "RDD Info": [<br /> {<br /> "RDD ID": 3,<br /> "Name": "MapPartitionsRDD",<br /> "Scope": "{\"id\":\"7\",\"name\":\"GpuColumnarExchange\"}",<br /> "Callsite": "executeColumnar at GpuShuffleCoalesceExec.scala:67",<br /> "Parent IDs": [<br /> 2<br /> ],<br />...<br /> "Submission Time": 1617729577270,<br /> "Completion Time": 1617729578759,<br /> "Accumulables": [<br /> {<br /> "ID": 47,<br /> "Name": "output rows",<br /> "Value": "2000000",<br /> "Internal": true,<br /> "Count Failed Values": true,<br /> "Metadata": "sql"<br /> },<br /> ],<br /> "Resource Profile Id": 0<br /> }<br />}</pre><p>Case class definition:</p><pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerStageCompleted(stageInfo: StageInfo) extends SparkListenerEvent</pre>
<p>Again : StageInfo is the key content, and again, accumulables inside StageInfo.</p><h3 style="text-align: left;">17. SparkListenerJobEnd <br /></h3><p>Sample json object:</p><pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "SparkListenerJobEnd",<br /> "Job ID": 0,<br /> "Completion Time": 1617729581438,<br /> "Job Result": {<br /> "Result": "JobSucceeded"<br /> }<br />}</pre><p>Case class definition:</p><pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerJobEnd(<br /> jobId: Int,<br /> time: Long,<br /> jobResult: JobResult)<br /> extends SparkListenerEvent</pre><p>SparkListenerJobEnd shows the job end timestamp which can be calculated to job duration.</p><p>Here <a href="https://github.com/apache/spark/blob/v3.1.1/core/src/main/scala/org/apache/spark/scheduler/JobResult.scala" rel="nofollow" target="_blank">JobResult</a> is a trait which can be used to fetch job status when finishing.<br /></p><h3 style="text-align: left;">18. SparkListenerTaskGettingResult <br /></h3><p>Sample json object:</p><pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "SparkListenerTaskGettingResult",<br /> "Task Info": {<br /> "Task ID": 1024,<br /> "Index": 7,<br /> "Attempt": 0,<br /> "Launch Time": 1617729607875,<br /> "Executor ID": "0",<br /> "Host": "111.111.111.111",<br /> "Locality": "PROCESS_LOCAL",<br /> "Speculative": false,<br /> "Getting Result Time": 1617729608076,<br /> "Finish Time": 0,<br /> "Failed": false,<br /> "Killed": false,<br /> "Accumulables": []<br /> }<br />} </pre><p>Case class definition:</p><pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerTaskGettingResult(taskInfo: TaskInfo) extends SparkListenerEvent</pre><p>SparkListenerTaskGettingResult can show the getting result time for specific task. <br /></p><h3 style="text-align: left;">19. SparkListenerApplicationEnd <br /></h3><p>Sample json object:</p><pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "SparkListenerApplicationEnd",<br /> "Timestamp": 1617729611879<br />}</pre><p>Case class definition:</p><pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerApplicationEnd(time: Long) extends SparkListenerEvent</pre><p>SparkListenerApplicationEnd only let us know the end timestamp for the application.<br /></p><h3 style="text-align: left;">20. SparkListenerSQLAdaptiveExecutionUpdate <br /></h3><p>Sample json object:</p><pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "org.apache.spark.sql.execution.ui.SparkListenerSQLAdaptiveExecutionUpdate",<br /> "executionId": 11,<br /> "physicalPlanDescription": "== Parsed Logical Plan ==...<br /> "sparkPlanInfo": {<br /> "nodeName": "GpuColumnarToRow",<br /> "simpleString": "GpuColumnarToRow false",<br /> "children": [<br /> {</pre><p>Case class definition:</p><pre class="brush:java; toolbar: false; auto-links: false">case class SparkListenerSQLAdaptiveExecutionUpdate(<br /> executionId: Long,<br /> physicalPlanDescription: String,<br /> sparkPlanInfo: SparkPlanInfo)<br /> extends SparkListenerEvent</pre><p>SparkListenerSQLAdaptiveExecutionUpdate can be triggered when AQE is on, and it will override the query plan got from previous event SparkListenerSQLExecutionStart.</p><p>So if AQE is turned on(or in the future Spark 3.2 may turn on AQE by default), make sure wait for processing SparkListenerSQLAdaptiveExecutionUpdate before processing the query plan. 
<br /></p><p>This can impact accumulables because accumulables are defined inside SparkPlanInfo.</p><p>So the best way is to wait until all AQE related events have arrived, and then deduplicate the SparkPlanInfo collected before starting to calculate any accumulables.</p><h3 style="text-align: left;">21. SparkListenerSQLAdaptiveSQLMetricUpdates <br /></h3><p>Sample json object:</p><pre class="brush:text; toolbar: false; auto-links: false">{<br /> "Event": "org.apache.spark.sql.execution.ui.SparkListenerSQLAdaptiveSQLMetricUpdates",<br /> "executionId": 11,<br /> "sqlPlanMetrics": [<br /> {<br /> "name": "shuffle records written",<br /> "accumulatorId": 1111,<br /> "metricType": "sum"<br /> },<br /> {<br /> "name": "shuffle write time",<br /> "accumulatorId": 2222,<br /> "metricType": "nsTiming"<br /> },</pre><p>Case class definition:</p><pre class="brush:java; toolbar: false; auto-links: false"> case class SparkListenerSQLAdaptiveSQLMetricUpdates(<br /> executionId: Long,<br /> sqlPlanMetrics: Seq[SQLPlanMetric])<br /> extends SparkListenerEvent</pre><p>Again, accumulables. This event will update/add accumulables from SQLPlanMetric.<br /></p><p> </p><p>In all, there are many different kinds of events in the Spark event log, and I believe there could be more.</p><p>We need to look into the Spark source code to understand how they work together to define the performance metrics at the application, SQL, job, stage and task levels.</p><p>Especially for accumulables, there are more than 6 types of events dealing with them:</p><ul style="text-align: left;"><li>Define accumulable types: SparkListenerSQLExecutionStart, SparkListenerSQLAdaptiveExecutionUpdate <br /></li><li>Send accumulable values: SparkListenerTaskEnd, SparkListenerStageCompleted, SparkListenerDriverAccumUpdates, SparkListenerSQLAdaptiveSQLMetricUpdates<br /></li></ul><p>For example, to calculate the max value of an accumulator, you may need to scan through all of the above events to get the real max value. <br /></p><p> </p><p> </p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-80343546898360879092021-04-20T21:00:00.003-07:002021-04-21T12:02:46.214-07:00How to use latest version of Rapids Accelerator for Spark on EMR<h1 style="text-align: left;">Goal:</h1><p>This article shows how to use the latest version of the Rapids Accelerator for Spark on EMR. </p><p>Currently the latest EMR 6.2 only ships with Rapids Accelerator 0.2.0 with the cuDF 0.15 jar.</p><p>However as of today, the latest Rapids Accelerator is 0.4.1 with the cuDF 0.18 jar.</p><p></p><p><b>Note: These are NOT official steps for enabling RAPIDS+Spark on EMR, just some technical research.</b><br /></p><a name='more'></a><p></p><h1 style="text-align: left;">Env:</h1><p>EMR 6.2 <br /></p><h1 style="text-align: left;">Concept:</h1><p>As per the EMR doc on <a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-rapids.html" rel="nofollow" target="_blank">Using the Nvidia Spark-RAPIDS Accelerator for Spark</a>, it provides an option "enableSparkRapids":"true" in the configuration file when creating the EMR cluster.</p><p>Basically, before we look for a way to use the latest version of the Rapids Accelerator for Spark, we need to understand what this option does. </p><p>As per my tests on EMR 6.2, this option does the following:</p><p><b>1. Put the Rapids Accelerator 0.2.0 jar and cuDF 0.15 jar in the location below with soft links</b><br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">/usr/lib/spark/jars/rapids-4-spark_2.12-0.2.0.jar -> /usr/share/aws/emr/spark-rapids/lib/rapids-4-spark_2.12-0.2.0.jar<br />/usr/lib/spark/jars/cudf-0.15-cuda10-1.jar -> /usr/share/aws/emr/spark-rapids/lib/cudf-0.15-cuda10-1.jar</pre>
<p><b>2. Put the getGpusResources.sh and xgboost4j-spark_3.0-1.0.0-0.2.0.jar</b></p>
<pre class="brush:bash; toolbar: false; auto-links: false">/usr/lib/spark/jars/xgboost4j-spark_3.0-1.0.0-0.2.0.jar<br />/usr/lib/spark/scripts/gpu/getGpusResources.sh</pre>
<p>Now here is another action item which is done regardless of the option (even when "enableSparkRapids":"false"):<br /></p><p><b>3. Install the CUDA toolkit 10.1 with the soft link /usr/local/cuda pointing to it.<br /></b></p>
<pre class="brush:bash; toolbar: false; auto-links: false">/usr/local/cuda -> /mnt/nvidia/cuda-10.1</pre>
<p>Knowing all of the above, we may think of using <a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html" rel="nofollow" target="_blank">bootstrap actions</a> to replace those jars and install a newer version of the CUDA toolkit, say 11.0. </p><p>Unfortunately that alone does not work, because our bootstrap action script runs BEFORE the above steps. </p><p>It is as if the above steps were a second round of bootstrap actions.</p><p>Even if we use a bootstrap action script to replace the above jars with the latest versions and also install the latest CUDA toolkit 11.0 (which changes the soft link /usr/local/cuda to point to cuda-11.0), eventually you will see 2 versions of the Rapids Accelerator and cuDF jars in the same location, and /usr/local/cuda will be changed back to point to cuda-10.1.</p><h1 style="text-align: left;">Solution:</h1><p>The solution is to disable the option by setting it to false in the configuration: "enableSparkRapids":"false".</p><p>Since we already know what this option does, we just need to use bootstrap actions to mimic the same thing (of course, using the latest and greatest versions). <br /></p><h3 style="text-align: left;">1. Install CUDA Toolkit 11.0 and cuda-compat-11-0 </h3><p>We cannot simply install CUDA Toolkit 11.0 because the NVIDIA driver installed on EMR 6.2 is R418, while as per the <a href="https://docs.nvidia.com/deploy/cuda-compatibility/index.html" rel="nofollow" target="_blank">CUDA compatibility matrix</a>, the minimum driver version required by CUDA 11.0 is >= 450.36.06. </p><p>To make CUDA Toolkit 11.0 work on a lower driver version (forward compatibility), we need to install a package named "cuda-compat".<br /></p><p>First, we can find the commands to install this version on the <a href="https://developer.nvidia.com/cuda-downloads" rel="nofollow" target="_blank">CUDA download page</a>. <br /></p><p>Then how do we know the OS version on EMR? EMR has its own customized Linux OS, "Amazon Linux 2":</p>
<pre class="brush:bash; toolbar: false; auto-links: false"># cat /etc/os-release<br />NAME="Amazon Linux"<br />VERSION="2"<br />ID="amzn"<br />ID_LIKE="centos rhel fedora"<br />VERSION_ID="2"<br />PRETTY_NAME="Amazon Linux 2"<br />ANSI_COLOR="0;33"<br />CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"<br />HOME_URL="https://amazonlinux.com/"</pre>
<p>To figure out which package is compatible, we can get the base OS version by using this command:</p><pre class="brush:bash; toolbar: false; auto-links: false">rpm -E %{rhel}</pre><p>The above will tell you it is Red Hat 7 based (or compatible), so we know which OS version to choose.<br /></p><p>The commands below are what we need:</p>
<pre class="brush:bash; toolbar: false; auto-links: false">sudo yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo<br />sudo yum clean all<br />sudo yum -y install cuda-toolkit-11-0<br />sudo yum -y install cuda-compat-11-0 </pre>
<h3 style="text-align: left;">2. Fetch the Rapids Accelerator jar and cuDF jar</h3><p>You can always fetch the latest versions(or whatever version you want) by going to this <a href="https://nvidia.github.io/spark-rapids/docs/download.html" rel="nofollow" target="_blank">download page</a>. </p><p>Save the URLs for those 2 jars. Or you can choose to download them firstly and upload on a S3 bucket.<br /></p><p>In below example, I will fetch one jar directly from a URL, and fetch another jar from S3 bucket. <br /></p><h3 style="text-align: left;">3. Fetch the xgboost4j-spark jar<br /></h3><p> For spark 3.0, the latest jar can be downloaded here.</p><p><a href="https://repo1.maven.org/maven2/com/nvidia/xgboost4j-spark_3.0/" rel="nofollow" target="_blank">https://repo1.maven.org/maven2/com/nvidia/xgboost4j-spark_3.0/</a><br /></p><p>As of today, the latest version is:</p><p><a href="https://repo1.maven.org/maven2/com/nvidia/xgboost4j-spark_3.0/1.3.0-0.1.0/xgboost4j-spark_3.0-1.3.0-0.1.0.jar" rel="nofollow" target="_blank">https://repo1.maven.org/maven2/com/nvidia/xgboost4j-spark_3.0/1.3.0-0.1.0/xgboost4j-spark_3.0-1.3.0-0.1.0.jar </a><br /></p><p>Save this link.<br /></p><h3 style="text-align: left;">4. Fetch the getGpusResources.sh<br /></h3><p>Basically this file exist in Spark directory as well, but sometimes we do not know if our bootstrap script or some other EMR internal bootstrap script will run firstly.</p><p>It is better to always choose a stable link. Here let's use below link:</p><p><a href="https://raw.githubusercontent.com/apache/spark/master/examples/src/main/scripts/getGpusResources.sh" rel="nofollow" target="_blank">https://raw.githubusercontent.com/apache/spark/master/examples/src/main/scripts/getGpusResources.sh</a><br /></p><h3 style="text-align: left;">5. Prepare a bootstrap action script<br /></h3><p>Sample script named bootstrap-install-cuda-compat-11.sh:</p>
<pre class="brush:bash; toolbar: false; auto-links: false">#!/bin/bash<br /><br />set -ex<br /><br />sudo chmod a+rwx -R /sys/fs/cgroup/cpu,cpuacct<br />sudo chmod a+rwx -R /sys/fs/cgroup/devices<br /><br />echo "Install the cuda-compat-11-0"<br />sudo yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo<br />sudo yum clean all<br />sudo yum -y install cuda-toolkit-11-0<br />sudo yum -y install cuda-compat-11-0 <br />sudo rm -f /usr/lib/spark/jars/rapids-4-spark_2.12-0.2.0.jar<br />sudo rm -f /usr/share/aws/emr/spark-rapids/lib/rapids-4-spark_2.12-0.2.0.jar<br />sudo rm -f /usr/lib/spark/jars/cudf-0.15-cuda10-1.jar<br />sudo rm -f /usr/share/aws/emr/spark-rapids/lib/cudf-0.15-cuda10-1.jar<br />sudo mkdir -p /usr/share/aws/emr/spark-rapids/lib/<br />sudo mkdir -p /usr/lib/spark/jars/<br />sudo wget https://xxx/cudf-<version>.jar -O /usr/share/aws/emr/spark-rapids/lib/cudf-<version>.jar<br />sudo ln -s /usr/share/aws/emr/spark-rapids/lib/cudf-<version>.jar /usr/lib/spark/jars/cudf-<version>.jar<br />sudo aws s3 cp s3://<BUCKET-NAME>/rapids-4-spark_<version>.jar /usr/share/aws/emr/spark-rapids/lib/rapids-4-spark_<version>.jar<br />sudo ln -s /usr/share/aws/emr/spark-rapids/lib/rapids-4-spark_<version>.jar /usr/lib/spark/jars/rapids-4-spark_<version>.jar<br />sudo wget https://repo1.maven.org/maven2/com/nvidia/xgboost4j-spark_3.0/1.3.0-0.1.0/xgboost4j-spark_3.0-1.3.0-0.1.0.jar -O /usr/lib/spark/jars/xgboost4j-spark_3.0-1.3.0-0.1.0.jar<br />sudo mkdir -p /usr/lib/spark/scripts/gpu/<br />sudo wget https://raw.githubusercontent.com/apache/spark/master/examples/src/main/scripts/getGpusResources.sh -O /usr/lib/spark/scripts/gpu/getGpusResources.sh<br />sudo chmod +x /usr/lib/spark/scripts/gpu/getGpusResources.sh<br />sudo alternatives --set java /usr/lib/jvm/java-11-amazon-corretto.x86_64/bin/java</pre>
<p>Of course, you can make the above shell script more robust by adding more checks, but this is just a minimal demo.</p><p>You can find many other EMR bootstrap action scripts to refer to in <a href="https://github.com/aws-samples/emr-bootstrap-actions" rel="nofollow" target="_blank">this GitHub repository</a>.<br /></p><p>Then copy the above bootstrap action script to an S3 bucket:</p>
<pre class="brush:bash; toolbar: false; auto-links: false">chmod +x bootstrap-install-cuda-compat-11.sh<br />aws s3 cp bootstrap-install-cuda-compat-11.sh s3://BUCKET-NAME/bootstrap-install-cuda-compat-11.sh</pre>
<h3 style="text-align: left;">6. Prepare a configuration file<br /></h3><p>Say the name is EMR_java11_custom_bootstrap.json: <br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false; highlight: [5,50]">[<br /> {<br /> "Classification": "spark",<br /> "Properties": {<br /> "enableSparkRapids": "false"<br /> },<br /> "Configurations": []<br /> },<br /> {<br /> "Classification": "yarn-site",<br /> "Properties": {<br /> "yarn.nodemanager.linux-container-executor.cgroups.mount": "true",<br /> "yarn.nodemanager.linux-container-executor.cgroups.mount-path": "/sys/fs/cgroup",<br /> "yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables": "/usr/bin",<br /> "yarn.nodemanager.linux-container-executor.cgroups.hierarchy": "yarn",<br /> "yarn.nodemanager.container-executor.class": "org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor",<br /> "yarn.resource-types": "yarn.io/gpu",<br /> "yarn.nodemanager.resource-plugins": "yarn.io/gpu",<br /> "yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices": "auto"<br /> },<br /> "Configurations": []<br /> },<br /> {<br /> "Classification": "container-executor",<br /> "Properties": {},<br /> "Configurations": [<br /> {<br /> "Classification": "gpu",<br /> "Properties": {<br /> "module.enabled": "true"<br /> },<br /> "Configurations": []<br /> },<br /> {<br /> "Classification": "cgroups",<br /> "Properties": {<br /> "root": "/sys/fs/cgroup",<br /> "yarn-hierarchy": "yarn"<br /> },<br /> "Configurations": []<br /> }<br /> ]<br /> },<br /> {<br /> "Classification": "spark-defaults",<br /> "Properties": {<br /> "spark.task.cpus ": "1",<br /> "spark.rapids.sql.explain": "ALL",<br /> "spark.submit.pyFiles": "/usr/lib/spark/jars/xgboost4j-spark_3.0-1.3.0-0.1.0.jar",<br /> "spark.executor.extraLibraryPath": "/usr/local/cuda-11.0/targets/x86_64-linux/lib:/usr/local/cuda-11.0/extras/CUPTI/lib64:/usr/local/cuda-11.0/compat/:/usr/local/cuda-11.0/lib:/usr/local/cuda-11.0/lib64:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native",<br /> "spark.plugins": "com.nvidia.spark.SQLPlugin",<br /> "spark.executor.cores": "1",<br /> "spark.sql.files.maxPartitionBytes": "512m",<br /> "spark.executor.resource.gpu.discoveryScript": "/usr/lib/spark/scripts/gpu/getGpusResources.sh",<br /> "spark.sql.shuffle.partitions": "200",<br /> "spark.executor.defaultJavaOptions": "-XX:+IgnoreUnrecognizedVMOptions",<br /> "spark.task.resource.gpu.amount": "0.0625",<br /> "spark.rapids.memory.pinnedPool.size": "2G",<br /> "spark.executor.resource.gpu.amount": "1",<br /> "spark.rapids.sql.enabled": "true",<br /> "spark.sql.adaptive.enabled": "false",<br /> "spark.locality.wait": "0s",<br /> "spark.sql.sources.useV1SourceList": "",<br /> "spark.executor.memoryOverhead": "2G",<br /> "spark.driver.defaultJavaOptions": "-XX:+IgnoreUnrecognizedVMOptions",<br /> "spark.rapids.sql.concurrentGpuTasks": "1"<br /> },<br /> "Configurations": []<br /> },<br /> {<br /> "Classification": "capacity-scheduler",<br /> "Properties": {<br /> "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"<br /> },<br /> "Configurations": []<br /> },<br /> {<br /> "Classification": "spark-env",<br /> "Properties": {},<br /> "Configurations": [<br /> {<br /> "Classification": "export",<br /> "Properties": {<br /> "JAVA_HOME": "/usr/lib/jvm/java-11-amazon-corretto.x86_64/"<br /> },<br /> "Configurations": []<br /> }<br /> ]<br /> }<br />]</pre>
<p><b>Note: in the above configuration file, we specified /usr/local/cuda-11.0 in "spark.executor.extraLibraryPath" because the soft link /usr/local/cuda still points to the old cuda-10.1.</b></p><p><b>Note: /usr/local/cuda-11.0/compat/ contains the libs from the cuda-compat-11-0 package we installed earlier. </b><br /></p><h3 style="text-align: left;">7. Start the EMR cluster using CLI <br /></h3>
<pre class="brush:bash; toolbar: false; auto-links: false">aws emr create-cluster \<br />--release-label emr-6.2.0 \<br />--applications Name=Hadoop Name=Spark Name=Livy Name=JupyterEnterpriseGateway \<br />--service-role EMR_DefaultRole \<br />--ec2-attributes KeyName=hao-emr,InstanceProfile=EMR_EC2_DefaultRole \<br />--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.4xlarge \<br /> InstanceGroupType=CORE,InstanceCount=1,InstanceType=g4dn.2xlarge \<br /> InstanceGroupType=TASK,InstanceCount=1,InstanceType=g4dn.2xlarge \<br />--configurations file:///xxx/EMR_java11_custom_bootstrap.json \<br />--bootstrap-actions Name='My Spark Rapids Bootstrap action',Path=s3://BUCKET-NAME/bootstrap-install-cuda-compat-11.sh \<br />--ebs-root-volume-size 100 </pre>
<p><b>Note: the EBS root volume size should be increased from the default 10G to a larger value to avoid running out of disk space when installing packages using yum. <br /></b></p><h3 style="text-align: left;">8. Monitor the bootstrap process<br /></h3><p>Normally the master node will be ready first. So SSH into the master node, and find the bootstrap actions' logs here: /mnt/var/log/bootstrap-actions</p><h3 style="text-align: left;">9. Test</h3><p>Once all nodes are ready, run the following in spark-shell from the master node to make sure the GPU plan is shown:</p>
<pre class="brush:bash; toolbar: false; auto-links: false">val data = 1 to 100<br />val df1 = sc.parallelize(data).toDF()<br />val df2 = sc.parallelize(data).toDF()<br />val out = df1.as("df1").join(df2.as("df2"), $"df1.value" === $"df2.value")<br />out.count()<br />out.explain()</pre>
<h3 style="text-align: left;">10. Delete the EMR cluster once tests are done.<br /></h3>
<pre class="brush:bash; toolbar: false; auto-links: false">aws emr terminate-clusters --cluster-ids j-xxxxxxxxxxx</pre>
<h2 style="text-align: left;">Common issues<br /></h2><p><b>1. ERROR NativeDepsLoader: Could not load cudf jni library...</b></p><p>Below errors and stack trace show in Spark executor logs when launching spark-shell:<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">Caused by: java.util.concurrent.ExecutionException: java.lang.UnsatisfiedLinkError: /mnt/yarn/usercache/hadoop/appcache/application_xxx_xxx/container_xxx_xxx_01_xxxxx/tmp/nvcomp4429409488498215695.so: libcudart.so.11.0: cannot open shared object file: No such file or directory<br /> at java.util.concurrent.FutureTask.report(FutureTask.java:122)<br /> at java.util.concurrent.FutureTask.get(FutureTask.java:192)<br /> at ai.rapids.cudf.NativeDepsLoader.loadNativeDeps(NativeDepsLoader.java:167)<br /> ... 34 more<br />Caused by: java.lang.UnsatisfiedLinkError: /mnt/yarn/usercache/hadoop/appcache/application_xxx_xxx/container_xxx_xxx_01_xxxxx/tmp/nvcomp4429409488498215695.so: libcudart.so.11.0: cannot open shared object file: No such file or directory<br /> at java.lang.ClassLoader$NativeLibrary.load(Native Method)<br /> at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1934)<br /> at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1817)<br /> at java.lang.Runtime.load0(Runtime.java:810)<br /> at java.lang.System.load(System.java:1088)<br /> at ai.rapids.cudf.NativeDepsLoader.loadDep(NativeDepsLoader.java:184)<br /> at ai.rapids.cudf.NativeDepsLoader.loadDep(NativeDepsLoader.java:198)<br /> at ai.rapids.cudf.NativeDepsLoader.lambda$loadNativeDeps$1(NativeDepsLoader.java:161)<br /> ... 5 more</pre>
<p>Make sure CUDA Toolkit 11.0 is installed and its library path is set in spark.executor.extraLibraryPath of the configuration file.</p><p><b>2. ai.rapids.cudf.CudaException: CUDA driver version is insufficient for CUDA runtime version <br /></b></p><p>The error and stack trace below show up in the Spark executor logs when launching spark-shell:<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">ai.rapids.cudf.CudaException: CUDA driver version is insufficient for CUDA runtime version<br /> at ai.rapids.cudf.Cuda.setDevice(Native Method)<br /> at com.nvidia.spark.rapids.GpuDeviceManager$.setGpuDeviceAndAcquire(GpuDeviceManager.scala:95)<br /> at com.nvidia.spark.rapids.GpuDeviceManager$.$anonfun$initializeGpu$1(GpuDeviceManager.scala:122)<br /> at scala.runtime.java8.JFunction1$mcII$sp.apply(JFunction1$mcII$sp.java:23)<br /> at scala.Option.map(Option.scala:230)<br /> at com.nvidia.spark.rapids.GpuDeviceManager$.initializeGpu(GpuDeviceManager.scala:122)<br /> at com.nvidia.spark.rapids.GpuDeviceManager$.initializeGpuAndMemory(GpuDeviceManager.scala:130)<br /> at com.nvidia.spark.rapids.RapidsExecutorPlugin.init(Plugin.scala:168)<br /> at org.apache.spark.internal.plugin.ExecutorPluginContainer.$anonfun$executorPlugins$1(PluginContainer.scala:111)<br /> at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)<br /> at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)<br /> at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)<br /> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)<br /> at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)<br /> at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)<br /> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)<br /> at org.apache.spark.internal.plugin.ExecutorPluginContainer.<init>(PluginContainer.scala:99)<br /> at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:164)<br /> at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:152)<br /> at org.apache.spark.executor.Executor.$anonfun$plugins$1(Executor.scala:220)<br /> at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:221)<br /> at org.apache.spark.executor.Executor.<init>(Executor.scala:220)<br /> at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:168)<br /> at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)<br /> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)<br /> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)<br /> at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)<br /> at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)<br /> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)<br /> at java.util.concurrent.FutureTask.run(FutureTask.java:266)<br /> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)<br /> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)<br /> at java.lang.Thread.run(Thread.java:748)</pre>
<p>Make sure the cuda-compat-11-0 package is installed and its location is set correctly in spark.executor.extraLibraryPath of the configuration file.</p><p> </p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-45991958539285066032021-04-12T16:15:00.000-07:002021-04-12T16:15:00.203-07:00How to use NVIDIA Nsight Systems to profile a Spark on K8s job with Rapids Accelerator <h1 style="text-align: left;">Goal:</h1><p>This article explains how to use NVIDIA Nsight Systems to profile a Spark on K8s job with Rapids Accelerator.<br /></p><p>This is a follow-up blog after <a href="http://www.openkb.info/2021/04/how-to-use-nvidia-nsight-systems-to.html" rel="nofollow" target="_blank">How to use NVIDIA Nsight Systems to profile a Spark job on Rapids Accelerator</a>. <br /></p><h1 style="text-align: left;"><span><a name='more'></a></span>Env:</h1><p style="text-align: left;">Spark 3.1.1 (on Kubernetes)<br /></p><p style="text-align: left;">RAPIDS Accelerator for Apache Spark 0.5 snapshot</p><p style="text-align: left;">cuDF jar 0.19 snapshot<br /></p><h1 style="text-align: left;">Solution:</h1><p>Please read the <a href="http://www.openkb.info/2021/04/how-to-use-nvidia-nsight-systems-to.html" rel="nofollow" target="_blank">How to use NVIDIA Nsight Systems to profile a Spark job on Rapids Accelerator</a> blog and also the <a href="https://github.com/NVIDIA/spark-rapids/blob/branch-0.5/docs/get-started/getting-started-kubernetes.md" rel="nofollow" target="_blank">Getting Started with RAPIDS and Kubernetes</a> doc first. <br /></p><p>This blog will mainly focus on the differences for a Spark on Kubernetes job.<br /></p><h3 style="text-align: left;">1. Spark side<br /></h3><p>As we know, "nsys profile" should target a Spark Executor process. So the key is to find out how Spark starts an Executor in a Kubernetes cluster. <br /></p><p>Basically it is handled in <a href="https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh" rel="nofollow" target="_blank">resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh</a>: <br /></p><pre class="brush:bash; toolbar: false; auto-links: false"> executor)<br /> shift 1<br /> CMD=(<br /> ${JAVA_HOME}/bin/java<br /> "${SPARK_EXECUTOR_JAVA_OPTS[@]}"<br /> -Xms$SPARK_EXECUTOR_MEMORY<br /> -Xmx$SPARK_EXECUTOR_MEMORY<br /> -cp "$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH"<br /> org.apache.spark.executor.CoarseGrainedExecutorBackend<br /> --driver-url $SPARK_DRIVER_URL<br /> --executor-id $SPARK_EXECUTOR_ID<br /> --cores $SPARK_EXECUTOR_CORES<br /> --app-id $SPARK_APPLICATION_ID<br /> --hostname $SPARK_EXECUTOR_POD_IP<br /> --resourceProfileId $SPARK_RESOURCE_PROFILE_ID<br /> )<br />...<br /><br /># Execute the container CMD under tini for better hygiene<br />exec /usr/bin/tini -s -- "${CMD[@]}" </pre>
<p>So we just need to change the CMD part to add "nsys profile" in front of the java command. </p><p>Such as:</p>
<pre class="brush:bash; toolbar: false; auto-links: false;highlight: 4"> executor)<br /> shift 1<br /> CMD=(<br /> nsys profile -o /some_persistent_storage/test_%h_%p.qdrep<br /> ${JAVA_HOME}/bin/java<br /> "${SPARK_EXECUTOR_JAVA_OPTS[@]}"<br /> -Xms$SPARK_EXECUTOR_MEMORY<br /> -Xmx$SPARK_EXECUTOR_MEMORY<br /> -cp "$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH"<br /> org.apache.spark.executor.CoarseGrainedExecutorBackend<br /> --driver-url $SPARK_DRIVER_URL<br /> --executor-id $SPARK_EXECUTOR_ID<br /> --cores $SPARK_EXECUTOR_CORES<br /> --app-id $SPARK_APPLICATION_ID<br /> --hostname $SPARK_EXECUTOR_POD_IP<br /> --resourceProfileId $SPARK_RESOURCE_PROFILE_ID<br /> )<br /> ;;</pre>
<p>Here we point the output file to a persistent storage path which can be mounted in the docker container. </p><p>"%h" means hostname and "%p" means PID. For more details please refer to the <a href="https://docs.nvidia.com/nsight-systems/UserGuide/index.html" rel="nofollow" target="_blank">Nsight Systems user guide</a>.<br /></p><h3 style="text-align: left;">2. Docker image side</h3><p>If you are using the <a href="https://github.com/NVIDIA/spark-rapids/blob/branch-0.5/docs/get-started/Dockerfile.cuda" rel="nofollow" target="_blank">Dockerfile.cuda</a>, it actually uses <a href="https://hub.docker.com/layers/nvidia/cuda/10.1-devel-ubuntu18.04/images/sha256-224aaba2c72e749f24da167d18d83908ad89c9d2af2ae89100a9858b51a71c37" rel="nofollow" target="_blank">nvidia/cuda:10.1-devel-ubuntu18.04</a> as the base image. However, this base image does not have Nsight Systems installed.</p><p>You need to either use your own base image which has Nsight Systems installed, or add the installation steps to Dockerfile.cuda.</p><p>Below is one example to install Nsight Systems from the CUDA 11.0.3 repo:</p>
<pre class="brush:bash; toolbar: false; auto-links: false"># Install Nsight-systems<br />RUN apt install -y wget && wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin<br />RUN mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600<br />RUN wget https://developer.download.nvidia.com/compute/cuda/11.0.3/local_installers/cuda-repo-ubuntu1804-11-0-local_11.0.3-450.51.06-1_amd64.deb<br />RUN dpkg --install cuda-repo-ubuntu1804-11-0-local_11.0.3-450.51.06-1_amd64.deb<br />RUN apt-key add /var/cuda-repo-ubuntu1804-11-0-local/7fa2af80.pub<br />RUN apt-get update && apt-get install -y nsight-systems-2020.4.3</pre>
<h3 style="text-align: left;">3. Build&upload the Docker Image and Run the Spark on K8s Job</h3><p>The rest steps are the same as <a href="https://github.com/NVIDIA/spark-rapids/blob/branch-0.5/docs/get-started/getting-started-kubernetes.md" rel="nofollow" target="_blank">Getting Started with RAPIDS and Kubernetes</a> doc.</p><p> </p><p> </p><p> <br /></p><p><br /></p><p> </p><p> </p><p> </p><p> </p><p> </p><p> </p><p> </p><p> </p><p>=== <br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-64180463022060354292021-04-08T21:52:00.014-07:002021-04-11T17:47:26.419-07:00How to use NVIDIA Nsight Systems to profile a Spark job on Rapids Accelerator<h1 style="text-align: left;">Goal:</h1><p style="text-align: left;">This article explains how to use NVIDIA Nsight Systems to profile a Spark job on Rapids Accelerator.<span></span></p><a name='more'></a><p style="text-align: left;"></p><h1 style="text-align: left;">Env:</h1><p style="text-align: left;">Spark 3.1.1 (Standalone Cluster)<br /></p><p style="text-align: left;">RAPIDS Accelerator for Apache Spark 0.5 snapshot</p><p style="text-align: left;">cuDF jar 0.19 snapshot<br /></p><h1 style="text-align: left;">Solution:</h1><h3 style="text-align: left;"><b>1. Build the cuDF JARs with USE_NVTX option on.</b><br /></h3><p style="text-align: left;">Follow Doc: <a href="https://nvidia.github.io/spark-rapids/docs/dev/nvtx_profiling.html" rel="nofollow" target="_blank">https://nvidia.github.io/spark-rapids/docs/dev/nvtx_profiling.html </a><br /></p><p style="text-align: left;"><b>Note: Starting from cuDF 0.19, the USE_NVTX(</b><b>NVIDIA Tools Extension) is on by default as per this <a href="https://github.com/rapidsai/cudf/pull/7761" rel="nofollow" target="_blank">PR</a> so we do not need to build jar any more. It means in the future cuDF release(>=0.19) we can skip this step.</b><br /></p><p style="text-align: left;">So here in this test, I just used the latest <a href="https://oss.sonatype.org/content/repositories/snapshots/ai/rapids/cudf/0.19-SNAPSHOT/" rel="nofollow" target="_blank">cuDF 0.19 snapshot jar</a> and Rapids Accelerator 0..5 snapshot jar(built from <a href="https://github.com/NVIDIA/spark-rapids" rel="nofollow" target="_blank">source code</a> manually) together. <b>Note: these 2 jars are not stable releases.</b><br /></p><h3 style="text-align: left;">2. Download nsight systems on your client machine<br /></h3><p style="text-align: left;"><a href="https://developer.nvidia.com/nsight-systems" rel="nofollow" target="_blank">https://developer.nvidia.com/nsight-systems</a><br /></p><p style="text-align: left;">Here I downloaded and installed on Mac where I will view the metrics later.<br /></p><h3 style="text-align: left;">3. Make sure target machine has nsys installed and meet requirements.<br /></h3><div style="text-align: left;">Please refer to <a href="https://docs.nvidia.com/nsight-systems/InstallationGuide/index.html" rel="nofollow" target="_blank">https://docs.nvidia.com/nsight-systems/InstallationGuide/index.html</a> for details. </div><div style="text-align: left;">Especially make sure the "Requirement" is met. Such as:</div><div style="text-align: left;">Use of Linux Perf: To collect thread scheduling data and IP (instruction pointer) samples, the Perf paranoid level on the target system must be 2 or less. </div><div style="text-align: left;">You can use "nsys status -e" to check the current status:</div>
<pre class="brush:bash; toolbar: false; auto-links: false">$ nsys status -e<br /><br />Sampling Environment Check<br />Linux Kernel Paranoid Level = 3: Fail<br />Linux Distribution = Ubuntu<br />Linux Kernel Version = 5.4.0-70: OK<br />Linux perf_event_open syscall available: Fail<br />Sampling trigger event available: Fail<br />Intel(c) Last Branch Record support: Not Available<br />Sampling Environment: Fail<br /><br />See the product documentation for more information.</pre>
<div style="text-align: left;">If the Kernel Paranoid Level check failed, then we can use below commands to check and enable it:<br /></div>
<pre class="brush:xml; toolbar: false; auto-links: false;highlight: [1,3,4,6,7]">$ cat /proc/sys/kernel/perf_event_paranoid<br />3<br />$ sudo sh -c 'echo 2 >/proc/sys/kernel/perf_event_paranoid'<br />$ cat /proc/sys/kernel/perf_event_paranoid<br />2<br />$ sudo sh -c 'echo kernel.perf_event_paranoid=2 > /etc/sysctl.d/local.conf'<br />$ nsys status -e<br /><br />Sampling Environment Check<br />Linux Kernel Paranoid Level = 2: OK<br />Linux Distribution = Ubuntu<br />Linux Kernel Version = 5.4.0-70: OK<br />Linux perf_event_open syscall available: OK<br />Sampling trigger event available: OK<br />Intel(c) Last Branch Record support: Available<br />Sampling Environment: OK</pre>
<div style="text-align: left;"><b>Note: there are other requirements like kernel version, glibc version, supported CUDA version. Please refer to above documentation. </b><br /></div><h3 style="text-align: left;">4. Add extra java options in both driver and executor.<br /></h3>
<pre class="brush:bash; toolbar: false; auto-links: false" style="text-align: left;">--conf spark.driver.extraJavaOptions=-Dai.rapids.cudf.nvtx.enabled=true<br />--conf spark.executor.extraJavaOptions=-Dai.rapids.cudf.nvtx.enabled=true</pre>
<p style="text-align: left;">You can consider putting those into spark-defaults.conf or specifying them each time for spark-shell/spark-sql/etc.</p><p style="text-align: left;">If you have other extraJavaOption(s), do not forget to append them.</p><h3 style="text-align: left;">5. Start spark-shell using "nsys profile"<br /></h3>
<pre class="brush:bash; toolbar: false; auto-links: false" style="text-align: left;">nsys profile bash -c " \<br />CUDA_VISIBLE_DEVICES=0 ${SPARK_HOME}/sbin/start-slave.sh $master_url & \<br />$SPARK_HOME/bin/spark-shell; \<br />${SPARK_HOME}/sbin/stop-slave.sh"</pre>
<h3 style="text-align: left;">6. Run some query</h3><p style="text-align: left;">When quitting spark-shell, it will generate a *.qdrep file in current directory.</p><p style="text-align: left;">For example:</p>
<pre class="brush:bash; toolbar: false; auto-links: false" style="text-align: left;">scala> :quit<br />:quit<br />stopping org.apache.spark.deploy.history.HistoryServer<br />stopping org.apache.spark.deploy.worker.Worker<br />stopping org.apache.spark.deploy.master.Master<br />Processing events...<br />Capturing symbol files...<br />Saving temporary "/tmp/nsys-report-58cb-6240-1a5f-e6f7.qdstrm" file to disk...<br />Creating final output files...<br /><br />Processing [==============================================================100%]<br />Saved report file to "/tmp/nsys-report-58cb-6240-1a5f-e6f7.qdrep"<br />Report file moved to "/home/xxx/report1.qdrep"</pre>
<h3 style="text-align: left;">7. Use "nsys stat" command on the target machine to check the report</h3><div style="text-align: left;">You can choose to use "nsys stat" command on the target machine to check the report or use following GUI option.</div><div style="text-align: left;">"nsys stat" can show the CUDA API summary, GPU Kernel summary, GPU Memory time summary, NVTX push-pop range summary, etc:</div>
<pre class="brush:bash; toolbar: false; auto-links: false">$ nsys stats report8.qdrep<br />Using report8.sqlite file for stats and reports.<br />Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/cudaapisum report8.sqlite] to console...<br /><br /> Time(%) Total Time (ns) Num Calls Average Minimum Maximum Name<br /> ------- --------------- --------- --------------- ------------- ------------- --------------------------<br /> 66.8 152,391,401,099 192,250 792,673.1 679 18,448,141 cudaStreamSynchronize_ptsz<br /> 31.2 71,169,590,822 114,830 619,782.2 195 9,667,534 cudaMemcpyAsync_ptsz<br /> 0.7 1,565,365,626 7 223,623,660.9 3,454 1,565,334,856 cudaFree<br /> 0.5 1,117,531,408 65,671 17,017.1 3,496 131,888 cudaLaunchKernel_ptsz<br />...<br />Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/gpukernsum report8.sqlite] to console...<br /><br /> Time(%) Total Time (ns) Instances Average Minimum Maximum Name<br /> ------- --------------- --------- ------------ ---------- ---------- ----------------------------------------------------------------------------------------------------<br /> 37.5 83,645,234,788 14,576 5,738,558.9 5,554,755 6,897,949 void (anonymous namespace)::scatter_kernel<int, (anonymous namespace)::boolean_mask_filter<false>, …<br /> 28.2 62,805,133,776 7,288 8,617,608.9 8,459,988 8,955,404 void cudf::binops::jit::kernel_v_v<bool, int, int, cudf::binops::jit::Greater>(int, bool*, int*, in…<br /> 18.8 41,854,794,778 7,288 5,742,974.0 5,634,787 5,984,609 void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…<br /> 8.7 19,342,375,816 7,289 2,653,639.2 2,575,613 2,869,850 void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…<br />...<br />Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/gpumemtimesum report8.sqlite] to console...<br /><br /> Time(%) Total Time (ns) Operations Average Minimum Maximum Operation<br /> ------- --------------- ---------- -------- ------- ------- ------------------<br /> 47.8 78,733,508 82,908 949.6 608 610,013 [CUDA memcpy DtoH]<br /> 35.7 58,761,119 80,174 732.9 640 13,792 [CUDA memset]<br /> 16.4 26,979,351 31,900 845.7 671 662,844 [CUDA memcpy HtoD]<br /> 0.1 136,064 8 17,008.0 1,632 32,640 [CUDA memcpy DtoD]<br />Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/gpumemsizesum report8.sqlite] to console...<br /><br /> Total Operations Average Minimum Maximum Operation<br /> ---------- ---------- --------- ------- --------- ------------------<br /> 37,577.836 31,900 1.178 0.004 7,813.324 [CUDA memcpy HtoD]<br /> 32,226.750 8 4,028.344 244.187 7,812.500 [CUDA memcpy DtoD]<br /> 24,145.266 82,908 0.291 0.001 7,812.500 [CUDA memcpy DtoH]<br /> 16,326.898 80,174 0.204 0.001 7,812.500 [CUDA memset]<br /> ...<br /> Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/nvtxppsum report8.sqlite] to console...<br /><br /> Time(%) Total Time (ns) Instances Average Minimum Maximum Range<br /> ------- --------------- --------- --------------- ------------- ------------- -------------------------------<br /> 41.0 209,116,856,965 10,002 20,907,504.2 117,938 23,476,086 libcudf:apply_boolean_mask<br /> 41.0 209,039,719,367 10,002 20,899,792.0 116,416 23,467,375 libcudf:copy_if<br /> 16.7 85,273,533,436 10,000 8,527,353.3 8,431,597 13,684,934 libcudf:cross_join<br />...</pre>
<h3 style="text-align: left;">8. Copy the *.qdrep to the client machine where nsight systems is installed.<br /></h3><p style="text-align: left;">Open the *.qdrep using nsight systems. <br /></p><p style="text-align: left;">My query in above #5 is a cross-join which takes around 6mins. <br /></p><p style="text-align: left;">Normally I will firstly "Analysis Summary" tab to get the PID of <b>Spark Executor</b>(24897) which would be my focus. <br /></p><p style="text-align: left;"></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgArIcp9ZDe0Dukg7Xa5ugL47gXFIz3aksRTeouo0nnGeoHywCz3sNnEsQ85yg2nJlRI462YyhN8Uu6DSjJeIoviDh6RNT1yTfPLOhKHhgd5OhMrNuwHw3Of_QEcgQz7EI5n5kg1lHWanA/s406/Screen+Shot+2021-04-08+at+9.37.06+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="332" data-original-width="406" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgArIcp9ZDe0Dukg7Xa5ugL47gXFIz3aksRTeouo0nnGeoHywCz3sNnEsQ85yg2nJlRI462YyhN8Uu6DSjJeIoviDh6RNT1yTfPLOhKHhgd5OhMrNuwHw3Of_QEcgQz7EI5n5kg1lHWanA/s320/Screen+Shot+2021-04-08+at+9.37.06+PM.png" width="320" /></a></div><p><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihkUoObwCYcvWSJCK3z1s01G7f6ZP_fm0maqk-PNn88zRU0ElV1SQYQiQ7k96138nqSA-PPQjIwX1LO0ZkzELudoYUjFLvJhBxfdIaDJTU-nqjs3BOvoYtod_MHgqc9G2tqvdGVpCZc6k/s1704/Screen+Shot+2021-04-09+at+4.19.19+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="212" data-original-width="1704" height="80" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihkUoObwCYcvWSJCK3z1s01G7f6ZP_fm0maqk-PNn88zRU0ElV1SQYQiQ7k96138nqSA-PPQjIwX1LO0ZkzELudoYUjFLvJhBxfdIaDJTU-nqjs3BOvoYtod_MHgqc9G2tqvdGVpCZc6k/w640-h80/Screen+Shot+2021-04-09+at+4.19.19+PM.png" width="640" /></a></div><span></span><br /><p></p><p>Then move to "Timeline view" tab and identify Spark Executor process:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTXsE3eF6MCV_oRrtXEBnlt5UpAnPJ2BXgVzQFNcYhGuOOCjjE7fhEUtwbtc59RHQrESRW-KNYKmR4ijKgYEagPi50SBmqQs20qru4s_s-Cul0MnsTHr_FrLIxZ8LHjlT1kxQk_cutIX0/s2192/Screen+Shot+2021-04-09+at+4.22.08+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="526" data-original-width="2192" height="154" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTXsE3eF6MCV_oRrtXEBnlt5UpAnPJ2BXgVzQFNcYhGuOOCjjE7fhEUtwbtc59RHQrESRW-KNYKmR4ijKgYEagPi50SBmqQs20qru4s_s-Cul0MnsTHr_FrLIxZ8LHjlT1kxQk_cutIX0/w640-h154/Screen+Shot+2021-04-09+at+4.22.08+PM.png" width="640" /></a></div><p></p><p style="text-align: left;"></p><p style="text-align: left;"></p>As we can see the CUDA HW(GPU) is showing busy(blue) for most of the time. 
<br /><p style="text-align: left;"></p><p style="text-align: left;">If we hover mouse on it, it can show you the CUDA Kernel running% at that time:</p><p style="text-align: left;"></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjt3ocn8qW3i9mZ-U2fyFPJe8BpXZ-L2djKhH24lW76VWMinTPy0vyYtCKdOBjrN5-9LLJQ3NCeyJJMfFJHY3cuBZM0D3nzTb7bXQzAT19laI4WUa3qJTEhjKrnRQlwXNUQajICvCHCgyI/s616/Screen+Shot+2021-04-08+at+9.41.14+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="320" data-original-width="616" height="332" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjt3ocn8qW3i9mZ-U2fyFPJe8BpXZ-L2djKhH24lW76VWMinTPy0vyYtCKdOBjrN5-9LLJQ3NCeyJJMfFJHY3cuBZM0D3nzTb7bXQzAT19laI4WUa3qJTEhjKrnRQlwXNUQajICvCHCgyI/w640-h332/Screen+Shot+2021-04-08+at+9.41.14+PM.png" width="640" /></a></div><p>We can dig further into all threads of Spark Executor process, and we can identify the Executor Task 1 thread keeps calling CUDA API during that time. </p><p>And most importantly, here the "libcudf" and "NVTX(libcudf)" rows will show up. <b> </b></p><p><b>Note:They will NOT show up if "NVTX" is not switched on when building cuDF jar.</b><br /></p><p>Here "libcudf" row shows "cross_join" which match our query type.<br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8TeKybhEpZIDsxHolFT1XZexSsONjiPLRkkMFG4ikxLD-rHBujM7hmoD2-CUmnNwxYoHo1WTUeMezTH3l3zcCDaU6GSBzAHSnMs37-UYhnvYO_z_k_uu1HnAPhyphenhyphenhsRthEXMBpEcrVwtk/s2302/Screen+Shot+2021-04-09+at+4.24.42+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="508" data-original-width="2302" height="142" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8TeKybhEpZIDsxHolFT1XZexSsONjiPLRkkMFG4ikxLD-rHBujM7hmoD2-CUmnNwxYoHo1WTUeMezTH3l3zcCDaU6GSBzAHSnMs37-UYhnvYO_z_k_uu1HnAPhyphenhyphenhsRthEXMBpEcrVwtk/w640-h142/Screen+Shot+2021-04-09+at+4.24.42+PM.png" width="640" /></a></div><p style="text-align: left;"> "NVTX(libcudf)" row shows similar things under "CUDA HW" section:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhosJ7N7knfJs15-13-SV71zPDuESFP9safCHAHfkG9yAEWN8VidbH2V2VLC4XJUwVRMol_sXZRIgEmObgVBxlWH8f8xpRPNPRMNcTxWmnkhPD6NZVKUBXsOUQHr2kocQ7JoS5et6ru9DY/s2230/Screen+Shot+2021-04-09+at+4.32.14+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="602" data-original-width="2230" height="173" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhosJ7N7knfJs15-13-SV71zPDuESFP9safCHAHfkG9yAEWN8VidbH2V2VLC4XJUwVRMol_sXZRIgEmObgVBxlWH8f8xpRPNPRMNcTxWmnkhPD6NZVKUBXsOUQHr2kocQ7JoS5et6ru9DY/w640-h173/Screen+Shot+2021-04-09+at+4.32.14+PM.png" width="640" /></a></div><br /><h2 style="text-align: left;">Tips: <br /></h2><h3 style="text-align: left;"><b>1. 
One useful tip is to pin the related rows and compare:</b></h3><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9VlaDIuVNonglA-rrR3ocBKokEDcyEA4ovMVBcpkFiZ2NsMH7Tk0aet2kTbruGsU4WJVhxrXCwzfAo4V7AJrWxsHXHrH3VYglBIwgrY2Q86-afjyfcYAe9W9VAyBLoobH1ofLM43ZWA0/s810/Screen+Shot+2021-04-08+at+9.46.52+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="546" data-original-width="810" height="432" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9VlaDIuVNonglA-rrR3ocBKokEDcyEA4ovMVBcpkFiZ2NsMH7Tk0aet2kTbruGsU4WJVhxrXCwzfAo4V7AJrWxsHXHrH3VYglBIwgrY2Q86-afjyfcYAe9W9VAyBLoobH1ofLM43ZWA0/w640-h432/Screen+Shot+2021-04-08+at+9.46.52+PM.png" width="640" /></a></div><p style="text-align: left;">After those rows got pinned, if you scroll down/up, they will always be on top or at bottom, such as:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgaJBHWIkKtyK3tlgyd2lbIw-xIi3q3lQJK3_kVwksxMPkPNzzYL_rQytOY803Q7Mw2mgy_64BXhrMkF4jwVrCrUqBY4iEU13_zMiR3XP6PrR5uXHVTwZGbIZ_rfkOyorGFMJSVW1b3cn0/s1136/Screen+Shot+2021-04-08+at+9.47.53+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="436" data-original-width="1136" height="246" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgaJBHWIkKtyK3tlgyd2lbIw-xIi3q3lQJK3_kVwksxMPkPNzzYL_rQytOY803Q7Mw2mgy_64BXhrMkF4jwVrCrUqBY4iEU13_zMiR3XP6PrR5uXHVTwZGbIZ_rfkOyorGFMJSVW1b3cn0/w640-h246/Screen+Shot+2021-04-08+at+9.47.53+PM.png" width="640" /></a></div><h3 style="text-align: left;"><b>2. Change the time from "session time" to "global time"</b></h3><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZHvzwk-g4ymTeHIPn5vqPaQcvZzdmDnkPSPDsCGb8Tk4MqaeIgd9ILwSf7dOdbeNpvlLEXcpUqkPm9HrSFOvGzLlb6POk4qgz7Nm_bmz0t60lacphM6_gq40o7F5MZRA6uL2Me9QHlVo/s894/Screen+Shot+2021-04-08+at+9.49.24+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="320" data-original-width="894" height="230" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZHvzwk-g4ymTeHIPn5vqPaQcvZzdmDnkPSPDsCGb8Tk4MqaeIgd9ILwSf7dOdbeNpvlLEXcpUqkPm9HrSFOvGzLlb6POk4qgz7Nm_bmz0t60lacphM6_gq40o7F5MZRA6uL2Me9QHlVo/w640-h230/Screen+Shot+2021-04-08+at+9.49.24+PM.png" width="640" /></a></div><p style="text-align: left;">After that, it will show machine time which can help you match the real world time.<br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiE5p_LTT60Gkll9qd6MIvhQukYM8dxRYSigdUI7oXk5_vRT36xfeigl9Bv_u1ynVvwiLfSH3iBKvyYLOrq7hwzYtmobhTfHQqzMEEwuvQvuLTIrXrdLfuHx0t3rvJi3JJppSKwdKuBhKw/s1240/Screen+Shot+2021-04-08+at+9.51.00+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="226" data-original-width="1240" height="116" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiE5p_LTT60Gkll9qd6MIvhQukYM8dxRYSigdUI7oXk5_vRT36xfeigl9Bv_u1ynVvwiLfSH3iBKvyYLOrq7hwzYtmobhTfHQqzMEEwuvQvuLTIrXrdLfuHx0t3rvJi3JJppSKwdKuBhKw/w640-h116/Screen+Shot+2021-04-08+at+9.51.00+PM.png" width="640" /></a></div><h3 style="text-align: left;"><b>3. How to start/stop collection manually<br /></b></h3>
<div style="text-align: left;">We can firstly "<i><b>nsys launch</b></i>" the Spark worker/slave, and then use "<i><b>nsys start</b></i>" and "<b><i>nsys stop</i></b>" to control the collection window manually.</div><div style="text-align: left;"><b>a. Stop spark slaves manually</b><br /></div>
<pre class="brush:bash; toolbar: false; auto-links: false">${SPARK_HOME}/sbin/stop-slave.sh</pre>
<div style="text-align: left;"><b>b. Start spark slaves using "nsys launch"</b>
<pre class="brush:bash; toolbar: false; auto-links: false">nsys launch bash -c "CUDA_VISIBLE_DEVICES=0 $SPARK_HOME/sbin/start-slave.sh spark://$HOSTNAME:7077 &"</pre><b>
c. Open another terminal session, run "nsys start"</b>
<pre class="brush:bash; toolbar: false; auto-links: false">$ nsys start<br />$ nsys sessions list<br /> ID TIME STATE LAUNCH NAME<br /> 1028142 00:51 Collecting 1 [default]</pre><b>
d. Run a Spark job using either spark-shell or spark-submit or something else.<br /></b></div><div style="text-align: left;"><b>e. Run "nsys stop" after the Spark job completes <br /></b></div>
<pre class="brush:bash; toolbar: false; auto-links: false">$ nsys stop<br />Processing events...<br />Capturing symbol files...<br />Saving temporary "/tmp/nsys-report-4026-c2c5-8a18-5372.qdstrm" file to disk...<br />Creating final output files...<br /><br />Processing [==============================================================100%]<br />Saved report file to "/tmp/nsys-report-4026-c2c5-8a18-5372.qdrep"<br />Report file moved to "/home/xxx/report10.qdrep"<br />stop executed</pre>
<div style="text-align: left;"><b>f. You can start&stop more collection windows.</b></div><div style="text-align: left;"><b>g. Stop Spark-worker in the end.<br /></b></div><h1 style="text-align: left;">References:</h1><ul style="text-align: left;"><li><a href="https://nvidia.github.io/spark-rapids/docs/dev/nvtx_profiling.html" rel="nofollow" target="_blank">https://nvidia.github.io/spark-rapids/docs/dev/nvtx_profiling.html </a></li><li><a href="https://docs.nvidia.com/nsight-systems/InstallationGuide/index.html" rel="nofollow" target="_blank">https://docs.nvidia.com/nsight-systems/InstallationGuide/index.html</a></li><li><a href="https://docs.nvidia.com/nsight-systems/UserGuide/index.html" rel="nofollow" target="_blank">https://docs.nvidia.com/nsight-systems/UserGuide/index.html </a></li><li><a href="https://www.youtube.com/watch?v=kKANP0kL_hk" rel="nofollow" target="_blank">Youtube: Profiling GPU Applications with Nsight Systems </a><br /></li></ul><p style="text-align: left;"> </p><p style="text-align: left;"> </p><p style="text-align: left;"> </p><p style="text-align: left;"> </p><p style="text-align: left;"> </p><p style="text-align: left;"> <br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-72861893119155988922021-04-04T13:12:00.007-07:002021-04-04T13:16:52.567-07:00How to enable GpuKryoRegistrator on RAPIDS Accelerator for Spark<h1 style="text-align: left;">Goal:</h1><p>This article shares the steps to enable GpuKryoRegistrator on RAPIDS Accelerator for Spark.<span></span></p><a name='more'></a><p></p><h1 style="text-align: left;">Env:</h1><p>Spark 3.1.1</p><p>RAPIDS Accelerator for Apache Spark 0.4.1</p><h1 style="text-align: left;">Solution:</h1><p>As mentioned in <a href="https://spark.apache.org/docs/latest/tuning.html" rel="nofollow" target="_blank">Spark Tuning Doc</a>:</p><ul><li><a href="https://docs.oracle.com/javase/8/docs/api/java/io/Serializable.html" rel="nofollow" target="_blank">Java serialization</a>:
By default, Spark serializes objects using Java’s <code class="language-plaintext highlighter-rouge">ObjectOutputStream</code> framework, and can work
with any class you create that implements
<a href="https://docs.oracle.com/javase/8/docs/api/java/io/Serializable.html" rel="nofollow" target="_blank"><code class="language-plaintext highlighter-rouge">java.io.Serializable</code></a>.
You can also control the performance of your serialization more closely by extending
<a href="https://docs.oracle.com/javase/8/docs/api/java/io/Externalizable.html" rel="nofollow" target="_blank"><code class="language-plaintext highlighter-rouge">java.io.Externalizable</code></a>.
Java serialization is flexible but often quite slow, and leads to large
serialized formats for many classes.</li><li><a href="https://github.com/EsotericSoftware/kryo" rel="nofollow" target="_blank">Kryo serialization</a>: Spark can also use
the Kryo library (version 4) to serialize objects more quickly. Kryo is significantly
faster and more compact than Java serialization (often as much as 10x), but does not support all
<code class="language-plaintext highlighter-rouge">Serializable</code> types and requires you to <i>register</i> the classes you’ll use in the program in advance
for best performance.</li></ul><p>Rapids Accelerator also has a class named <a href="https://github.com/NVIDIA/spark-rapids/blob/branch-0.5/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuKryoRegistrator.scala" rel="nofollow" target="_blank">com.nvidia.spark.rapids.GpuKryoRegistrator</a> which uses Kryo to register the classes below<span class="pl-en">, defined in <a href="https://github.com/NVIDIA/spark-rapids/blob/branch-0.5/sql-plugin/src/main/scala/org/apache/spark/sql/rapids/execution/GpuBroadcastExchangeExec.scala" rel="nofollow" target="_blank">org.apache.spark.sql.rapids.execution.GpuBroadcastExchangeExec</a>:<br /></span></p><ul style="text-align: left;"><li><span class="pl-en">SerializeConcatHostBuffersDeserializeBatch</span></li><li><span class="pl-en">SerializeBatchDeserializeHostBuffer</span></li></ul><h3 style="text-align: left;">How to enable?</h3><p>Set the below 2 parameters (e.g. in spark-defaults.conf): <br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">spark.serializer org.apache.spark.serializer.KryoSerializer<br />spark.kryo.registrator com.nvidia.spark.rapids.GpuKryoRegistrator</pre>
<h3 style="text-align: left;">Common Issues<br /></h3><p>This is a common issue in <a href="https://github.com/EsotericSoftware/kryo" rel="nofollow" target="_blank">Kryo serialization</a> : Buffer overflow.</p><p>For example, when running Q7 of TPCDS/NDS, it may fail with:</p><pre class="brush:text; toolbar: false; auto-links: false;highlight: 1">Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 636<br /> at com.esotericsoftware.kryo.io.Output.require(Output.java:167)<br /> at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:251)<br /> at com.esotericsoftware.kryo.io.Output.write(Output.java:219)<br /> at java.base/java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1859)<br /> at java.base/java.io.ObjectOutputStream.write(ObjectOutputStream.java:712)<br /> at java.base/java.io.BufferedOutputStream.write(BufferedOutputStream.java:123)<br /> at java.base/java.io.DataOutputStream.write(DataOutputStream.java:107)<br /> at ai.rapids.cudf.JCudfSerialization$DataOutputStreamWriter.copyDataFrom(JCudfSerialization.java:600)<br /> at ai.rapids.cudf.JCudfSerialization$DataWriter.copyDataFrom(JCudfSerialization.java:546)<br /> at ai.rapids.cudf.JCudfSerialization.copySlicedAndPad(JCudfSerialization.java:1104)<br /> at ai.rapids.cudf.JCudfSerialization.copySlicedOffsets(JCudfSerialization.java:1332)<br /> at ai.rapids.cudf.JCudfSerialization.writeSliced(JCudfSerialization.java:1464)<br /> at ai.rapids.cudf.JCudfSerialization.writeSliced(JCudfSerialization.java:1517)<br /> at ai.rapids.cudf.JCudfSerialization.writeToStream(JCudfSerialization.java:1567)<br /> at org.apache.spark.sql.rapids.execution.SerializeBatchDeserializeHostBuffer.writeObject(GpuBroadcastExchangeExec.scala:153)<br /> at jdk.internal.reflect.GeneratedMethodAccessor91.invoke(Unknown Source)<br /> at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)<br /> at java.base/java.lang.reflect.Method.invoke(Method.java:566)<br /> at java.base/java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1145)<br /> at java.base/java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1497)<br /> at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1433)<br /> at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179)<br /> at java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349)<br /> at com.esotericsoftware.kryo.serializers.JavaSerializer.write(JavaSerializer.java:51)<br /> ... 9 more</pre>
<p>The fix is to increase <b><i>spark.kryoserializer.buffer.max</i></b> from the default 64m to a larger value, say 512m:<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">spark.kryoserializer.buffer.max 512m</pre>
OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-47499890864872518752021-04-02T17:21:00.000-07:002021-04-02T17:21:01.581-07:00 How to install a Kubernetes Cluster with NVIDIA GPU on AWS using DeepOps<h1 style="text-align: left;">Goal:</h1><p style="text-align: left;">This article shares a step-by-step guide on how to install a Kubernetes Cluster with NVIDIA GPU on AWS using <a href="https://github.com/NVIDIA/deepops" rel="nofollow" target="_blank">DeepOps</a>.<br /></p><span><a name='more'></a></span><h1 style="text-align: left;">Env:</h1><p style="text-align: left;">AWS EC2 (G4dn)<br /></p><p style="text-align: left;">Ubuntu 18.04</p><h1 style="text-align: left;">Solution: </h1><p style="text-align: left;">Most of the steps are the same as in the previous blog post: <a href="http://www.openkb.info/2021/03/how-to-install-kubernetes-cluster-with.html" rel="nofollow" target="_blank">How to install a Kubernetes Cluster with NVIDIA GPU on AWS</a>. </p><p style="text-align: left;">That previous blog uses kubeadm to manually install a Kubernetes Cluster by installing the below components: Docker, NVIDIA Container Toolkit (nvidia-docker2) and NVIDIA Device Plugin.<br /></p><p style="text-align: left;">In this blog, we will just use <a href="https://github.com/NVIDIA/deepops" rel="nofollow" target="_blank">DeepOps</a> to do the above work by following <a href="https://github.com/NVIDIA/deepops/tree/master/docs/k8s-cluster" rel="nofollow" target="_blank">https://github.com/NVIDIA/deepops/tree/master/docs/k8s-cluster</a>.<br /></p><p style="text-align: left;">So basically we just need to replace section #4 of the previous blog with the below steps. (So here let me use step 4 as the starting point.)</p><h3 style="text-align: left;">4.1 Download DeepOps repo<br /></h3><p style="text-align: left;">On the EC2 machine:</p>
<pre class="brush:bash; toolbar: false; auto-links: false" style="text-align: left;">git clone https://github.com/NVIDIA/deepops.git<br />cd deepops \<br /> && git checkout tags/20.10</pre>
<h3 style="text-align: left;">4.2 Install ansible and other needed software <br /></h3>
<pre class="brush:bash; toolbar: false; auto-links: false" style="text-align: left;">./scripts/setup.sh</pre>
<h3 style="text-align: left;">4.3 Edit inventory and add nodes to the "KUBERNETES" section<br /></h3>
<pre class="brush:bash; toolbar: false; auto-links: false" style="text-align: left;">vi config/inventory</pre>
<p style="text-align: left;">Note: Since this is a single-node cluster, we need to add the same `hostname` to [kube-master], [etcd] and [kube-node] section.<br /></p><h3 style="text-align: left;">4.4 Verify the configuration </h3>
<pre class="brush:bash; toolbar: false; auto-links: false" style="text-align: left;">ansible all -m raw -a "hostname"</pre>
<p style="text-align: left;"></p><h3 style="text-align: left;">4.5 Install Kubernetes using Ansible and Kubespray.<br /></h3>
<pre class="brush:bash; toolbar: false; auto-links: false" style="text-align: left;">ansible-playbook -l k8s-cluster playbooks/k8s-cluster.yml</pre>
<h3 style="text-align: left;">4.6 Test K8s cluster <br /></h3>
<pre class="brush:bash; toolbar: false; auto-links: false" style="text-align: left;">kubectl get nodes<br />kubectl run gpu-test --rm -t -i --restart=Never --image=nvcr.io/nvidia/cuda:10.1-base-ubuntu18.04 --limits=nvidia.com/gpu=1 nvidia-smi</pre><h1 style="text-align: left;">Issues:</h1><h3 style="text-align: left;">1. There are 2 CoreDNS PODs with 1 POD pending<br /></h3>
<pre class="brush:bash; toolbar: false; auto-links: false"># kubectl get pods -A |grep coredns<br />kube-system coredns-123 0/1 Pending 0 2m40s<br />kube-system coredns-456 1/1 Running 0 64m</pre>
<p style="text-align: left;">If we describe this pending POD, we got to know this is due to <a href="https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/" rel="nofollow" target="_blank">pod affinity/anti-affinity</a> since we have only 1 node in this K8s cluster.<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false"># kubectl describe pod coredns-123 -n kube-system |grep affinity<br /> Warning FailedScheduling 73s default-scheduler 0/1 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules.<br /> Warning FailedScheduling 73s default-scheduler 0/1 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules.</pre>
<p style="text-align: left;">CoreDNS deployment have 2 desired PODs:</p>
<pre class="brush:bash; toolbar: false; auto-links: false"># kubectl describe deployment.apps -n kube-system coredns |grep desired<br />Replicas: 2 desired | 2 updated | 2 total | 1 available | 1 unavailable</pre>
<p style="text-align: left;">One way to resolve this in my first thought is to manually scale down deployment CoreDNS as below:<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">kubectl scale deployments.apps -n kube-system coredns --replicas=1</pre>
<p style="text-align: left;">However it did not work.</p><p style="text-align: left;">The reason is by default, deployment dns-autoscaler is also installed, so the final fix is to:<br /></p><pre class="brush:bash; toolbar: false; auto-links: false">kubectl edit configmap dns-autoscaler --namespace=kube-system</pre>
<p style="text-align: left;">In above configMap, change <i><b>"min":2</b></i> to <i><b>"min":1</b></i>.</p><p style="text-align: left;">After that, if you describe CoreDNS again, it will show it got scaled down to 1:<br /></p><pre class="brush:bash; toolbar: false; auto-links: false"># kubectl describe deployment.apps -n kube-system coredns<br />Replicas: 1 desired | 1 updated | 1 total | 0 available | 1 unavailable<br /> Normal ScalingReplicaSet 21s (x2 over 12m) deployment-controller Scaled down replica set coredns-xxx to 1</pre>
<p style="text-align: left;">Eventually you can delete the pending coreDNS pod if it is still there:<br /></p><pre class="brush:bash; toolbar: false; auto-links: false">kubectl delete pods coredns-123 -n kube-system</pre>
<h3 style="text-align: left;">2. CoreDNS pod crashed with the reason as "OOMKilled"<br /></h3><p style="text-align: left;">If we describe the crashed POD, we can get below reason:</p><pre class="brush:bash; toolbar: false; auto-links: false"> State: Waiting<br /> Reason: CrashLoopBackOff<br /> Last State: Terminated<br /> Reason: OOMKilled<br /> Exit Code: 137<br /> Started: Fri, 02 Apr 2021 21:32:12 +0000<br /> Finished: Fri, 02 Apr 2021 21:32:21 +0000<br /> Ready: False<br /> Restart Count: 3<br /> Limits:<br /> memory: 170Mi<br /> Requests:<br /> cpu: 100m<br /> memory: 70Mi</pre>
<p style="text-align: left;">This is because by default, CoreDNS POD has 170MB memory limit which may be too small for big cluster. Here are some <a href="https://github.com/coredns/coredns/issues/3388" rel="nofollow" target="_blank">reported occurrence</a> as well.<br /></p><p style="text-align: left;">The fix is straightforward, just increase the deployment CoreDNS' resource limit:<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">kubectl set resources deployment.v1.apps/coredns --limits=cpu=1000m,memory=1024Mi</pre>
<h3 style="text-align: left;">3. Spark on Kubernetes Job in client mode keeps failing<br /></h3>The Spark Driver may keep printing below messages:<br /><pre class="brush:bash; toolbar: false; auto-links: false">Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources.</pre><p style="text-align: left;">The Spark Executor may keeps crashing and restarting, but if we use "kubectl logs" to check the Executor POD, we will get the root cause:</p>
<pre class="brush:bash; toolbar: false; auto-links: false;highlight: 1">Caused by: java.net.UnknownHostException: ip-xxx-xxx-xxx-xxx.cluster.local</pre>
<p style="text-align: left;">It means the POD can not resolve the hostname of the node. <br /></p><p style="text-align: left;">If we spin-off a "busybox" POD to test DNS to troubleshoot:</p><p style="text-align: left;"><b>a. Create busybox.yaml with below content:</b><br /></p><pre class="brush:bash; toolbar: false; auto-links: false">apiVersion: v1<br />kind: Pod<br />metadata:<br /> name: busybox<br /> namespace: default<br />spec:<br /> containers:<br /> - image: busybox<br /> command:<br /> - sleep<br /> - "3600"<br /> imagePullPolicy: IfNotPresent<br /> name: busybox<br /> restartPolicy: Always</pre>
<p style="text-align: left;"><b>b. Test the DNS resolution in the sample "busybox" POD: </b><br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">kubectl create -f busybox.yaml<br />kubectl exec -ti busybox -- cat /etc/resolv.conf<br />kubectl exec -ti busybox -- nslookup ip-xxx-xxx-xxx-xxx.cluster.local</pre>
<p style="text-align: left;">We will get to know that both /etc/resolv.conf has default DNS server as "169.254.25.10" which can not resolve the <b><i>hostname -f</i></b> of the machine.</p><p style="text-align: left;">So what is this IP 169.254.25.10?</p><p style="text-align: left;">As we know by default, <a href="https://github.com/kubernetes-sigs/kubespray" rel="nofollow" target="_blank">kubespray</a> enables nodelocal dns cache with default IP as 169.254.25.10.<br /></p><p style="text-align: left;">So it creates a new IP address for this machine if you check "ifconfig": <br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false"># ifconfig -a |grep 169.254.25.10<br /> inet 169.254.25.10 netmask 255.255.255.255 broadcast 169.254.25.10<br /><br /># ps -ef|grep 169.254.25.10|grep -v grep<br />root 111 222 0 xx:xx ? 00:00:45 /node-cache -localip 169.254.25.10 -conf /etc/coredns/Corefile -upstreamsvc coredns<br /><br /># kubectl get pods -A |grep nodelocaldns<br />kube-system nodelocaldns-xxxxx 1/1 Running 0 161m</pre>
<p style="text-align: left;"><br /></p><p style="text-align: left;">Eventually I found out the root cause:</p><p style="text-align: left;">The <b><i>hostname</i></b> and <b><i>hostname -f</i></b> on the EC2 machine return different results:</p><p style="text-align: left;"><b><i>hostname</i></b> returns "ip-xxx-xxx-xxx-xxx.<span style="color: red;">ec2.internal</span>" however <b><i>hostname -f </i></b>returns "ip-xxx-xxx-xxx-xxx<span style="color: #ff00fe;">.cluster.local</span>".</p><p style="text-align: left;">This is because below entry was added by Ansible in /etc/hosts:<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false"># Ansible inventory hosts BEGIN<br />xxx.xxx.xxx.xxx ip-xxx-xxx-xxx-xxx.cluster.local ip-xxx-xxx-xxx-xxx ip-xxx-xxx-xxx-xxx.ec2.internal.cluster.local ip-xxx-xxx-xxx-xxx.ec2.internal</pre>
<p style="text-align: left;">After removing above entries from /etc/hosts, <b><i>hostname</i></b> and <i><b>hostname -f</b></i> are matched now -- "ip-xxx-xxx-xxx-xxx.<span style="color: red;">ec2.internal</span>".</p><p style="text-align: left;">Basically we just let DNS server to resolve the <b><i>hostname</i></b>.</p><p style="text-align: left;">Now the spark on kubernetes job in client mode works fine.<br /></p><p style="text-align: left;"><br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-4349135494914277622021-03-30T22:07:00.007-07:002021-06-14T14:51:12.356-07:00How to install a Kubernetes Cluster with NVIDIA GPU on AWS<h1 style="text-align: left;">Goal:</h1><p>This article shares a step-by-step guide on how to install a Kubernetes Cluster with NVIDIA GPU on AWS. </p><p>It includes spinning up an AWS EC2 instance, installing NVIDIA drivers&cudatoolkit, installing Kubernetes Cluster with GPU support, and eventually ran a Spark+Rapids job to test it.<br /></p><span><a name='more'></a></span><h1 style="text-align: left;">Env:</h1><p>AWS EC2 (G4dn)<br /></p><p>Ubuntu 18.04</p><h1 style="text-align: left;">Solution: <br /></h1><h3 style="text-align: left;">1. Spin up an AWS EC2 instance with NVIDIA GPU<br /></h3><p>Here I choose "Ubuntu Server 18.04 LTS (HVM), SSD Volume Type" base image.<br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKtV5Ajs0u_nJDClO8jXpo3q9KA2LRrcVbSeKKJ7Rz94KbZzRKME1mrpwFr2d6xK174fK9yjZmmHTySbh7ed21geGFZIiu_nbEyosKUF-mApwWuwPvbinfCWG4S4i4bxFxovrk-fxld7w/s1746/Screen+Shot+2021-03-30+at+2.09.08+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="352" data-original-width="1746" height="129" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKtV5Ajs0u_nJDClO8jXpo3q9KA2LRrcVbSeKKJ7Rz94KbZzRKME1mrpwFr2d6xK174fK9yjZmmHTySbh7ed21geGFZIiu_nbEyosKUF-mApwWuwPvbinfCWG4S4i4bxFxovrk-fxld7w/w640-h129/Screen+Shot+2021-03-30+at+2.09.08+PM.png" width="640" /></a></div><p>Choose "Instance Type": g4dn.2xlarge (8vCPU, 32G memory, 1x 225 SSD).</p><p><b>Note: <a href="https://aws.amazon.com/ec2/instance-types/g4/" rel="nofollow" target="_blank">EC2 G4dn instance</a> has NVIDIA T4 GPU(s) attached. </b><br /></p><p>Go to "<b>Step 3: Configure Instance Details</b>": Auto-assign Public IP=Enable.</p><p>Go to "<b>Step 4: Add Storage</b>": Increase the Root Volume from default 8G to 200G.</p><p></p><p>Go to "<b>Step 6: Configure Security Group</b>": Create a security group with ssh only allowed from your public IP address.</p><p>Eventually "<b>Launch</b>" and select an existing key pair or create a new key pair.<br /></p><h3 style="text-align: left;">2. SSH to the EC2 instance</h3><p>Please follow <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html" rel="nofollow" target="_blank">the Doc on how to ssh to EC2 instance</a>.<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">ssh -i /path/my-key-pair.pem ubuntu@my-instance-public-dns-name<br />sudo su - root</pre>
<h3 style="text-align: left;">3. Install NVIDIA Driver and cudatoolkit<br /></h3><p>Please follow this blog on <a href="http://www.openkb.info/2021/03/how-to-intall-cuda-toolkit-and-nvidia.html" target="_blank">How to intall CUDA Toolkit and NVIDIA Driver on Ubuntu (step by step)</a>.<br /></p><p>Make sure "<i><b>nvidia-smi</b></i>" returns correct results. </p><p>Below is a lazy-man's script to install CUDA 11.0.3 with NVIDIA Driver 450.51.06 on ubuntu x86-64 run by root user after you logon this EC2 machine:<br /></p><p>(Note: Please validate it carefully yourself!)</p>
<pre class="brush:bash; toolbar: false; auto-links: false">apt-get update<br />apt install -y gcc<br />apt-get install -y linux-headers-$(uname -r)<br />wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin<br />mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600<br />wget https://developer.download.nvidia.com/compute/cuda/11.0.3/local_installers/cuda-repo-ubuntu1804-11-0-local_11.0.3-450.51.06-1_amd64.deb<br />dpkg --install cuda-repo-ubuntu1804-11-0-local_11.0.3-450.51.06-1_amd64.deb<br />apt-key add /var/cuda-repo-ubuntu1804-11-0-local/7fa2af80.pub<br />apt-get update<br />apt-get install -y cuda<br />printf "export PATH=/usr/local/cuda/bin\${PATH:+:\${PATH}}\nexport LD_LIBRARY_PATH=/usr/local/cuda/lib64{LD_LIBRARY_PATH:+:\${LD_LIBRARY_PATH}}" >> ~/.bashrc<br />nvidia-smi</pre>
<h3 style="text-align: left;">4. Install a Kubernetes Cluster with NVIDIA GPU<br /></h3><p>Please follow this NVIDIA Doc on <a href="https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html" rel="nofollow" target="_blank">how to install a Kubernetes Cluster </a>with NVIDIA GPU attached.<br /></p><p>Here I choose to use "Option 2" which is to use <i><b>kubeadm</b></i>.</p><h4 style="text-align: left;">4.1 Install Docker<br /></h4><pre class="brush:bash; toolbar: false; auto-links: false">curl https://get.docker.com | sh \<br /> && sudo systemctl --now enable docker</pre>
<h4 style="text-align: left;">4.2 Install kubeadm</h4><p>Please follow this K8s Doc on <a href="https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm" rel="nofollow" target="_blank">how to install kubeadm</a>. </p><h4 style="text-align: left;">4.3 Init a Kubernetes Cluster</h4>
<pre class="brush:bash; toolbar: false; auto-links: false">kubeadm init --pod-network-cidr=192.168.0.0/16</pre>
<p>Then follow the steps printed at the end of the init output (similar to the sketch above) to start using the cluster. <br /></p><h4 style="text-align: left;">4.4 Configure network <br /></h4>
<pre class="brush:bash; toolbar: false; auto-links: false">kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml<br />kubectl taint nodes --all node-role.kubernetes.io/master-</pre>
<h4 style="text-align: left;">4.5 Check the Nodes which should be in "Ready" status <br /></h4><pre class="brush:bash; toolbar: false; auto-links: false"># kubectl get nodes<br />NAME STATUS ROLES AGE VERSION<br />ip-xxx-xxx-xx-xx Ready control-plane,master 11m v1.20.5</pre>
<h4 style="text-align: left;">4.6 Install NVIDIA Container Toolkit (nvidia-docker2)</h4><div style="text-align: left;">Setup the stable repository for the NVIDIA runtime and the GPG key:<br /></div>
<pre class="brush:bash; toolbar: false; auto-links: false">distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \<br /> && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \<br /> && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list</pre>
<p>Then install <b><i>nvidia-docker2</i></b> package and its dependencies: <br /></p><pre class="brush:bash; toolbar: false; auto-links: false">sudo apt-get update \<br /> && sudo apt-get install -y nvidia-docker2</pre>
<p>Add "default-runtime" set to "nvidia" into /etc/docker/daemon.json:</p><pre class="brush:bash; toolbar: false; auto-links: false">{<br /> "default-runtime": "nvidia",<br /> "runtimes": {<br /> "nvidia": {<br /> "path": "/usr/bin/nvidia-container-runtime",<br /> "runtimeArgs": []<br /> }<br /> }<br />}</pre><p>Restart Docker daemon: <br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">sudo systemctl restart docker</pre>
<p>Test a base CUDA container: <br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi</pre>
<h4 style="text-align: left;">4.7 Install NVIDIA Device Plugin <br /></h4><p>Firstly install helm which is the preferred option: <br /></p><pre class="brush:bash; toolbar: false; auto-links: false">curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \<br /> && chmod 700 get_helm.sh \<br /> && ./get_helm.sh</pre>
<p>Add the <i><b>nvidia-device-plugin</b></i> helm repository:</p><pre class="brush:bash; toolbar: false; auto-links: false">helm repo add nvdp https://nvidia.github.io/k8s-device-plugin \<br /> && helm repo update</pre>
<p>Deploy the device plugin:</p><pre class="brush:bash; toolbar: false; auto-links: false">helm install --generate-name nvdp/nvidia-device-plugin</pre>
<p>Check current running PODs to make sure nvidia-device-plugin-xxx POD is running:<br /></p><pre class="brush:bash; toolbar: false; auto-links: false">kubectl get pods -A</pre>
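<p>You can also verify that the node now advertises the GPU as an allocatable resource; "nvidia.com/gpu" should show up under both Capacity and Allocatable with the number of GPUs on the node:</p><pre class="brush:bash; toolbar: false; auto-links: false">kubectl describe node | grep -i "nvidia.com/gpu"</pre>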
<h4 style="text-align: left;">4.8 Test CUDA job <br /></h4><p>Create gpu-pod.yaml with below content: <br /></p><pre class="brush:bash; toolbar: false; auto-links: false">apiVersion: v1<br />kind: Pod<br />metadata:<br /> name: gpu-operator-test<br />spec:<br /> restartPolicy: OnFailure<br /> containers:<br /> - name: cuda-vector-add<br /> image: "nvidia/samples:vectoradd-cuda10.2"<br /> resources:<br /> limits:<br /> nvidia.com/gpu: 1</pre>
<p>Deploy this sample POD:<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">kubectl apply -f gpu-pod.yaml</pre>
<p>After the POD completes successfully, check the logs to double confirm:</p>
<pre class="brush:bash; toolbar: false; auto-links: false;highlight: 1"># kubectl logs gpu-operator-test<br />[Vector addition of 50000 elements]<br />Copy input data from the host memory to the CUDA device<br />CUDA kernel launch with 196 blocks of 256 threads<br />Copy output data from the CUDA device to the host memory<br />Test PASSED<br />Done</pre>
<h3 style="text-align: left;">5. Test a Spark+Rapids on K8s job</h3><p>Please follow this Doc on <a href="https://github.com/NVIDIA/spark-rapids/blob/branch-21.06/docs/get-started/getting-started-kubernetes.md" rel="nofollow" target="_blank">Getting Started with RAPIDS and Kubernetes</a>.</p><p>Please also refer to <a href="https://spark.apache.org/docs/latest/running-on-kubernetes.html" rel="nofollow" target="_blank">Spark on K8s Doc</a> to get familiar with the basics. </p><p>For example, here we assume you know how to create service account and assign proper role to that service account.<br /></p><h4 style="text-align: left;">5.1 Create a service account named "spark" to run spark jobs<br /></h4>
<pre class="brush:bash; toolbar: false; auto-links: false">kubectl create serviceaccount spark<br />kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default</pre>
<h4 style="text-align: left;">5.2 Capture the cluster-info<br /></h4><pre class="brush:bash; toolbar: false; auto-links: false">kubectl cluster-info</pre><p>Take the notes of the "Kubernetes control plane" URL which will be used in spark job.<br /></p><h4 style="text-align: left;">5.3 Run sample spark jobs <br /></h4><p>Follow all the steps in <a href="https://github.com/NVIDIA/spark-rapids/blob/branch-21.06/docs/get-started/getting-started-kubernetes.md" rel="nofollow" target="_blank">Getting Started with RAPIDS and Kubernetes</a> to run sample Spark job in cluster or client mode.</p><p>Here we are using "spark" service account to run the Spark jobs with below extra option:<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark </pre>
<p><br /></p><h1 style="text-align: left;">References:</h1><p><a href="https://spark.apache.org/docs/latest/running-on-kubernetes.html" rel="nofollow" target="_blank">https://spark.apache.org/docs/latest/running-on-kubernetes.html</a></p><p><a href="https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/" rel="nofollow" target="_blank">https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/</a></p><p><a href="https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html" rel="nofollow" target="_blank">https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html</a><br /></p><p><br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-56288279712148724292021-03-25T14:34:00.008-07:002021-04-02T08:38:17.868-07:00concat_ws example on Spark with RAPIDS Accelerator<h1 style="text-align: left;">Goal:</h1><p>This is a quick example of operator <i><b>contact_ws</b></i> on Spark with RAPIDS Accelerator.<br /><span></span></p><a name='more'></a><p></p><h1 style="text-align: left;">Env:</h1><p>Spark 3.1.1</p><p>RAPIDS Accelerator for Apache Spark 0.4.1</p><h1 style="text-align: left;">Solution:</h1><h3 style="text-align: left;">1. <i>concat_ws</i> can convert an Array of Strings to a String with a separator. </h3><p>Below is a quick example using scala:</p><pre class="brush:sql; toolbar: false; auto-links: false">import org.apache.spark.sql.Row<br />import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, ArrayType}<br /><br />val data = Seq(<br /> Row(1, List("orange", "banana", "apple")),<br /> Row(2, List("a", "b", "c"))<br />)<br /><br />val schema = StructType(Array(<br /> StructField("idx",IntegerType,true),<br /> StructField("arrays",ArrayType(StringType),true)<br />))<br /><br />val df = spark.createDataFrame( spark.sparkContext.parallelize(data),schema )<br />val df2 = df.withColumn("concat_array", concat_ws(",",col("arrays")))<br />df2.show()<br />df2.explain()</pre><p>The output with RAPIDS Accelerator for Apache Spark 0.4.1 is :<br /></p>
<pre class="brush:sql; toolbar: false; auto-links: false;highlight: [17,25]">scala> df2.show<br /><br />+---+--------------------+-------------------+<br />|idx| arrays| concat_array|<br />+---+--------------------+-------------------+<br />| 1|[orange, banana, ...|orange,banana,apple|<br />| 2| [a, b, c]| a,b,c|<br />+---+--------------------+-------------------+<br /><br /><br />scala> df2.explain()<br />21/03/25 21:01:48 WARN GpuOverrides:<br />!Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced<br /> @Expression <AttributeReference> idx#2 could run on GPU<br /> @Expression <AttributeReference> arrays#3 could run on GPU<br /> @Expression <Alias> concat_ws(,, arrays#3) AS concat_array#15 could run on GPU<br /> !NOT_FOUND <ConcatWs> concat_ws(,, arrays#3) cannot run on GPU because no GPU enabled version of expression class org.apache.spark.sql.catalyst.expressions.ConcatWs could be found<br /> @Expression <Literal> , could run on GPU<br /> @Expression <AttributeReference> arrays#3 could run on GPU<br /> !NOT_FOUND <RDDScanExec> cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.RDDScanExec could be found<br /> @Expression <AttributeReference> idx#2 could run on GPU<br /> @Expression <AttributeReference> arrays#3 could run on GPU<br /><br />== Physical Plan ==<br />*(1) Project [idx#2, arrays#3, concat_ws(,, arrays#3) AS concat_array#15]<br />+- *(1) Scan ExistingRDD[idx#2,arrays#3]</pre>
<p>As you can see, concat_ws is not supported on RAPIDS Accelerator 0.4.1 since it falls back to CPU.</p><h3 style="text-align: left;">2. <i>concat_ws</i> can concatenate multiple columns together with a separator.<br /></h3><pre class="brush:sql; toolbar: false; auto-links: false">import org.apache.spark.sql.Row<br />import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}<br /><br />val data = Seq(<br /> Row(1, "orange", "banana", "apple"),<br /> Row(2, "a", "b", "c")<br />)<br /><br />val schema = StructType(Array(<br /> StructField("idx",IntegerType,true),<br /> StructField("s1",StringType,true),<br /> StructField("s2",StringType,true),<br /> StructField("s3",StringType,true)<br />))<br /><br />val df = spark.createDataFrame( spark.sparkContext.parallelize(data),schema )<br />val df2 = df.withColumn("concat_array", concat_ws(",",col("idx"), col("s1"), col("s2"), col("s3") ))<br />df2.show()<br />df2.explain()</pre><p>The output with RAPIDS Accelerator for Apache Spark 0.4.1 is : <br /></p>
<pre class="brush:sql; toolbar: false; auto-links: false;highlight: [19,33]">scala> df2.show<br /><br />+---+------+------+-----+--------------------+<br />|idx| s1| s2| s3| concat_array|<br />+---+------+------+-----+--------------------+<br />| 1|orange|banana|apple|1,orange,banana,a...|<br />| 2| a| b| c| 2,a,b,c|<br />+---+------+------+-----+--------------------+<br /><br /><br />scala> df2.explain()<br />21/03/25 21:19:11 WARN GpuOverrides:<br />!Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced<br /> @Expression <AttributeReference> idx#65 could run on GPU<br /> @Expression <AttributeReference> s1#66 could run on GPU<br /> @Expression <AttributeReference> s2#67 could run on GPU<br /> @Expression <AttributeReference> s3#68 could run on GPU<br /> @Expression <Alias> concat_ws(,, cast(idx#65 as string), s1#66, s2#67, s3#68) AS concat_array#73 could run on GPU<br /> !NOT_FOUND <ConcatWs> concat_ws(,, cast(idx#65 as string), s1#66, s2#67, s3#68) cannot run on GPU because no GPU enabled version of expression class org.apache.spark.sql.catalyst.expressions.ConcatWs could be found<br /> @Expression <Literal> , could run on GPU<br /> @Expression <Cast> cast(idx#65 as string) could run on GPU<br /> @Expression <AttributeReference> idx#65 could run on GPU<br /> @Expression <AttributeReference> s1#66 could run on GPU<br /> @Expression <AttributeReference> s2#67 could run on GPU<br /> @Expression <AttributeReference> s3#68 could run on GPU<br /> !NOT_FOUND <RDDScanExec> cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.RDDScanExec could be found<br /> @Expression <AttributeReference> idx#65 could run on GPU<br /> @Expression <AttributeReference> s1#66 could run on GPU<br /> @Expression <AttributeReference> s2#67 could run on GPU<br /> @Expression <AttributeReference> s3#68 could run on GPU<br /><br />== Physical Plan ==<br />*(1) Project [idx#65, s1#66, s2#67, s3#68, concat_ws(,, cast(idx#65 as string), s1#66, s2#67, s3#68) AS concat_array#73]<br />+- *(1) Scan ExistingRDD[idx#65,s1#66,s2#67,s3#68]</pre>
<p>Same here <i><b>concat_ws</b></i> is not supported on RAPIDS Accelerator 0.4.1 since it falls back to CPU.</p><p>Let's compare this scenario to a <b><i>concat</i></b> operator: <br /></p><pre class="brush:sql; toolbar: false; auto-links: false">val df3 = df.withColumn("concat_array", concat(col("idx"), lit(','), col("s1"), lit(','), col("s2"), lit(','), col("s3") ))<br />df3.show()<br />df3.explain()</pre><p>Output for <b><i>concat</i></b> with RAPIDS Accelerator for Apache Spark 0.4.1 is : </p><pre class="brush:sql; toolbar: false; auto-links: false;highlight: 21">scala> df3.show()<br /><br />+---+------+------+-----+--------------------+<br />|idx| s1| s2| s3| concat_array|<br />+---+------+------+-----+--------------------+<br />| 1|orange|banana|apple|1,orange,banana,a...|<br />| 2| a| b| c| 2,a,b,c|<br />+---+------+------+-----+--------------------+<br /><br /><br />scala> df3.explain()<br />21/03/25 21:26:28 WARN GpuOverrides:<br /> !NOT_FOUND <RDDScanExec> cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.RDDScanExec could be found<br /> @Expression <AttributeReference> idx#65 could run on GPU<br /> @Expression <AttributeReference> s1#66 could run on GPU<br /> @Expression <AttributeReference> s2#67 could run on GPU<br /> @Expression <AttributeReference> s3#68 could run on GPU<br /><br />== Physical Plan ==<br />GpuColumnarToRow false<br />+- GpuProject [idx#65, s1#66, s2#67, s3#68, gpuconcat(cast(idx#65 as string), ,, s1#66, ,, s2#67, ,, s3#68) AS concat_array#100]<br /> +- GpuRowToColumnar TargetSize(2147483647)<br /> +- *(1) Scan ExistingRDD[idx#65,s1#66,s2#67,s3#68]</pre><p>Since <b><i>concat</i></b> is a supported operator, as you can see above, it is running on GPU using "GpuProject". <br /></p><p>In this scenario, if you want, you can use <b><i>concat</i></b> to rewrite <b><i>conact_ws</i></b> to make it run on GPU in RAPIDS Accelerator 0.4.1 version. </p><p><b>Note: above tests are based on RAPIDS Accelerator 0.4.1. Future versions should have more supported operators.</b><br /></p><p>For supported operators in RAPIDS Accelerator, please always refer to <a href="https://nvidia.github.io/spark-rapids/docs/supported_ops.html" rel="nofollow" target="_blank">this RAPIDS Accelerator Doc</a>. </p><p><br /></p><p> <br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-26258618642598868922021-03-24T14:50:00.012-07:002021-04-02T08:38:21.146-07:00Hands-on native cuDF Pandas UDF<h1 style="text-align: left;">Goal:</h1><p>This article will help show some hands-on steps to play with native cuDF Pandas UDF on Spark with <a href="https://nvidia.github.io/spark-rapids" rel="nofollow" target="_blank">RAPIDS Accelerator for Apache Spark</a>.<span></span></p><a name='more'></a><p></p><h1 style="text-align: left;">Env:</h1><p>RAPIDS Accelerator for Apache Spark 0.4.1</p><p>Spark 3.1.1</p><p>RTX 6000 GPU<br /></p><h1 style="text-align: left;">Concept:</h1><p>As we know, Spark introduced <a href="https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html" rel="nofollow" target="_blank">Pandas UDFs</a> (a.k.a. 
Vectorized UDFs) feature in Spark 2.3, which brought huge performance gains.</p><p>Here we will introduce the native cuDF version of the Pandas UDF (which can run natively on the GPU) with the RAPIDS Accelerator for Apache Spark enabled.</p><p>The <a href="https://nvidia.github.io/spark-rapids/docs/configs.html" rel="nofollow" target="_blank">parameters</a> below control this behavior:<br /></p><ul style="text-align: left;"><li><b><i>spark.rapids.python.concurrentPythonWorkers</i></b> : Number of Python worker processes that can execute concurrently per GPU. </li><li><b><i>spark.rapids.python.memory.gpu.allocFraction</i></b> : The fraction of total GPU memory that should be initially allocated for pooled memory for all the Python workers. </li><li><b><i>spark.rapids.python.memory.gpu.maxAllocFraction</i></b> : The fraction of total GPU memory that limits the maximum size of the RMM pool for all the Python workers. </li><li><b><i>spark.rapids.python.memory.gpu.pooling.enabled</i></b> : Should RMM in Python workers act as a pooling allocator for GPU memory,
or should it just pass through to CUDA memory allocation directly. </li></ul><p>If we enable this feature, the Python worker processes will share and allocate GPU memory alongside the Spark Executors. Please read <a href="http://www.openkb.info/2021/03/understanding-rapids-accelerator-for_10.html" target="_blank">this post</a> for more details on GPU pool memory allocation for Spark+RAPIDS.<br /></p><p>As a result, we need to divide the GPU memory between the Spark Executors and the Python worker processes.</p><p>Here I am allocating 40% of the GPU memory for Python workers and 50% for Spark Executors by setting the following in spark-defaults.conf:</p>
<pre class="brush:bash; toolbar: false; auto-links: false">spark.rapids.sql.python.gpu.enabled true<br />spark.rapids.memory.gpu.allocFraction 0.5<br />spark.rapids.python.memory.gpu.allocFraction 0.4<br />spark.rapids.python.memory.gpu.maxAllocFraction 0.4</pre>
<p>Then I decide to spin up 2 concurrent Python workers:<br /></p><pre class="brush:bash; toolbar: false; auto-links: false">spark.rapids.python.concurrentPythonWorkers 2</pre>
<p>Since the RTX 6000 has 24G of GPU memory, once the Python workers are running you may see a DEBUG message like the one below in the Executor log:<br /></p><pre class="brush:bash; toolbar: false; auto-links: false">DEBUG: Pooled memory, pool size: 4844.0625 MiB, max size: 8796093022208.0 MiB</pre>
<p>This matches 24G * 0.4 / 2 = 4.8G.</p><p><b>Note: Since the default spark.rapids.memory.gpu.allocFraction=0.9, if the memory allocation is not set up properly, you may hit the error below in some tasks' logs:</b><br /></p><pre class="brush:text; toolbar: false; auto-links: false">MemoryError: std::bad_alloc: RMM failure at:/home/xxx/xxx/envs/rapids-0.18/include/rmm/mr/device/pool_memory_resource.hpp:188: Maximum pool size exceeded</pre><h1 style="text-align: left;">Solution:</h1><h3 style="text-align: left;">1. Python dependency is cuDF </h3><p>Make sure the cuDF library is installed in your Python env.<br /></p><p>You can follow this <a href="https://rapids.ai/start.html" rel="nofollow" target="_blank">rapids.ai getting started guide</a> to install the libraries in your conda env on all nodes. <br /></p><p>For example:</p><pre class="brush:bash; toolbar: false; auto-links: false">conda create -n rapids-0.18 -c rapidsai -c nvidia -c conda-forge \<br /> -c defaults cudf=0.18 python=3.8 cudatoolkit=11.0</pre>
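<p>A quick way to confirm the env is usable (assuming the env name above) is to import cuDF from it:</p><pre class="brush:bash; toolbar: false; auto-links: false">conda activate rapids-0.18<br />python -c "import cudf; print(cudf.__version__)"</pre>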
<p>Note: If you cannot install the cuDF library on all nodes for some reason, then you may need to package the whole conda env and distribute it to all Spark Executors, which could be very time consuming. For example, in <a href="http://www.openkb.info/2021/03/how-to-run-pandas-cudfudf-test-for.html" target="_blank">this post </a>I used this approach to run the test framework. <br /></p><p>After that, make sure the python used by pyspark points to the correct conda env by setting PYSPARK_PYTHON in spark-env.sh on all nodes:</p>
<pre class="brush:bash; toolbar: false; auto-links: false">export PYSPARK_PYTHON=/xxx/xxx/MYGLOBALENV/rapids-0.18/bin/python</pre>
<h3 style="text-align: left;">2. RAPIDS Accelerator for Apache Spark is setup properly<br /></h3><p>I am assuming you have set RAPIDS Accelerator for Apache Spark related parameters properly and RAPIDS Accelerator for Apache Spark is working fine already.<br /></p><p>Especially, the <b><i>spark.driver.extraJavaOptions</i></b>, <b><i>spark.executor.extraJavaOptions</i></b> should use UTC JVM timezone as per <a href="http://www.openkb.info/2021/03/understanding-rapids-accelerator-for_19.html" target="_blank">this post</a>. </p><p><b><i>spark.executor.extraClassPath</i></b> and <b><i>spark.driver.extraClassPath</i></b> should include the cudf jar and rapids-4-spark jar.<br /></p><h3 style="text-align: left;">3. Launch pyspark and test different kinds of UDFs<br /></h3><pre class="brush:python; toolbar: false; auto-links: false">pyspark --conf spark.executorEnv.PYTHONPATH="/home/xxx/spark/rapids/rapids-4-spark_2.12-0.4.1.jar" </pre>
<p>Here make sure you specify the correct jar path for rapids-4-spark jar.</p><p>Import needed python libs and create a sample dataframe:</p><pre class="brush:python; toolbar: false; auto-links: false">import pyspark<br />from pyspark.sql.functions import udf<br />from pyspark.sql.functions import pandas_udf, PandasUDFType<br />import cudf<br />import pandas as pd<br /><br /># Prepare sample data<br />small_data = [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)]<br />df = spark.createDataFrame(small_data, ("id", "v")) </pre><h4 style="text-align: left;">3.a row-at-a-time UDF<br /></h4>
<pre class="brush:python; toolbar: false; auto-links: false"># Use udf to define a row-at-a-time udf<br />@udf('double')<br /># Input/output are both a single double value<br />def plus_one(v):<br /> return v + 1<br /><br />df.withColumn('v2', plus_one(df.v)).show()<br />df.withColumn('v2', plus_one(df.v)).explain()</pre>
<p>Output:</p>
<pre class="brush:python; toolbar: false; auto-links: false; highlight:[6,14]">21/03/24 17:58:42 WARN GpuOverrides:<br /> !NOT_FOUND <BatchEvalPythonExec> cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.python.BatchEvalPythonExec could be found<br /> @Expression <PythonUDF> plus_one(v#1) could not block GPU acceleration<br /> @Expression <AttributeReference> v#1 could run on GPU<br /> @Expression <AttributeReference> pythonUDF0#28 could run on GPU<br /> !NOT_FOUND <RDDScanExec> cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.RDDScanExec could be found<br /> @Expression <AttributeReference> id#0L could run on GPU<br /> @Expression <AttributeReference> v#1 could run on GPU<br /><br />== Physical Plan ==<br />GpuColumnarToRow false<br />+- GpuProject [id#0L, v#1, pythonUDF0#28 AS v2#24]<br /> +- GpuRowToColumnar TargetSize(2147483647)<br /> +- BatchEvalPython [plus_one(v#1)], [pythonUDF0#28]<br /> +- *(1) Scan ExistingRDD[id#0L,v#1]</pre>
<p>As we can see, "BatchEvalPython" is not running on GPU.</p><h4 style="text-align: left;">3.b Pandas UDF</h4><div style="text-align: left;">To test the query plan or performance, we need to disable above cuDF Pandas UDF related parameters such as <b><i>spark.rapids.sql.python.gpu.enabled</i></b>. <br /></div><pre class="brush:python; toolbar: false; auto-links: false"># Use pandas_udf to define a Pandas UDF<br />@pandas_udf('double', PandasUDFType.SCALAR)<br /># Input/output are both a pandas.Series of doubles<br />def pandas_plus_one(v: pd.Series) -> pd.Series:<br /> return v + 1<br /><br />df.withColumn('v2', pandas_plus_one(df.v)).show()<br />df.withColumn('v2', pandas_plus_one(df.v)).explain()</pre><p>Output:</p>
<pre class="brush:python; toolbar: false; auto-links: false;highlight: 5">== Physical Plan ==<br />GpuColumnarToRow false<br />+- GpuProject [id#102L, v#103, pythonUDF0#136 AS v2#132]<br /> +- GpuCoalesceBatches TargetSize(2147483647)<br /> +- GpuArrowEvalPython [pandas_plus_one(v#103)], [pythonUDF0#136], 200<br /> +- GpuRowToColumnar TargetSize(2147483647)<br /> +- *(1) Scan ExistingRDD[id#102L,v#103]</pre>
<p>As we can see, it is done by GpuArrowEvalPython.</p><p>From Spark Executor log, "<b>PythonUDFRunner</b>" is started to do the work.</p><p>When the job is running, the python daemon processes are "<b>pyspark.daemon</b>":</p><pre class="brush:bash; toolbar: false; auto-links: false">python -m pyspark.daemon<br />...<br />python -m pyspark.daemon</pre><h4 style="text-align: left;">3.c cuDF Pandas UDF<br /></h4><pre class="brush:python; toolbar: false; auto-links: false">@pandas_udf('double')<br />def cudf_pandas_plus_one(v: pd.Series) -> pd.Series: <br /> gpu_series = cudf.Series(v)<br /> gpu_series = gpu_series + 1<br /> return gpu_series.to_pandas()<br /><br />df.withColumn('v2', cudf_pandas_plus_one(df.v)).show()<br />df.withColumn('v2', cudf_pandas_plus_one(df.v)).explain()</pre><p>Output:<br /></p>
<pre class="brush:python; toolbar: false; auto-links: false;highlight: 5">== Physical Plan ==<br />GpuColumnarToRow false<br />+- GpuProject [id#0L, v#1, pythonUDF0#74 AS v2#70]<br /> +- GpuCoalesceBatches TargetSize(2147483647)<br /> +- GpuArrowEvalPython [cudf_pandas_plus_one(v#1)], [pythonUDF0#74], 200<br /> +- GpuRowToColumnar TargetSize(2147483647)<br /> +- *(1) Scan ExistingRDD[id#0L,v#1]</pre>
<p>As we can see, it is done by GpuArrowEvalPython. The same plan as above 3.b.</p><p>From Spark Executor log, "<b>GpuArrowPythonRunner</b>" is started to do the work.</p><p>When the job is running, the python daemon processes are "<b>rapids.daemon</b>":</p><pre class="brush:bash; toolbar: false; auto-links: false">python -m rapids.daemon<br />...<br />python -m rapids.daemon</pre><p>For more types of native cuDF pandas UDF, please refer to <a href="https://github.com/NVIDIA/spark-rapids/blob/branch-0.4/integration_tests/src/main/python/udf_cudf_test.py" rel="nofollow" target="_blank">this test python code</a>.<br /></p><h1 style="text-align: left;">References:</h1><p><a href="https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html" rel="nofollow" target="_blank">https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html</a><br /></p><p><br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-87960745958789269272021-03-23T23:51:00.004-07:002021-04-02T08:38:23.371-07:00How to run the pandas cudf_udf test for RAPIDS Accelerator for Apache Spark<h1 style="text-align: left;">Goal:</h1><p>How to run the pandas cudf_udf test for <a href="https://github.com/NVIDIA/spark-rapids" rel="nofollow" target="_blank">RAPIDS Accelerator for Apache Spark</a>.<br /></p><span><a name='more'></a></span><h1 style="text-align: left;">Env:</h1><p>RAPIDS Accelerator for Apache Spark 0.4</p><p>Spark 3.1.1<br /></p><h1 style="text-align: left;">Solution: <br /></h1><h3 style="text-align: left;">1. Compile RAPIDS Accelerator for Apache Spark<br /></h3><h4 style="text-align: left;">1.a Create a conda env for compiling</h4><pre class="brush:bash; toolbar: false; auto-links: false">conda create -n cudftest -c conda-forge python=3.8 pytest pandas pyarrow sre_yield pytest-xdist findspark</pre><p>Here I decide to use one conda env "cudftest" for compiling and use another conda env named "rapids-0.18" to test the cudf_udf in Spark.<br /></p><p>Of course you can choose to use one conda env if you want but it may include too many python packages in the end. </p><p>I just want to keep the conda env "rapids-0.18" to be as small as possible because eventually I need to distribute it to all Executors in Spark cluster. <br /></p><h4 style="text-align: left;">1.b Compile from source code <br /></h4><pre class="brush:bash; toolbar: false; auto-links: false">cd ~/github/spark-rapids<br /># git checkout v0.4.0<br />mvn clean install -DskipTests</pre><p>You can decide which version to compile. Here I am going to compile the 0.15-snapshot which is the current main branch. The current GA release is 0.4 though.</p><h3 style="text-align: left;">2. Run pandas cudf_udf Tests</h3><p>Please follow <a href="https://github.com/NVIDIA/spark-rapids/tree/branch-0.5/integration_tests#enabling-cudf_udf-tests" rel="nofollow" target="_blank">this Doc</a> on how to enable the pandas cudf_udf tests.<br /></p><p>Basically pandas cudf_udf tests are inside "./integration_tests/runtests.py" with option "--cudf_udf".</p><p>The key is to make sure the all the python envs and needed jar file paths are correct.</p><h4 style="text-align: left;">2.a Create a conda env for running cudf_udf tests</h4><p>Please follow the steps mentioned in <a href="https://rapids.ai/start.html">rapids.ai</a> to create the conda env with cudf installed.<br /></p><p>For example: <br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">conda create -n rapids-0.18 -c rapidsai -c nvidia -c conda-forge \<br /> -c defaults cudf=0.18 python=3.7 cudatoolkit=11.0</pre>
<h4 style="text-align: left;">2.b Install needed python packages needed by cudf_udf tests</h4><pre class="brush:bash; toolbar: false; auto-links: false">conda activate rapids-0.18<br />conda install pandas</pre>
<h4 style="text-align: left;">2.c Package your conda env</h4><p>You can refer to <a href="http://alkaline-ml.com/2018-07-02-conda-spark/" rel="nofollow" target="_blank">this blog</a> on how to package your conda env for spark job.<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">cd /home/xxx/miniconda3/envs<br />zip -r rapids-0.18.zip rapids-0.18/<br />mv rapids-0.18.zip ~/<br />cd ~/ && mkdir MYGLOBALENV<br />cd MYGLOBALENV/ && ln -s /home/xxx/miniconda3/envs/rapids-0.18/ rapids-0.18<br />cd ..<br />export PYSPARK_PYTHON=./MYGLOBALENV/rapids-0.18/bin/python</pre>
<h4 style="text-align: left;">2.d Run the pandas cudf_udf tests<br /></h4>
<pre class="brush:bash; toolbar: false; auto-links: false">cd /home/xxx/github/spark-rapids/integration_tests <br />PYSPARK_PYTHON=/home/xxx/MYGLOBALENV/rapids-0.18/bin/python $SPARK_HOME/bin/spark-submit --jars "/home/xxx/github/spark-rapids/dist/target/rapids-4-spark_2.12-0.5.0-SNAPSHOT.jar,/home/xxx/github/spark-rapids/udf-examples/target/rapids-4-spark-udf-examples_2.12-0.5.0-SNAPSHOT.jar,/home/xxx/spark/rapids/cudf.jar,/home/xxx/github/spark-rapids/tests/target/rapids-4-spark-tests_2.12-0.5.0-SNAPSHOT.jar" \<br /> --conf spark.rapids.memory.gpu.allocFraction=0.3 \<br /> --conf spark.rapids.python.memory.gpu.allocFraction=0.3 \<br /> --conf spark.rapids.python.concurrentPythonWorkers=2 \<br /> --py-files "/home/xxx/github/spark-rapids/dist/target/rapids-4-spark_2.12-0.5.0-SNAPSHOT.jar" \<br /> --conf spark.executorEnv.PYTHONPATH="/home/xxx/github/spark-rapids/dist/target/rapids-4-spark_2.12-0.5.0-SNAPSHOT.jar" \<br /> --conf spark.executorEnv.PYSPARK_PYTHON=/home/xxx/rapids-0.18/bin/python \<br /> --archives /home/xxx/rapids-0.18.zip#MYGLOBALENV \<br /> ./runtests.py -m "cudf_udf" -v -rfExXs --cudf_udf </pre>
<p>Note1: Make sure all jar paths are correct.<br /></p><p>Note2: Here I am using spark standalone cluster, that is why I used <b><i>spark.executorEnv.PYSPARK_PYTHON</i></b>. For Spark on YARN, you need to use corresponding parameters such as <b><i>spark.yarn.appMasterEnv.PYSPARK_PYTHON</i></b> .</p><p>Note3: Make sure $SPARK_HOME is set and also the spark cluster is working fine with Rapids for Spark enabled.<br /></p><p>The expected result is: PASSED [100%]. </p><h1 style="text-align: left;">Reference:</h1><p><a href="http://alkaline-ml.com/2018-07-02-conda-spark/" rel="nofollow" target="_blank">http://alkaline-ml.com/2018-07-02-conda-spark/ </a><br /></p><p><br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-38124242423532339472021-03-19T23:24:00.002-07:002021-03-20T15:23:55.206-07:00Understanding RAPIDS Accelerator For Apache Spark's supported timezone<h1 style="text-align: left;">Goal:</h1><p>This article explains the current supported timezone for "<a href="https://nvidia.github.io/spark-rapids/" rel="nofollow" target="_blank">RAPIDS Accelerator For Apache Spark</a>".<span></span></p><a name='more'></a><p></p><h1 style="text-align: left;">Env:</h1><p>RAPIDS Accelerator For Apache Spark 0.4<br /></p><h1 style="text-align: left;">Concept:</h1><p>As per current <a href="https://nvidia.github.io/spark-rapids/docs/compatibility.html" rel="nofollow" target="_blank">0.4 Doc</a> mentions: <br /></p>
<pre class="brush:text; toolbar: false; auto-links: false">operations involving timestamps will only be GPU-accelerated if the time zone used by the JVM is UTC.</pre>
<p>It means if the JVM timezone of the Spark job is not UTC, the operations involving timestamp will be fallback to CPU which result in performance overhead.</p><p>Here it includes non-supported and supported timestamp format conversion.<br /></p><p><b>Note: supported timestamp formats are documented in this <a href="https://nvidia.github.io/spark-rapids/docs/compatibility.html" rel="nofollow" target="_blank">Compatibility doc</a>. </b><br /></p><h1 style="text-align: left;">Test: <br /></h1><p>Below Spark Cluster nodes are using PST timezone. </p><h3 style="text-align: left;">1. PST JVM timezone + supported timestamp format<br /></h3><p>Let's start a spark-shell without any JVM timezone change and run below timestamp conversion on supported format:</p><pre class="brush:sql; toolbar: false; auto-links: false;highlight: [14,15,16,21,30,31,32]">scala> val df_supported = Seq(("2021-12-25 11:11:11")).toDF("ts")<br />df_supported: org.apache.spark.sql.DataFrame = [ts: string]<br /><br />scala> df_supported.write.format("parquet").mode("overwrite").save("/tmp/testts_supported.parquet")<br />21/03/19 21:58:29 WARN GpuOverrides:<br /> !NOT_FOUND <LocalTableScanExec> cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.LocalTableScanExec could be found<br /> @Expression <AttributeReference> ts#4 could run on GPU<br /><br /><br />scala> spark.read.parquet("/tmp/testts_supported.parquet").createOrReplaceTempView("df_supported")<br /><br />scala> spark.sql("select to_timestamp(ts, 'yyyy-MM-dd HH:mm:ss') from df_supported").explain<br />21/03/19 21:58:31 WARN GpuOverrides:<br />!Exec <ProjectExec> cannot run on GPU because unsupported data types in output: TimestampType; not all expressions can be replaced<br /> !Expression <Alias> gettimestamp(ts#7, yyyy-MM-dd HH:mm:ss, Some(UTC), false) AS to_timestamp(ts, yyyy-MM-dd HH:mm:ss)#9 cannot run on GPU because expression Alias gettimestamp(ts#7, yyyy-MM-dd HH:mm:ss, Some(UTC), false) AS to_timestamp(ts, yyyy-MM-dd HH:mm:ss)#9 produces an unsupported type TimestampType; expression GetTimestamp gettimestamp(ts#7, yyyy-MM-dd HH:mm:ss, Some(UTC), false) produces an unsupported type TimestampType<br /> !Expression <GetTimestamp> gettimestamp(ts#7, yyyy-MM-dd HH:mm:ss, Some(UTC), false) cannot run on GPU because expression GetTimestamp gettimestamp(ts#7, yyyy-MM-dd HH:mm:ss, Some(UTC), false) produces an unsupported type TimestampType<br /> @Expression <AttributeReference> ts#7 could run on GPU<br /> @Expression <Literal> yyyy-MM-dd HH:mm:ss could run on GPU<br /><br />== Physical Plan ==<br />*(1) Project [gettimestamp(ts#7, yyyy-MM-dd HH:mm:ss, Some(UTC), false) AS to_timestamp(ts, yyyy-MM-dd HH:mm:ss)#9]<br />+- GpuColumnarToRow false<br /> +- GpuFileGpuScan parquet [ts#7] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/testts_supported.parquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<ts:string><br /><br /><br /><br />scala> spark.sql("select to_timestamp(ts, 'yyyy-MM-dd HH:mm:ss') from df_supported").show<br />21/03/19 21:58:31 WARN GpuOverrides:<br /> !Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced<br /> @Expression <Alias> cast(gettimestamp(ts#7, yyyy-MM-dd HH:mm:ss, Some(UTC), false) as string) AS to_timestamp(ts, yyyy-MM-dd HH:mm:ss)#14 could run on GPU<br /> @Expression <Cast> cast(gettimestamp(ts#7, yyyy-MM-dd HH:mm:ss, Some(UTC), false) as string) could run on GPU<br /> !Expression <GetTimestamp> gettimestamp(ts#7, 
yyyy-MM-dd HH:mm:ss, Some(UTC), false) cannot run on GPU because expression GetTimestamp gettimestamp(ts#7, yyyy-MM-dd HH:mm:ss, Some(UTC), false) produces an unsupported type TimestampType<br /> @Expression <AttributeReference> ts#7 could run on GPU<br /> @Expression <Literal> yyyy-MM-dd HH:mm:ss could run on GPU<br /><br />+-------------------------------------+<br />|to_timestamp(ts, yyyy-MM-dd HH:mm:ss)|<br />+-------------------------------------+<br />| 2021-12-25 11:11:11|<br />+-------------------------------------+</pre>
<p>As you can see above, the operation "to_timestamp" fallback to CPU mode with the keyword in the query plan -- "<span style="color: red;">Project</span>".</p><p>From Spark UI's query plan, we can see "GpuColumnarToRow" and "GpuRowToColumnar".</p><p>This indicates performance overhead since data is moved between GPU and CPU:<br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAZcjnhHstW1-KFrG5lyQhl-Zm6rF0U6mldv1JPr2pNt5vbJIctNBkgszcvuC5BKMNFtGtsc5vZkNEUio_zc7V6umKF3oxx_GcO5ZhNbJggV_EpHMyXEWR7oH1rrxk4_BfkQXmVrzxKk0/s492/Screen+Shot+2021-03-19+at+11.08.39+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="492" data-original-width="234" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAZcjnhHstW1-KFrG5lyQhl-Zm6rF0U6mldv1JPr2pNt5vbJIctNBkgszcvuC5BKMNFtGtsc5vZkNEUio_zc7V6umKF3oxx_GcO5ZhNbJggV_EpHMyXEWR7oH1rrxk4_BfkQXmVrzxKk0/w304-h640/Screen+Shot+2021-03-19+at+11.08.39+PM.png" width="304" /></a></div><p></p><h3 style="text-align: left;">2. UTC JVM timezone + supported timestamp format</h3><p>To make supported timestamp operation work, we do not need to change the timezone of the machines if the machine timezone is not UTC.<br /></p><p>We just need to change the JVM timezone for driver and executor.</p><p>The method is described in <a href="https://github.com/NVIDIA/spark-rapids/blob/branch-0.4/integration_tests/README.md" rel="nofollow" target="_blank">this Doc</a>:</p><ul style="text-align: left;"><li>spark.driver.extraJavaOptions should include -Duser.timezone=UTC</li><li>spark.executor.extraJavaOptions should include -Duser.timezone=UTC</li><li>spark.sql.session.timeZone=UTC <br /></li></ul><p>Then run the same tests in spark-shell after changing JVM timezone to UTC: <br /></p><pre class="brush:text; toolbar: false; auto-links: false">spark-shell --conf spark.sql.session.timeZone=UTC --conf "spark.driver.extraJavaOptions=-Duser.timezone=UTC" --conf "spark.executor.extraJavaOptions=-Duser.timezone=UTC"</pre>
<pre class="brush:sql; toolbar: false; auto-links: false;highlight: 15">scala> val df_supported = Seq(("2021-12-25 11:11:11")).toDF("ts")<br />df_supported: org.apache.spark.sql.DataFrame = [ts: string]<br /><br />scala> df_supported.write.format("parquet").mode("overwrite").save("/tmp/testts_supported.parquet")<br />21/03/20 06:11:56 WARN GpuOverrides:<br /> !NOT_FOUND <LocalTableScanExec> cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.LocalTableScanExec could be found<br /> @Expression <AttributeReference> ts#4 could run on GPU<br /><br /><br />scala> spark.read.parquet("/tmp/testts_supported.parquet").createOrReplaceTempView("df_supported")<br /><br />scala> spark.sql("select to_timestamp(ts, 'yyyy-MM-dd HH:mm:ss') from df_supported").explain<br />== Physical Plan ==<br />GpuColumnarToRow false<br />+- GpuProject [gpugettimestamp(ts#7, yyyy-MM-dd HH:mm:ss, yyyy-MM-dd HH:mm:ss, %Y-%m-%d %H:%M:%S, None) AS to_timestamp(ts, yyyy-MM-dd HH:mm:ss)#9]<br /> +- GpuFileGpuScan parquet [ts#7] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/testts_supported.parquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<ts:string><br /><br /><br /><br />scala> spark.sql("select to_timestamp(ts, 'yyyy-MM-dd HH:mm:ss') from df_supported").show<br />+-------------------------------------+<br />|to_timestamp(ts, yyyy-MM-dd HH:mm:ss)|<br />+-------------------------------------+<br />| 2021-12-25 11:11:11|<br />+-------------------------------------+</pre>
<p>As you can see above, the operation "to_timestamp" now runs in GPU mode with the keyword in the query plan -- "<span style="color: red;">GpuProject</span>". Spark UI shows the same:<br /></p><p> </p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYb1A7VBJu7jrC4Nbnx-OI7G7C6CNMQf_02mMtU7A2GvcssyuyNuuwoYTO72kmQ7dSliE4bHnL7hCVx21-EYpvhObmUJpm5RnedfC5esLoHOde4gNmQ-6gFKwepMQLVVXibJ6OouOItbc/s140/Screen+Shot+2021-03-19+at+11.14.48+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="140" data-original-width="134" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYb1A7VBJu7jrC4Nbnx-OI7G7C6CNMQf_02mMtU7A2GvcssyuyNuuwoYTO72kmQ7dSliE4bHnL7hCVx21-EYpvhObmUJpm5RnedfC5esLoHOde4gNmQ-6gFKwepMQLVVXibJ6OouOItbc/w306-h320/Screen+Shot+2021-03-19+at+11.14.48+PM.png" width="306" /></a></div><br /><p></p><h3 style="text-align: left;">3. UTC JVM timezone + non-supported timestamp format</h3><p>For non-supported timestamp format, it will still fallback to CPU mode.</p><p>For example: "MMM" is not supported in 0.4. <br /></p>
<pre class="brush:sql; toolbar: false; auto-links: false;highlight: [14,16,29,32]">scala> val df_notsupported = Seq(("2021-Dec-25 11:11:11")).toDF("ts")<br />df_notsupported: org.apache.spark.sql.DataFrame = [ts: string]<br /><br />scala> df_notsupported.write.format("parquet").mode("overwrite").save("/tmp/testts_notsupported.parquet")<br />21/03/20 06:15:49 WARN GpuOverrides:<br /> !NOT_FOUND <LocalTableScanExec> cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.LocalTableScanExec could be found<br /> @Expression <AttributeReference> ts#22 could run on GPU<br /><br /><br />scala> spark.read.parquet("/tmp/testts_notsupported.parquet").createOrReplaceTempView("df_notsupported")<br /><br />scala> spark.sql("select to_timestamp(ts, 'yyyy-MMM-dd HH:mm:ss') from df_notsupported").explain<br />21/03/20 06:15:50 WARN GpuOverrides:<br />!Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced<br /> @Expression <Alias> gettimestamp(ts#25, yyyy-MMM-dd HH:mm:ss, Some(UTC), false) AS to_timestamp(ts, yyyy-MMM-dd HH:mm:ss)#27 could run on GPU<br /> !Expression <GetTimestamp> gettimestamp(ts#25, yyyy-MMM-dd HH:mm:ss, Some(UTC), false) cannot run on GPU because incompatible format 'yyyy-MMM-dd HH:mm:ss'. Set spark.rapids.sql.incompatibleDateFormats.enabled=true to force onto GPU.<br /> @Expression <AttributeReference> ts#25 could run on GPU<br /> @Expression <Literal> yyyy-MMM-dd HH:mm:ss could run on GPU<br /><br />== Physical Plan ==<br />*(1) Project [gettimestamp(ts#25, yyyy-MMM-dd HH:mm:ss, Some(UTC), false) AS to_timestamp(ts, yyyy-MMM-dd HH:mm:ss)#27]<br />+- GpuColumnarToRow false<br /> +- GpuFileGpuScan parquet [ts#25] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/testts_notsupported.parquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<ts:string><br /><br /><br /><br />scala> spark.sql("select to_timestamp(ts, 'yyyy-MMM-dd HH:mm:ss') from df_notsupported").show<br />21/03/20 06:15:51 WARN GpuOverrides:<br /> !Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced<br /> @Expression <Alias> cast(gettimestamp(ts#25, yyyy-MMM-dd HH:mm:ss, Some(UTC), false) as string) AS to_timestamp(ts, yyyy-MMM-dd HH:mm:ss)#32 could run on GPU<br /> @Expression <Cast> cast(gettimestamp(ts#25, yyyy-MMM-dd HH:mm:ss, Some(UTC), false) as string) could run on GPU<br /> !Expression <GetTimestamp> gettimestamp(ts#25, yyyy-MMM-dd HH:mm:ss, Some(UTC), false) cannot run on GPU because incompatible format 'yyyy-MMM-dd HH:mm:ss'. Set spark.rapids.sql.incompatibleDateFormats.enabled=true to force onto GPU.<br /> @Expression <AttributeReference> ts#25 could run on GPU<br /> @Expression <Literal> yyyy-MMM-dd HH:mm:ss could run on GPU<br /><br />+--------------------------------------+<br />|to_timestamp(ts, yyyy-MMM-dd HH:mm:ss)|<br />+--------------------------------------+<br />| 2021-12-25 11:11:11|<br />+--------------------------------------+</pre>
<p> </p><p>Below are test code for pyspark users: <br /></p><pre class="brush:python; toolbar: false; auto-links: false">from pyspark.sql.functions import to_timestamp<br />from pyspark.sql import Row<br />df_supported=sc.parallelize([Row(ts='2021-12-25 11:11:11')]).toDF()<br />df_supported.write.format("parquet").mode("overwrite").save("/tmp/testts_supported.parquet")<br />spark.read.parquet('/tmp/testts_supported.parquet').createOrReplaceTempView("df_supported")<br />spark.sql("select to_timestamp(ts, 'yyyy-MM-dd HH:mm:ss') from df_supported").explain()<br />spark.sql("select to_timestamp(ts, 'yyyy-MM-dd HH:mm:ss') from df_supported").show()<br /><br />df_notsupported=sc.parallelize([Row(ts='2021-Dec-25 11:11:11')]).toDF()<br />df_notsupported.write.format("parquet").mode("overwrite").save("/tmp/testts_notsupported.parquet")<br />spark.read.parquet('/tmp/testts_notsupported.parquet').createOrReplaceTempView("df_notsupported")<br />spark.sql("select to_timestamp(ts, 'yyyy-MMM-dd HH:mm:ss') from df_notsupported").explain()<br />spark.sql("select to_timestamp(ts, 'yyyy-MMM-dd HH:mm:ss') from df_notsupported").show()<br /></pre><p>Note: there is one parameter "<b><i>spark.rapids.sql.incompatibleDateFormats.enabled</i></b>" which does below:</p><p>"When parsing strings as dates and timestamps in functions like unix_timestamp, setting this to true will force all parsing onto GPU even for formats that can result in incorrect results when parsing invalid inputs."<br /></p><p> </p><p><br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-38409995135123544102021-03-18T22:50:00.003-07:002021-03-18T22:50:55.496-07:00Spark Tuning -- Adaptive Query Execution(3): Dynamically optimizing skew joins<h1 style="text-align: left;">Goal:</h1><p>This article explains Adaptive Query Execution (AQE)'s "Dynamically optimizing skew joins" feature introduced in Spark 3.0. </p>This is a follow up article for <a href="http://www.openkb.info/2021/03/spark-tuning-adaptive-query-execution1.html" target="_blank">Spark Tuning -- Adaptive Query Execution(1): Dynamically coalescing shuffle partitions</a>, and <a href="http://www.openkb.info/2021/03/spark-tuning-adaptive-query-execution2.html" rel="nofollow" target="_blank">Spark Tuning -- Adaptive Query Execution(2): Dynamically switching join strategies</a>.<span><a name='more'></a></span><h1 style="text-align: left;">Env:</h1><p>Spark 3.0.2 <br /></p><h1 style="text-align: left;">Concept:</h1><p>This article focuses on 3rd feature "Dynamically optimizing skew joins" in AQE.</p>As <a href="https://spark.apache.org/docs/latest/sql-performance-tuning.html" rel="nofollow" target="_blank">Spark Performance Tuning guide</a> described: <p>This feature dynamically handles skew in sort-merge join by splitting (and replicating if needed) skewed tasks into roughly evenly sized tasks. 
</p><p>Below picture from <a href="https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html" rel="nofollow" target="_blank">databricks blog</a> describes well:<br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjW5tIi-lPgUg_oT2VBj5Upsouo1U-5CXJT0X6BLWaqS8OmiOlw8RQERs7okn8bwN2AOhJQaBVHCjCqZyqmbWbJdY3zVG-Wa0n3NNhk0Rkte6BjjKhIMbYX3qStWUI9v8yhVVVuyZkUmOg/s1194/blog-adaptive-query-execution-6.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="476" data-original-width="1194" height="256" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjW5tIi-lPgUg_oT2VBj5Upsouo1U-5CXJT0X6BLWaqS8OmiOlw8RQERs7okn8bwN2AOhJQaBVHCjCqZyqmbWbJdY3zVG-Wa0n3NNhk0Rkte6BjjKhIMbYX3qStWUI9v8yhVVVuyZkUmOg/w640-h256/blog-adaptive-query-execution-6.png" width="640" /></a></div>Below 2 parameters determines a "skew partition". It has to meet both of below 2 conditions:<p></p><ul style="text-align: left;"><li>a. Its partition size > <b><i>spark.sql.adaptive.skewJoin.skewedPartitionFactor </i></b>(default=10) * "median partition size"</li><li>b. Its partition size > <b><i>spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes </i></b>(default = 256MB)</li></ul><p>The source code of this feature is inside <a href="https://github.com/apache/spark/blob/v3.0.2/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala" rel="nofollow" target="_blank">org.apache.spark.sql.execution.adaptive.OptimizeSkewedJoin</a> .</p><p>Before doing below tests, we can enable log4j DEBUG for above java class so that it can help print the sizes of those partitions. For example, we can put below line in log4j.properties:<br /></p>
<pre class="brush:text; toolbar: false; auto-links: false">log4j.logger.org.apache.spark.sql.execution.adaptive.OptimizeSkewedJoin=DEBUG</pre>
<div style="text-align: left;">And then ask executor to use this log4j file: <br /></div><pre class="brush:text; toolbar: false; auto-links: false">spark.executor.extraJavaOptions '-Dlog4j.configuration=$SPARK_HOME/conf/log4j.properties'</pre>
<h1 style="text-align: left;">Solution:</h1><p>As per databricks blog "<a href="https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html" rel="nofollow" target="_blank">Adaptive Query Execution: Speeding Up Spark SQL at Runtime</a>", it has a pretty good <a href="https://docs.databricks.com/_static/notebooks/aqe-demo.html?_ga=2.133851022.1405204434.1615827502-183867879.1614812672" rel="nofollow" target="_blank">demo notebook</a> which I will use for the following tests. The query which contains skew data is:</p>
<pre class="brush:sql; toolbar: false; auto-links: false">use aqe_demo_db;<br /><br />SELECT s_date, sum(s_quantity * i_price) AS total_sales<br />FROM sales<br />JOIN items ON i_item_id = s_item_id<br />GROUP BY s_date<br />ORDER BY total_sales DESC;</pre>
<h3 style="text-align: left;">1. AQE off <br /></h3><p>This is default run without AQE. Query duration is 6.4min in my test lab.</p><p>Because data skew exists in "sales" table with "s_item_id=100"(80% of the data), the default run will result in a long running SortMergeJoin(SMJ). </p><p>One task in the Shuffle Phase will Shuffle Read 5.8GB data while other 199 tasks only read 14.3MB data in average. It also result in huge spilling on disk.<br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoEtIE8-2Ty6fBFemyPB_EM8qn5tFNTmH0NtTG5lqVqTG9ku2xhlxEwO78EXGhYrQG580Qxbs1UegBjzcVRC2jc42L7jeG49in6l-pHMWVR6Jkhi4McguepE8Ic6ctBEzzJspvm5ZKYEw/s868/Screen+Shot+2021-03-18+at+9.56.13+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="256" data-original-width="868" height="188" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoEtIE8-2Ty6fBFemyPB_EM8qn5tFNTmH0NtTG5lqVqTG9ku2xhlxEwO78EXGhYrQG580Qxbs1UegBjzcVRC2jc42L7jeG49in6l-pHMWVR6Jkhi4McguepE8Ic6ctBEzzJspvm5ZKYEw/w640-h188/Screen+Shot+2021-03-18+at+9.56.13+PM.png" width="640" /></a></div>Spilling monitoring:<p></p>
<pre class="brush:bash; toolbar: false; auto-links: false">$ pwd<br />/tmp/spark-40b20c3b-7f04-4cc4-9134-7d64be53f919/executor-473da0f7-2d70-4565-91e7-ba5f3ea12a8a/blockmgr-78d16f8e-fc7b-4190-a08e-f96f57aabf97<br />$ find . -name *.*<br />.<br />./20/shuffle_2_91_0.index<br />./20/shuffle_2_219_0.data<br />./34/shuffle_2_170_0.index<br />./34/shuffle_2_192_0.index<br />...</pre>
<p>Jstack on executor process also shows:<br /></p><pre class="brush:bash; toolbar: false; auto-links: false">"Executor task launch worker for task 90.0 in stage 2.0 (TID 124)" #54 daemon prio=5 os_prio=0 cpu=222822.75ms elapsed=237.59s tid=0x00007f81000d2000 nid=0xe50 runnable [0x00007f8150e18000]<br /> java.lang.Thread.State: RUNNABLE<br /> at net.jpountz.xxhash.XXHashJNI.XXH32_update(Native Method)<br /> at net.jpountz.xxhash.StreamingXXHash32JNI.update(StreamingXXHash32JNI.java:67)<br /> - locked <0x0000000735011230> (a net.jpountz.xxhash.StreamingXXHash32JNI)<br /> at net.jpountz.xxhash.StreamingXXHash32$1.update(StreamingXXHash32.java:119)<br /> at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:206)<br /> at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:176)<br /> at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:260)<br /> at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:136)<br /> at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:544)<br /> at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:228)<br /> at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:208)<br /> - locked <0x0000000581600ea8> (a org.apache.spark.memory.TaskMemoryManager)<br /> at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:289)<br /> at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:95)<br /> at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:361)<br /> at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.allocateMemoryForRecordIfNecessary(UnsafeExternalSorter.java:417)<br /> at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:455)<br /> at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:138)<br /> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.sort_addToSorter_0$(Unknown Source)<br /> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)<br /> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)<br /> at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)<br /> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.findNextInnerJoinRows$(Unknown Source)<br /> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.agg_doAggregateWithKeys_0$(Unknown Source)<br /> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown Source)<br /> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)<br /> at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:774)<br /> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)<br /> at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)<br /> at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)<br /> at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)<br /> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)<br /> at org.apache.spark.scheduler.Task.run(Task.scala:131)<br /> at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)<br /> at org.apache.spark.executor.Executor$TaskRunner$$Lambda$539/0x00000008404f9440.apply(Unknown Source)<br /> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)<br /> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)<br /> at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.10/ThreadPoolExecutor.java:1128)<br /> at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.10/ThreadPoolExecutor.java:628)<br /> at java.lang.Thread.run(java.base@11.0.10/Thread.java:834)</pre>
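<p>Before relying on AQE, it can also help to confirm which join-key values are actually causing the skew. The snippet below is only a sketch: the table name "factTable" and the key column "join_key" are hypothetical placeholders, not names used in this test.</p>
<pre class="brush:scala; toolbar: false; auto-links: false">// Sketch: count rows per join key on the larger side of the join and look at the heaviest keys.
// "factTable" and "join_key" are hypothetical names -- replace them with your own table and key column.
import org.apache.spark.sql.functions.desc

val keyCounts = spark.table("factTable")
  .groupBy("join_key")
  .count()
  .orderBy(desc("count"))

keyCounts.show(10, false)   // the top rows are the candidate skewed keys</pre>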
<h3 style="text-align: left;"> 2. AQE on<br /></h3>
<pre class="brush:sql; toolbar: false; auto-links: false">set spark.sql.adaptive.enabled=true;</pre>
<p>This is the default run with AQE on. Query duration is 2.4min in my test lab.</p><p>The debug log below shows that the skewed partition is about 6GB, and AQE splits it into 30 parts:<br /></p><pre class="brush:bash; toolbar: false; auto-links: false">Optimizing skewed join.<br />Left side partitions size info:<br />median size: 13972650, max size: 6517549080, min size: 13972650, avg size: 46499052<br />Right side partitions size info:<br />median size: 1549072, max size: 1549072, min size: 1549072, avg size: 1549072<br /><br />DEBUG OptimizeSkewedJoin: Left side partition 23 (6 GB) is skewed, split it into 30 parts.<br />DEBUG OptimizeSkewedJoin: number of skewed partitions: left 1, right 0</pre>
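<p>(In case these DEBUG messages do not show up in your logs: they are emitted by the OptimizeSkewedJoin rule on the driver. Below is a sketch of one way to surface them from spark-shell; the logger name assumes the Spark 3.0.x package, so please verify it against your Spark version.)</p>
<pre class="brush:scala; toolbar: false; auto-links: false">// Sketch: enable DEBUG logging only for the AQE skew join rule (Spark 3.0.x still ships log4j 1.x).
// The logger name assumes the class org.apache.spark.sql.execution.adaptive.OptimizeSkewedJoin.
import org.apache.log4j.{Level, Logger}

Logger.getLogger("org.apache.spark.sql.execution.adaptive.OptimizeSkewedJoin").setLevel(Level.DEBUG)</pre>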
<p>Extra "<span style="color: red;">CustomShuffleReader</span>" also shows skew partition information. </p><p>This stage has 81 partitions, which include 51 normal partitions + 30 skewed partitions.</p><p>It means, if AQE did not trigger this skew optimization, the original partition size should be <b>52</b>. (Remember this number -- 52 because it will show up later.)<br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoCxykUR5vbxx3qtqKAzekVynZrEB_9bxiBIT5wnoz111bapC-Nx_xxJ6WZjFLkwR8zLmLEIXj1HY152Jeed2_1PVaEUzqgcmUKexN4eovh0ghcfkmBP7yCrMZ4IjKoatMlDdhI3Y1-PY/s1009/Screen+Shot+2021-03-18+at+10.06.59+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1009" data-original-width="975" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoCxykUR5vbxx3qtqKAzekVynZrEB_9bxiBIT5wnoz111bapC-Nx_xxJ6WZjFLkwR8zLmLEIXj1HY152Jeed2_1PVaEUzqgcmUKexN4eovh0ghcfkmBP7yCrMZ4IjKoatMlDdhI3Y1-PY/w618-h640/Screen+Shot+2021-03-18+at+10.06.59+PM.png" width="618" /></a></div><p></p><h3 style="text-align: left;"> 3. AQE on + increased spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes<br /></h3><pre class="brush:sql; toolbar: false; auto-links: false">set spark.sql.adaptive.enabled=true;<br />set spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=6517549081;<br /></pre>
<p>Query duration is 6.6min in my test lab. <br /></p><p>Here I am testing spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes, so I simply set it to 1 + <the max partition size above> (6517549080 + 1 = 6517549081). The goal is to not trigger the skew join optimization.<br /></p><p>The debug log below shows that no skewed partition was found.<br /></p><pre class="brush:sql; toolbar: false; auto-links: false">DEBUG OptimizeSkewedJoin:<br />Optimizing skewed join.<br />Left side partitions size info:<br />median size: 13972650, max size: 6517549080, min size: 13972650, avg size: 46499052<br />Right side partitions size info:<br />median size: 1549072, max size: 1549072, min size: 1549072, avg size: 1549072<br /><br />DEBUG OptimizeSkewedJoin: number of skewed partitions: left 0, right 0<br />DEBUG OptimizeSkewedJoin: OptimizeSkewedJoin rule is not applied due to additional shuffles will be introduced.<br /></pre>
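<p>Why does raising the threshold to 6517549081 disable the optimization? My understanding (a sketch based on reading OptimizeSkewedJoin in Spark 3.0.x, so please verify against your version) is that a partition is only treated as skewed when it is both larger than spark.sql.adaptive.skewJoin.skewedPartitionFactor (default 5) times the median partition size and larger than spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes (default 256MB):</p>
<pre class="brush:scala; toolbar: false; auto-links: false">// Sketch of the per-partition skew check (my reading of Spark 3.0.x OptimizeSkewedJoin, not the exact source):
def isSkewed(size: Long, medianSize: Long, factor: Double, thresholdBytes: Long): Boolean =
  size > medianSize * factor && size > thresholdBytes

// Numbers taken from the debug logs above:
val maxSize    = 6517549080L   // the ~6GB skewed partition
val medianSize = 13972650L     // ~13MB median partition size

isSkewed(maxSize, medianSize, 5, 256L * 1024 * 1024)  // true  -> test #2 splits this partition
isSkewed(maxSize, medianSize, 5, 6517549081L)         // false -> test #3 leaves it alone</pre>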
<p>Now we can see "<span style="color: red;">CustomShuffleReader</span>" only spawns <b>52</b> partitions from UI:<br /></p><p style="text-align: left;"></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgP21o421xlmPtJiwy3zQ2t_GVXVBvKiOze6Av5ST32YBLKQZijgrZwensZhx0FnSWVt4b5FmUHyiUVc5qZCpM1VyIw1mNoDGXtUdUn1gYjEvPdGjLuwodgk84fda_PJvnxPW1bNeJFles/s917/Screen+Shot+2021-03-18+at+10.16.52+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="635" data-original-width="917" height="444" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgP21o421xlmPtJiwy3zQ2t_GVXVBvKiOze6Av5ST32YBLKQZijgrZwensZhx0FnSWVt4b5FmUHyiUVc5qZCpM1VyIw1mNoDGXtUdUn1gYjEvPdGjLuwodgk84fda_PJvnxPW1bNeJFles/w640-h444/Screen+Shot+2021-03-18+at+10.16.52+PM.png" width="640" /></a></div><p></p><h3 style="text-align: left;"> 4. GPU Mode with AQE off</h3><p>Now let's try the same minimum query using <a href="https://nvidia.github.io/spark-rapids/" rel="nofollow" target="_blank">Rapids for Spark Accelerator</a>(current release 0.3) + Spark. </p><p>Query duration is 26s in my test lab. (Yes only 26s without AQE on!)<br /></p><p>No debug log triggered since AQE is off.</p><p>GPU Mode will trigger GPU version ShuffleHashJoin(SHJ) which is super fast even without AQE:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQ42jB0n38bso2d9iACGRwBnJ6FG2pgfcjfEJuSRcrJgTytzs965U_ArNh0u55xMpsk97QjPvrQK05WAnLqk88SzfI_qvQjt7Hd88Se0IpxGUfOOZKSLmzSuPgXZbjbuihJEM8Qq_6eeA/s1064/Screen+Shot+2021-03-18+at+10.36.16+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1064" data-original-width="876" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQ42jB0n38bso2d9iACGRwBnJ6FG2pgfcjfEJuSRcrJgTytzs965U_ArNh0u55xMpsk97QjPvrQK05WAnLqk88SzfI_qvQjt7Hd88Se0IpxGUfOOZKSLmzSuPgXZbjbuihJEM8Qq_6eeA/w526-h640/Screen+Shot+2021-03-18+at+10.36.16+PM.png" width="526" /></a></div><p>There are only 2 partitions/tasks for shuffle stage. </p><p>From the Stage-20 metrics below we can see even though there is huge data skew, the skewed task only took 15s to compete. Thanks to Apache Arrow columnar memory format. <br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjnXaW_vklQp3-Cy2yCkt02YbqU2NTz3B6-RgXe2Dti6rEmWRRDVeNpfHUDW7CKHyZTALrfyJi_UxCSIdGWA66_YKT5ti8wEJWCVIErOKBXVcbtfOkvCf-2iMXyn8VO2dZcGA-sRkSabvs/s699/Screen+Shot+2021-03-18+at+10.36.53+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="251" data-original-width="699" height="230" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjnXaW_vklQp3-Cy2yCkt02YbqU2NTz3B6-RgXe2Dti6rEmWRRDVeNpfHUDW7CKHyZTALrfyJi_UxCSIdGWA66_YKT5ti8wEJWCVIErOKBXVcbtfOkvCf-2iMXyn8VO2dZcGA-sRkSabvs/w640-h230/Screen+Shot+2021-03-18+at+10.36.53+PM.png" width="640" /></a></div><h3 style="text-align: left;">5. GPU Mode with AQE on<br /></h3><pre class="brush:sql; toolbar: false; auto-links: false">set spark.sql.adaptive.enabled = true;</pre><p>Query duration is 25s in my test lab. <br /></p><p>Below debug log shows smaller partition sizes under GPU mode comparing to CPU mode:<br /></p>
<pre class="brush:bash; toolbar: false; auto-links: false">DEBUG OptimizeSkewedJoin:<br />Optimizing skewed join.<br />Left side partitions size info:<br />median size: 3645779055, max size: 6266874120, min size: 1024683990, avg size: 3645779055<br />Right side partitions size info:<br />median size: 112912836, max size: 112912836, min size: 112912836, avg size: 112912836<br /><br />DEBUG OptimizeSkewedJoin: number of skewed partitions: left 0, right 0<br />DEBUG OptimizeSkewedJoin: OptimizeSkewedJoin rule is not applied due to additional shuffles will be introduced.</pre>
<p>Here is because GPU mode does not have SMJ implemented yet as of today. So this AQE feature can not apply here. That is why you see no skewed partition found and it is still using GPU version ShuffleHashJoin.<br /></p><p>However the query plan is a little different here, and AQE does spawns 2 extra "<span style="color: red;">GpuCustomShuffleReader</span>":<br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdXZQEh7-5TQnr_AdIvQHcNeIuqHYLIfJz0PUig2Yvjr9XUkIphDz4b0qqwHwC8mtnO2ZSs85wpjgFUaFQ2g9VgBjxZcWBzSySnxD6SMHYKAH6ZgzWQJi6aKS7wnEhlDh33ddiwTohWsw/s919/Screen+Shot+2021-03-18+at+10.23.47+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="919" data-original-width="815" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdXZQEh7-5TQnr_AdIvQHcNeIuqHYLIfJz0PUig2Yvjr9XUkIphDz4b0qqwHwC8mtnO2ZSs85wpjgFUaFQ2g9VgBjxZcWBzSySnxD6SMHYKAH6ZgzWQJi6aKS7wnEhlDh33ddiwTohWsw/w568-h640/Screen+Shot+2021-03-18+at+10.23.47+PM.png" width="568" /></a></div><h1 style="text-align: left;">Reference:</h1><ul><li><a href="https://docs.databricks.com/_static/notebooks/aqe-demo.html?_ga=2.133851022.1405204434.1615827502-183867879.1614812672" rel="nofollow" target="_blank">https://docs.databricks.com/_static/notebooks/aqe-demo.html?_ga=2.133851022.1405204434.1615827502-183867879.1614812672</a> </li><li><a href="https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html" rel="nofollow" target="_blank">https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html</a> </li><li><a href="https://spark.apache.org/docs/latest/sql-performance-tuning.html" rel="nofollow" target="_blank">https://spark.apache.org/docs/latest/sql-performance-tuning.html</a> <br /></li></ul><p><br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-19559343512892058792021-03-18T17:18:00.005-07:002021-03-18T17:20:15.389-07:00What Dataset API is not supported for RAPIDS Accelerator for Apache Spark<h1 style="text-align: left;">Goal:</h1><div style="text-align: left;">This article explains what Dataset API is not supported for RAPIDS Accelerator for Apache Spark.<span><a name='more'></a></span></div><h1 style="text-align: left;">Env:</h1><div style="text-align: left;">Spark 3.0.2</div><div style="text-align: left;">RAPIDS Accelerator for Apache Spark 0.3</div><h1 style="text-align: left;">Solution: </h1><div style="text-align: left;">Currently <a href="https://nvidia.github.io/spark-rapids/" rel="nofollow" target="_blank">RAPIDS Accelerator for Apache Spark</a> does not support Dataset API but does support Dataframe API.</div><div style="text-align: left;">As we know, basically Dataframe is Dataset[ROW], then what does it mean? </div><div style="text-align: left;">In general the difference is that Dataset API can provide type-safety at compile time and also typed JVM objects comparing to Dataframe API.<br /></div><div style="text-align: left;">If you are leveraging Dataset API's compile time error check feature, the operator may not be able to run on GPU.<br /></div><div style="text-align: left;">Here is one easy example in spark-shell using scala:<br /></div><div style="text-align: left;"><h4><b>1. Create a sample Dataset</b></h4></div>
<pre class="brush:sql; toolbar: false; auto-links: false">import org.apache.spark.sql.Dataset<br />case class customer (<br /> c_customer_sk: Int,<br /> c_customer_id: String,<br /> c_current_cdemo_sk: Int,<br /> c_current_hdemo_sk: Int,<br /> c_current_addr_sk: Int<br />)<br /><br />val df=spark.sql("select c_customer_sk,c_customer_id,c_current_cdemo_sk,c_current_hdemo_sk,c_current_addr_sk from tpcds.customer limit 10")<br />val ds: Dataset[customer] = df.as[customer]</pre>
<div style="text-align: left;"><h4 style="text-align: left;">2. Working on GPU</h4></div>
<pre class="brush:xml; toolbar: false; auto-links: false;highlight: 4">scala> ds.filter($"c_customer_sk" > 0).explain<br />== Physical Plan ==<br />GpuColumnarToRow false<br />+- GpuFilter (gpuisnotnull(c_customer_sk#0) AND (c_customer_sk#0 > 0))<br /> +- GpuGlobalLimit 10<br /> +- GpuShuffleCoalesce 2147483647<br /> +- GpuColumnarExchange gpusinglepartitioning(), ENSURE_REQUIREMENTS, [id=#244]<br /> +- GpuLocalLimit 10<br /> +- GpuFileGpuScan parquet tpcds.customer[c_customer_sk#0,c_customer_id#1,c_current_cdemo_sk#2,c_current_hdemo_sk#3,c_current_addr_sk#4] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/home/xxx/data/tpcds_100G_parquet/customer], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c_customer_sk:int,c_customer_id:string,c_current_cdemo_sk:int,c_current_hdemo_sk:int,c_cur...</pre>
<div style="text-align: left;">Here we specify the exact column name and as you can see Filter is running on GPU.</div><div style="text-align: left;"><h3 style="text-align: left;">3. Not working on GPU</h3></div>
<pre class="brush:xml; toolbar: false; auto-links: false;highlight: 3">scala> ds.filter(_.c_customer_sk > 0).explain<br />== Physical Plan ==<br />*(1) Filter $line23.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$Lambda$3763/0x00000008415be040@75f92fdc.apply<br />+- GpuColumnarToRow false<br /> +- GpuGlobalLimit 10<br /> +- GpuShuffleCoalesce 2147483647<br /> +- GpuColumnarExchange gpusinglepartitioning(), ENSURE_REQUIREMENTS, [id=#201]<br /> +- GpuLocalLimit 10<br /> +- GpuFileGpuScan parquet tpcds.customer[c_customer_sk#0,c_customer_id#1,c_current_cdemo_sk#2,c_current_hdemo_sk#3,c_current_addr_sk#4] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/home/xxx/data/tpcds_100G_parquet/customer], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c_customer_sk:int,c_customer_id:string,c_current_cdemo_sk:int,c_current_hdemo_sk:int,c_cur...</pre>
<div style="text-align: left;">Here we are trying to access the column inside the typed JVM object at compile time, so the Filter can not run on GPU. </div><div style="text-align: left;">Above Filter is actually an opaque Lamda function in Catalyst plan. <br /></div><div style="text-align: left;">But other operators like FileScan is running on GPU.<br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;">If we set <b><i>spark.rapids.sql.explain</i></b>=NOT_ON_GPU we can see the reasons:<br /></div><pre class="brush:text; toolbar: false; auto-links: false">!Exec <FilterExec> cannot run on GPU because not all expressions can be replaced<br /> !NOT_FOUND <Invoke> $line18.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$Lambda$3146/0x000000084137e840@f053608.apply cannot run on GPU because no GPU enabled version of expression class org.apache.spark.sql.catalyst.expressions.objects.Invoke could be found<br /> !Expression <Literal> $line18.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$Lambda$3146/0x000000084137e840@f053608 cannot run on GPU because expression Literal $line18.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$Lambda$3146/0x000000084137e840@f053608 produces an unsupported type ObjectType(interface scala.Function1)<br /> !NOT_FOUND <NewInstance> newInstance(class $line15.$read$$iw$$iw$customer) cannot run on GPU because no GPU enabled version of expression class org.apache.spark.sql.catalyst.expressions.objects.NewInstance could be found<br /> !NOT_FOUND <AssertNotNull> assertnotnull(c_customer_sk#0) cannot run on GPU because no GPU enabled version of expression class org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull could be found<br /> @Expression <AttributeReference> c_customer_sk#0 could run on GPU<br /> !NOT_FOUND <Invoke> c_customer_id#1.toString cannot run on GPU because no GPU enabled version of expression class org.apache.spark.sql.catalyst.expressions.objects.Invoke could be found<br /> @Expression <AttributeReference> c_customer_id#1 could run on GPU<br /> !NOT_FOUND <AssertNotNull> assertnotnull(c_current_cdemo_sk#2) cannot run on GPU because no GPU enabled version of expression class org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull could be found<br /> @Expression <AttributeReference> c_current_cdemo_sk#2 could run on GPU<br /> !NOT_FOUND <AssertNotNull> assertnotnull(c_current_hdemo_sk#3) cannot run on GPU because no GPU enabled version of expression class org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull could be found<br /> @Expression <AttributeReference> c_current_hdemo_sk#3 could run on GPU<br /> !NOT_FOUND <AssertNotNull> assertnotnull(c_current_addr_sk#4) cannot run on GPU because no GPU enabled version of expression class org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull could be found<br /> @Expression <AttributeReference> c_current_addr_sk#4 could run on GPU</pre><h1 style="text-align: left;"><br /></h1>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-77093069713622073552021-03-17T22:49:00.005-07:002021-03-18T21:33:58.407-07:00Spark Tuning -- Adaptive Query Execution(2): Dynamically switching join strategies<h1 style="text-align: left;">Goal:</h1><p>This article explains Adaptive Query Execution (AQE)'s "Dynamically switching join strategies" feature introduced in Spark 3.0. 
</p><p>This is a follow up article for <a href="http://www.openkb.info/2021/03/spark-tuning-adaptive-query-execution1.html" target="_blank">Spark Tuning -- Adaptive Query Execution(1): Dynamically coalescing shuffle partitions</a>.<br /></p><span><a name='more'></a></span><h1 style="text-align: left;">Env:</h1><p>Spark 3.0.2 <br /></p><h1 style="text-align: left;">Concept: <br /></h1><p>This article focuses on 2nd feature "Dynamically switching join strategies" in AQE.</p><p>As <a href="https://spark.apache.org/docs/latest/sql-performance-tuning.html" rel="nofollow" target="_blank">Spark Performance Tuning guide</a> described: <br /></p><p>AQE converts sort-merge join to broadcast hash join when the runtime statistics of any join side is smaller than the broadcast hash join threshold. </p><p>This is not as efficient as planning a broadcast hash join in the first place, but it’s better than keep doing the sort-merge join, as we can save the sorting of both the join sides, and read shuffle files locally to save network traffic(if <b><i>spark.sql.adaptive.localShuffleReader.enabled</i></b> is true) <br /></p><h1 style="text-align: left;">Solution:</h1><p>As per databricks blog "<a href="https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html" rel="nofollow" target="_blank">Adaptive Query Execution: Speeding Up Spark SQL at Runtime</a>", it has a pretty good <a href="https://docs.databricks.com/_static/notebooks/aqe-demo.html?_ga=2.133851022.1405204434.1615827502-183867879.1614812672" rel="nofollow" target="_blank">demo notebook</a> which I will use for the following tests. <br /></p><h3 style="text-align: left;">1. AQE off (default)<br /></h3>
<pre class="brush:sql; toolbar: false; auto-links: false">EXPLAIN cost<br />SELECT s_date, sum(s_quantity * i_price) AS total_sales<br />FROM sales<br />JOIN items ON s_item_id = i_item_id<br />WHERE i_price < 10<br />GROUP BY s_date<br />ORDER BY total_sales DESC;</pre>
<p>The explain plan:<br /></p>
<pre class="brush:sql; toolbar: false; auto-links: false;highlight: [8,18]">== Optimized Logical Plan ==<br />Sort [total_sales#10L DESC NULLS LAST], true, Statistics(sizeInBytes=368.1 PiB)<br />+- Aggregate [s_date#18], [s_date#18, sum(cast((s_quantity#17 * i_price#20) as bigint)) AS total_sales#10L], Statistics(sizeInBytes=368.1 PiB)<br /> +- Project [s_quantity#17, s_date#18, i_price#20], Statistics(sizeInBytes=368.1 PiB)<br /> +- Join Inner, (cast(s_item_id#16 as bigint) = i_item_id#19L), Statistics(sizeInBytes=589.0 PiB)<br /> :- Filter isnotnull(s_item_id#16), Statistics(sizeInBytes=3.7 GiB)<br /> : +- Relation[s_item_id#16,s_quantity#17,s_date#18] parquet, Statistics(sizeInBytes=3.7 GiB)<br /> +- Filter ((isnotnull(i_price#20) AND (i_price#20 < 10)) AND isnotnull(i_item_id#19L)), Statistics(sizeInBytes=157.6 MiB)<br /> +- Relation[i_item_id#19L,i_price#20] parquet, Statistics(sizeInBytes=157.6 MiB)<br /><br />== Physical Plan ==<br />*(7) Sort [total_sales#10L DESC NULLS LAST], true, 0<br />+- Exchange rangepartitioning(total_sales#10L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#109]<br /> +- *(6) HashAggregate(keys=[s_date#18], functions=[sum(cast((s_quantity#17 * i_price#20) as bigint))], output=[s_date#18, total_sales#10L])<br /> +- Exchange hashpartitioning(s_date#18, 200), ENSURE_REQUIREMENTS, [id=#105]<br /> +- *(5) HashAggregate(keys=[s_date#18], functions=[partial_sum(cast((s_quantity#17 * i_price#20) as bigint))], output=[s_date#18, sum#24L])<br /> +- *(5) Project [s_quantity#17, s_date#18, i_price#20]<br /> +- *(5) SortMergeJoin [cast(s_item_id#16 as bigint)], [i_item_id#19L], Inner<br /> :- *(2) Sort [cast(s_item_id#16 as bigint) ASC NULLS FIRST], false, 0<br /> : +- Exchange hashpartitioning(cast(s_item_id#16 as bigint), 200), ENSURE_REQUIREMENTS, [id=#87]<br /> : +- *(1) Filter isnotnull(s_item_id#16)<br /> : +- *(1) ColumnarToRow<br /> : +- FileScan parquet aqe_demo_db.sales[s_item_id#16,s_quantity#17,s_date#18] Batched: true, DataFilters: [isnotnull(s_item_id#16)], Format: Parquet, Location: InMemoryFileIndex[file:/home/xxx/data/warehouse/aqe_demo_db.db/sales], PartitionFilters: [], PushedFilters: [IsNotNull(s_item_id)], ReadSchema: struct<s_item_id:int,s_quantity:int,s_date:date><br /> +- *(4) Sort [i_item_id#19L ASC NULLS FIRST], false, 0<br /> +- Exchange hashpartitioning(i_item_id#19L, 200), ENSURE_REQUIREMENTS, [id=#96]<br /> +- *(3) Filter ((isnotnull(i_price#20) AND (i_price#20 < 10)) AND isnotnull(i_item_id#19L))<br /> +- *(3) ColumnarToRow<br /> +- FileScan parquet aqe_demo_db.items[i_item_id#19L,i_price#20] Batched: true, DataFilters: [isnotnull(i_price#20), (i_price#20 < 10), isnotnull(i_item_id#19L)], Format: Parquet, Location: InMemoryFileIndex[file:/home/xxx/data/warehouse/aqe_demo_db.db/items], PartitionFilters: [], PushedFilters: [IsNotNull(i_price), LessThan(i_price,10), IsNotNull(i_item_id)], ReadSchema: struct<i_item_id:bigint,i_price:int></pre>
<p> From the "Optimized Logical Plan", the estimated size of smaller side table "items" after filter "i_price<10" is 157.6MB which is larger than the default <b><i>spark.sql.autoBroadcastJoinThreshold </i></b>(10MB). As a result, a Sort Merge Join(SMJ) is chosen.</p><p>When we check the Spark UI after the query finishes, we found out that the actual size of the "smaller" join side is only 6.9MB which means the estimation is not very accurate:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjn13E3OtSthhHswMTfJCJdUyuYNylyRw2T4rR529si5E7T_pbst9QfLFjNvlj49O8M1ELGKS1ytWTeRxDpbHGZsPLwFWoBwQwPy9y42yLKjAYSKs77QrNn9VMJlWCCekAKNv16JS3yz1M/s869/Screen+Shot+2021-03-17+at+10.08.30+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="819" data-original-width="869" height="604" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjn13E3OtSthhHswMTfJCJdUyuYNylyRw2T4rR529si5E7T_pbst9QfLFjNvlj49O8M1ELGKS1ytWTeRxDpbHGZsPLwFWoBwQwPy9y42yLKjAYSKs77QrNn9VMJlWCCekAKNv16JS3yz1M/w640-h604/Screen+Shot+2021-03-17+at+10.08.30+PM.png" width="640" /></a></div>As we know, normally the best performant join type is Broadcast Hash Join(BHJ) if one side is small enough to be broadcasted. <p></p><p>In this case, how can we let Spark be smart enough to change the plan to BHJ from SMJ at runtime? AQE is here to help us.<br /></p><h3 style="text-align: left;">2. AQE on<br /></h3>
<pre class="brush:sql; toolbar: false; auto-links: false">set spark.sql.adaptive.enabled=true;</pre>
<p>After AQE is turned on, the explain plan does not change much, except for the new "AdaptiveSparkPlan" node:<br /></p>
<pre class="brush:sql; toolbar: false; auto-links: false;highlight: [8,12,19]">== Optimized Logical Plan ==<br />Sort [total_sales#35L DESC NULLS LAST], true, Statistics(sizeInBytes=368.1 PiB)<br />+- Aggregate [s_date#18], [s_date#18, sum(cast((s_quantity#17 * i_price#20) as bigint)) AS total_sales#35L], Statistics(sizeInBytes=368.1 PiB)<br /> +- Project [s_quantity#17, s_date#18, i_price#20], Statistics(sizeInBytes=368.1 PiB)<br /> +- Join Inner, (cast(s_item_id#16 as bigint) = i_item_id#19L), Statistics(sizeInBytes=589.0 PiB)<br /> :- Filter isnotnull(s_item_id#16), Statistics(sizeInBytes=3.7 GiB)<br /> : +- Relation[s_item_id#16,s_quantity#17,s_date#18] parquet, Statistics(sizeInBytes=3.7 GiB)<br /> +- Filter ((isnotnull(i_price#20) AND (i_price#20 < 10)) AND isnotnull(i_item_id#19L)), Statistics(sizeInBytes=157.6 MiB)<br /> +- Relation[i_item_id#19L,i_price#20] parquet, Statistics(sizeInBytes=157.6 MiB)<br /><br />== Physical Plan ==<br />AdaptiveSparkPlan isFinalPlan=false<br />+- Sort [total_sales#35L DESC NULLS LAST], true, 0<br /> +- Exchange rangepartitioning(total_sales#35L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#177]<br /> +- HashAggregate(keys=[s_date#18], functions=[sum(cast((s_quantity#17 * i_price#20) as bigint))], output=[s_date#18, total_sales#35L])<br /> +- Exchange hashpartitioning(s_date#18, 200), ENSURE_REQUIREMENTS, [id=#174]<br /> +- HashAggregate(keys=[s_date#18], functions=[partial_sum(cast((s_quantity#17 * i_price#20) as bigint))], output=[s_date#18, sum#44L])<br /> +- Project [s_quantity#17, s_date#18, i_price#20]<br /> +- SortMergeJoin [cast(s_item_id#16 as bigint)], [i_item_id#19L], Inner<br /> :- Sort [cast(s_item_id#16 as bigint) ASC NULLS FIRST], false, 0<br /> : +- Exchange hashpartitioning(cast(s_item_id#16 as bigint), 200), ENSURE_REQUIREMENTS, [id=#166]<br /> : +- Filter isnotnull(s_item_id#16)<br /> : +- FileScan parquet aqe_demo_db.sales[s_item_id#16,s_quantity#17,s_date#18] Batched: true, DataFilters: [isnotnull(s_item_id#16)], Format: Parquet, Location: InMemoryFileIndex[file:/home/xxx/data/warehouse/aqe_demo_db.db/sales], PartitionFilters: [], PushedFilters: [IsNotNull(s_item_id)], ReadSchema: struct<s_item_id:int,s_quantity:int,s_date:date><br /> +- Sort [i_item_id#19L ASC NULLS FIRST], false, 0<br /> +- Exchange hashpartitioning(i_item_id#19L, 200), ENSURE_REQUIREMENTS, [id=#167]<br /> +- Filter ((isnotnull(i_price#20) AND (i_price#20 < 10)) AND isnotnull(i_item_id#19L))<br /> +- FileScan parquet aqe_demo_db.items[i_item_id#19L,i_price#20] Batched: true, DataFilters: [isnotnull(i_price#20), (i_price#20 < 10), isnotnull(i_item_id#19L)], Format: Parquet, Location: InMemoryFileIndex[file:/home/xxx/data/warehouse/aqe_demo_db.db/items], PartitionFilters: [], PushedFilters: [IsNotNull(i_price), LessThan(i_price,10), IsNotNull(i_item_id)], ReadSchema: struct<i_item_id:bigint,i_price:int></pre>
<p>When the query is running, if we check the UI, it initially still shows SMJ.</p><p>As I mentioned in this post <a href="http://www.openkb.info/2021/02/spark-tuning-explaining-spark-sql-join.html" target="_blank">Spark Tuning -- explaining Spark SQL Join Types</a>, SMJ actually has 3 steps -- shuffle, sort and merge. </p><p>So after the shuffle is done, Spark realizes that the smaller side of the join is actually 6.9MB, which is smaller than the default <b><i>spark.sql.autoBroadcastJoinThreshold </i></b>(10MB). As a result, AQE tells Spark to change the plan from SMJ to BHJ at runtime. </p><p>Since the shuffle is already done (otherwise, Spark won't know the real size of the smaller side), this is why the tuning guide says "This is not as efficient as planning a broadcast hash join in the first place".</p><p>But anyway, it avoids the remaining steps of SMJ -- sort and merge, so it should still be better than a complete SMJ. </p><p>Since the shuffle writes are already done and the remaining work is just a BHJ, Spark is smart enough to fetch the data from those shuffle files using a "local mode", because <b><i>spark.sql.adaptive.localShuffleReader.enabled</i></b> is true by default.</p><p>So from the UI, you will find extra "<span style="color: red;">CustomShuffleReader</span>"s in local mode, which avoids network traffic:<br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDMkvZHrxwdsFgqWdZIgAOKmnJq-fWnH2lClMMhUJHVYePcm7fB8CDt7yB13BepwJ0j-9gFRvfYx_lFQn6jUT_1e6gxClaoaz6N07EF6zNJDYa170im8AyF1Ee4KvXSD5-mtA0ourehlM/s756/Screen+Shot+2021-03-17+at+10.20.16+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="668" data-original-width="756" height="566" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDMkvZHrxwdsFgqWdZIgAOKmnJq-fWnH2lClMMhUJHVYePcm7fB8CDt7yB13BepwJ0j-9gFRvfYx_lFQn6jUT_1e6gxClaoaz6N07EF6zNJDYa170im8AyF1Ee4KvXSD5-mtA0ourehlM/w640-h566/Screen+Shot+2021-03-17+at+10.20.16+PM.png" width="640" /></a></div><br /><p></p><p>The graph below is from <a href="https://dev.to/yaooqinn/how-to-use-spark-adaptive-query-execution-aqe-in-kyuubi-2ek2" rel="nofollow" target="_blank">this blog</a> and explains this local shuffle:<br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh722rCbUWZGIhOsr_XW-IOoQg0jfryKH-_GtR8FH_NybBs-Jif3C8erKh-Vvhyphenhyphena0b96wVUJtdIY5bamikoUjUEJtDCvWK9WCsDBPHzO3wKQ_wE_mG8x95Ya6_aRrMQs80suwYNLfuAOKY/s880/a4lh5rl2xtbi6q55ba1o.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="591" data-original-width="880" height="430" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh722rCbUWZGIhOsr_XW-IOoQg0jfryKH-_GtR8FH_NybBs-Jif3C8erKh-Vvhyphenhyphena0b96wVUJtdIY5bamikoUjUEJtDCvWK9WCsDBPHzO3wKQ_wE_mG8x95Ya6_aRrMQs80suwYNLfuAOKY/w640-h430/a4lh5rl2xtbi6q55ba1o.png" width="640" /></a></div>Note also that the # of partitions from the local shuffle reads = the # of upstream map tasks. <p></p><p style="text-align: left;">In our case, it is 30 and 4. (I will compare these numbers with the next test.)<br /></p><h3 style="text-align: left;">3. AQE on but local shuffle reader is disabled </h3>
<pre class="brush:sql; toolbar: false; auto-links: false">set spark.sql.adaptive.enabled=true;<br />set spark.sql.adaptive.localShuffleReader.enabled=false;</pre>
<p style="text-align: left;"></p><p>This is for testing purpose, and we should not disable local shuffle reader as always.<br /></p><p>The reason why I disable it is to show the shuffle reader statistics differences comparing to #2:<br /></p><p> </p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfXTDvm-sQHhizJ5N1FyX7JAWc47i6-SwCpL7THbcs5TjOTSS96Kc8-QR_3HTPwOExNVFnsgP3KQqJQsysNHGHJ50IIq17OQb3uEgJ81sv5G3kNxL8rF8s5EnuMvrjvI-w7UY1zEST8Zs/s804/Screen+Shot+2021-03-17+at+10.28.36+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="643" data-original-width="804" height="512" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfXTDvm-sQHhizJ5N1FyX7JAWc47i6-SwCpL7THbcs5TjOTSS96Kc8-QR_3HTPwOExNVFnsgP3KQqJQsysNHGHJ50IIq17OQb3uEgJ81sv5G3kNxL8rF8s5EnuMvrjvI-w7UY1zEST8Zs/w640-h512/Screen+Shot+2021-03-17+at+10.28.36+PM.png" width="640" /></a></div>Now it is shown as "<span style="color: red;">CustomShuffleReader coalesced</span>".<p></p><p>And also the # of partition changed to 52 and 5 from 30 and 4.<br /></p><h3 style="text-align: left;">4. GPU Mode with AQE on <br /></h3><p>Now let's try the same minimum query using <a href="https://nvidia.github.io/spark-rapids/" rel="nofollow" target="_blank">Rapids for Spark Accelerator</a>(current release 0.3) + Spark to see what is the query plan under GPU.</p><p>Explain plan output looks as CPU plan, but do not worry, the actual plan is GPU plan:<br /></p>
<pre class="brush:sql; toolbar: false; auto-links: false;highlight: [8,12,19]">== Optimized Logical Plan ==<br />Sort [total_sales#20L DESC NULLS LAST], true, Statistics(sizeInBytes=368.1 PiB)<br />+- Aggregate [s_date#28], [s_date#28, sum(cast((s_quantity#27 * i_price#30) as bigint)) AS total_sales#20L], Statistics(sizeInBytes=368.1 PiB)<br /> +- Project [s_quantity#27, s_date#28, i_price#30], Statistics(sizeInBytes=368.1 PiB)<br /> +- Join Inner, (cast(s_item_id#26 as bigint) = i_item_id#29L), Statistics(sizeInBytes=589.0 PiB)<br /> :- Filter isnotnull(s_item_id#26), Statistics(sizeInBytes=3.7 GiB)<br /> : +- Relation[s_item_id#26,s_quantity#27,s_date#28] parquet, Statistics(sizeInBytes=3.7 GiB)<br /> +- Filter ((isnotnull(i_price#30) AND (i_price#30 < 10)) AND isnotnull(i_item_id#29L)), Statistics(sizeInBytes=157.6 MiB)<br /> +- Relation[i_item_id#29L,i_price#30] parquet, Statistics(sizeInBytes=157.6 MiB)<br /><br />== Physical Plan ==<br />AdaptiveSparkPlan isFinalPlan=false<br />+- Sort [total_sales#20L DESC NULLS LAST], true, 0<br /> +- Exchange rangepartitioning(total_sales#20L DESC NULLS LAST, 2), ENSURE_REQUIREMENTS, [id=#73]<br /> +- HashAggregate(keys=[s_date#28], functions=[sum(cast((s_quantity#27 * i_price#30) as bigint))], output=[s_date#28, total_sales#20L])<br /> +- Exchange hashpartitioning(s_date#28, 2), ENSURE_REQUIREMENTS, [id=#70]<br /> +- HashAggregate(keys=[s_date#28], functions=[partial_sum(cast((s_quantity#27 * i_price#30) as bigint))], output=[s_date#28, sum#34L])<br /> +- Project [s_quantity#27, s_date#28, i_price#30]<br /> +- SortMergeJoin [cast(s_item_id#26 as bigint)], [i_item_id#29L], Inner<br /> :- Sort [cast(s_item_id#26 as bigint) ASC NULLS FIRST], false, 0<br /> : +- Exchange hashpartitioning(cast(s_item_id#26 as bigint), 2), ENSURE_REQUIREMENTS, [id=#62]<br /> : +- Filter isnotnull(s_item_id#26)<br /> : +- FileScan parquet aqe_demo_db.sales[s_item_id#26,s_quantity#27,s_date#28] Batched: true, DataFilters: [isnotnull(s_item_id#26)], Format: Parquet, Location: InMemoryFileIndex[file:/home/xxx/data/warehouse/aqe_demo_db.db/sales], PartitionFilters: [], PushedFilters: [IsNotNull(s_item_id)], ReadSchema: struct<s_item_id:int,s_quantity:int,s_date:date><br /> +- Sort [i_item_id#29L ASC NULLS FIRST], false, 0<br /> +- Exchange hashpartitioning(i_item_id#29L, 2), ENSURE_REQUIREMENTS, [id=#63]<br /> +- Filter ((isnotnull(i_price#30) AND (i_price#30 < 10)) AND isnotnull(i_item_id#29L))<br /> +- FileScan parquet aqe_demo_db.items[i_item_id#29L,i_price#30] Batched: true, DataFilters: [isnotnull(i_price#30), (i_price#30 < 10), isnotnull(i_item_id#29L)], Format: Parquet, Location: InMemoryFileIndex[file:/home/xxx/data/warehouse/aqe_demo_db.db/items], PartitionFilters: [], PushedFilters: [IsNotNull(i_price), LessThan(i_price,10), IsNotNull(i_item_id)], ReadSchema: struct<i_item_id:bigint,i_price:int></pre>
<p>If we actually run this query, here is the actual final plan shown in UI:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTW9yyzPeDpgGUWyIiSjN8lbIGyAoIQcbHe6JMfeW4xNYf47GlIqz7rTflbioiMxpXhQHSvgvjLXu0f1dBDDF80lRsy5pXItERJ_tZkNxv-KkiOG-IULqUGqolhivdjRmyvW_H1g6Dm3Y/s1048/Screen+Shot+2021-03-17+at+10.35.52+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1048" data-original-width="797" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTW9yyzPeDpgGUWyIiSjN8lbIGyAoIQcbHe6JMfeW4xNYf47GlIqz7rTflbioiMxpXhQHSvgvjLXu0f1dBDDF80lRsy5pXItERJ_tZkNxv-KkiOG-IULqUGqolhivdjRmyvW_H1g6Dm3Y/w486-h640/Screen+Shot+2021-03-17+at+10.35.52+PM.png" width="486" /></a></div>The key things to look at here is the "<span style="color: red;">GpuCustomShuffleReader local</span>" and also the # of local shuffle partitions = 30 and 4 which matches the # of upstream map tasks. <p>Note that in GPU mode, all the data size are smaller than CPU mode.</p><p>For example, now the smaller side of join in GPU mode is only 3.4MB now:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhjJhCm0luPZPiDIUCmirTWakyWDiBSodi0YfRx5BtbImiNdCPBhZfNxWCnxkHLlkRD-sK4uQbHzp3oNXpkG4_DZd_Ba2uyxEdOptBp6-buetlWwdNaUr9rlQXija1y6JM5TLslHPOmbWI/s390/Screen+Shot+2021-03-17+at+10.44.52+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="381" data-original-width="390" height="626" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhjJhCm0luPZPiDIUCmirTWakyWDiBSodi0YfRx5BtbImiNdCPBhZfNxWCnxkHLlkRD-sK4uQbHzp3oNXpkG4_DZd_Ba2uyxEdOptBp6-buetlWwdNaUr9rlQXija1y6JM5TLslHPOmbWI/w640-h626/Screen+Shot+2021-03-17+at+10.44.52+PM.png" width="640" /></a></div><br /><p>It means, we can even set <b><i>spark.sql.autoBroadcastJoinThreshold</i></b>=4194304(4MB), it will still be converted to a BHJ under AQE.</p><p>And the shuffle writes/reads size are also smaller than CPU mode. 
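<p>One more note: besides the UI, you can also confirm from spark-shell whether AQE switched the join strategy. The snippet below is only a sketch of my experience with Spark 3.0.x (it assumes the demo tables are in the current database): with AQE on, explain() before execution still shows isFinalPlan=false with a SortMergeJoin, while explain() on the same Dataset after it has actually run prints the re-optimized final plan:</p>
<pre class="brush:scala; toolbar: false; auto-links: false">// Sketch: compare the initial adaptive plan with the final plan after execution.
val q = spark.sql("""
  SELECT s_date, sum(s_quantity * i_price) AS total_sales
  FROM sales JOIN items ON s_item_id = i_item_id
  WHERE i_price < 10
  GROUP BY s_date
  ORDER BY total_sales DESC""")

q.explain()   // AdaptiveSparkPlan isFinalPlan=false ... SortMergeJoin
q.collect()   // run the query so AQE can re-plan using runtime statistics
q.explain()   // AdaptiveSparkPlan isFinalPlan=true ... should now show BroadcastHashJoin</pre>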
<br /></p><h1 style="text-align: left;">Reference:</h1><ul><li><a href="https://docs.databricks.com/_static/notebooks/aqe-demo.html?_ga=2.133851022.1405204434.1615827502-183867879.1614812672" rel="nofollow" target="_blank">https://docs.databricks.com/_static/notebooks/aqe-demo.html?_ga=2.133851022.1405204434.1615827502-183867879.1614812672</a> </li><li><a href="https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html" rel="nofollow" target="_blank">https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html</a> </li><li><a href="https://spark.apache.org/docs/latest/sql-performance-tuning.html" rel="nofollow" target="_blank">https://spark.apache.org/docs/latest/sql-performance-tuning.html</a> </li><li><a href="https://dev.to/yaooqinn/how-to-use-spark-adaptive-query-execution-aqe-in-kyuubi-2ek2" rel="nofollow" target="_blank">https://dev.to/yaooqinn/how-to-use-spark-adaptive-query-execution-aqe-in-kyuubi-2ek2 </a><br /></li></ul><p> <br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-83918258805313539292021-03-16T15:15:00.003-07:002021-03-18T22:53:01.227-07:00Spark Tuning -- Adaptive Query Execution(1): Dynamically coalescing shuffle partitions<h1 style="text-align: left;">Goal:</h1><p>This article explains Adaptive Query Execution (AQE)'s "Dynamically coalescing shuffle partitions" feature introduced in Spark 3.0.</p><p><span></span></p><a name='more'></a><p></p><h1 style="text-align: left;">Env:</h1><p> Spark 3.0.2<br /></p><h1 style="text-align: left;">Concept:</h1><p>Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default.Spark SQL can use the umbrella configuration of <b><i>spark.sql.adaptive.enabled</i></b> to control whether turn it on/off. </p><p>In AQE on Spark 3.0, there are 3 features as below:<br /></p><ul style="text-align: left;"><li>Dynamically coalescing shuffle partitions</li><li><a href="http://www.openkb.info/2021/03/spark-tuning-adaptive-query-execution2.html" target="_blank">Dynamically switching join strategies</a></li><li><a href="http://www.openkb.info/2021/03/spark-tuning-adaptive-query-execution1_18.html" target="_blank">Dynamically optimizing skew joins </a><br /></li></ul><p>This article focuses on 1st feature "Dynamically coalescing shuffle partitions". <br /></p><p>This feature coalesces the post shuffle partitions based on the map output statistics when both <b><i>spark.sql.adaptive.enabled</i></b> and <b><i>spark.sql.adaptive.coalescePartitions.enabled</i></b> configurations are true. </p><p>In below test, we will change <b><i>spark.sql.adaptive.coalescePartitions.minPartitionNum</i></b> to 1 which controls the minimum number of shuffle partitions after coalescing. 
If we do not decrease it, its default value is the same as <b><i>spark.sql.shuffle.partitions</i></b> (which is 200 by default).</p><p>Another important setting is <b><i>spark.sql.adaptive.advisoryPartitionSizeInBytes</i></b> (default 64MB) which controls the advisory size in bytes of the shuffle partition during adaptive optimization.</p><p>Please refer to <a href="https://spark.apache.org/docs/latest/sql-performance-tuning.html" rel="nofollow" target="_blank">Spark Performance Tuning guide</a> for details on all other related parameters.<br /></p><h1 style="text-align: left;">Solution:</h1><p>The databricks blog "<a href="https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html" rel="nofollow" target="_blank">Adaptive Query Execution: Speeding Up Spark SQL at Runtime</a>" has a pretty good <a href="https://docs.databricks.com/_static/notebooks/aqe-demo.html?_ga=2.133851022.1405204434.1615827502-183867879.1614812672" rel="nofollow" target="_blank">demo notebook</a> which I will use for the following tests.</p><p>I will run the simple group-by query below, based on the tables created per the demo instructions above, in different modes:</p>
<pre class="brush:sql; toolbar: false; auto-links: false">use aqe_demo_db;<br /><br />SELECT s_date, sum(s_quantity) AS q<br />FROM sales<br />GROUP BY s_date<br />ORDER BY q DESC;<br /></pre>
<h3 style="text-align: left;">1. Default settings without AQE<br /></h3><p>Explain plan:</p>
<pre class="brush:sql; toolbar: false; auto-links: false">*(3) Sort [q#10L DESC NULLS LAST], true, 0<br />+- Exchange rangepartitioning(q#10L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#79]<br /> +- *(2) HashAggregate(keys=[s_date#3], functions=[sum(cast(s_quantity#2 as bigint))], output=[s_date#3, q#10L])<br /> +- Exchange hashpartitioning(s_date#3, 200), ENSURE_REQUIREMENTS, [id=#75]<br /> +- *(1) HashAggregate(keys=[s_date#3], functions=[partial_sum(cast(s_quantity#2 as bigint))], output=[s_date#3, sum#19L])<br /> +- *(1) ColumnarToRow<br /> +- FileScan parquet aqe_demo_db.sales[s_quantity#2,s_date#3] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/home/xxx/data/warehouse/aqe_demo_db.db/sales], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<s_quantity:int,s_date:date></pre>Let's focus on the 1st pair of HashAggregate and Exchange in which we can examine the shuffle read and shuffle write size for each task. <br /><p>As per UI:<br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCd-99Mj2a9q9snDbF6iFoqgIpDxMGg4DuDG0zPNnlTtpVUEkPJvLXrsdtnuxIzCAzMqqy2qrjDYudgtE7Wo7hA54BiJ2x-jtImgq8ORQE1lTDEp8YBLp6A6PeDMSeK58EgUjJBRDK2ro/s794/Screen+Shot+2021-03-16+at+1.14.27+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="644" data-original-width="794" height="520" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCd-99Mj2a9q9snDbF6iFoqgIpDxMGg4DuDG0zPNnlTtpVUEkPJvLXrsdtnuxIzCAzMqqy2qrjDYudgtE7Wo7hA54BiJ2x-jtImgq8ORQE1lTDEp8YBLp6A6PeDMSeK58EgUjJBRDK2ro/w640-h520/Screen+Shot+2021-03-16+at+1.14.27+PM.png" width="640" /></a></div><p></p><p>The shuffle writes per task is around 13KB which is too small for each task to process after that. 
<br /></p><p>Let's look at stage level metrics for stage 0 and stage 1 as per above UI.<br /></p><p>Stage 0's Shuffle Write Size: Avg 12.9KB , 30 tasks<br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtSD-tXDHa1pjGq5i4bVuAVPmg1RuDehkALXoWqaY1BSEbggZ6VQdy54oD7m_LwyiQv-1Xu_QLVFPFlbdSrCN-KtOlIHIBL6oKJbtnkeLUDs5CFNMkzxGLRSPeczHy7Toov_L7oIADCjs/s3502/Screen+Shot+2021-03-16+at+1.17.04+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="422" data-original-width="3502" height="77" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtSD-tXDHa1pjGq5i4bVuAVPmg1RuDehkALXoWqaY1BSEbggZ6VQdy54oD7m_LwyiQv-1Xu_QLVFPFlbdSrCN-KtOlIHIBL6oKJbtnkeLUDs5CFNMkzxGLRSPeczHy7Toov_L7oIADCjs/w640-h77/Screen+Shot+2021-03-16+at+1.17.04+PM.png" width="640" /></a></div><br /> Stage 1's Shuffle Read Size: Avg 2.3KB, 200 tasks<br /><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhc_jnICauTrckIudRIXuOo31SwVvQe8fqQCdJPbs85b2HXU1HdWazUQdCCkCjLEXldWpflY3El8pUyOaZeu6p0uCle5jpngpkQNaSep1UswoTr_OAKTvrNJ1ytJtv7buBFJV9H7Zy6uhI/s3486/Screen+Shot+2021-03-16+at+1.19.50+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="358" data-original-width="3486" height="66" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhc_jnICauTrckIudRIXuOo31SwVvQe8fqQCdJPbs85b2HXU1HdWazUQdCCkCjLEXldWpflY3El8pUyOaZeu6p0uCle5jpngpkQNaSep1UswoTr_OAKTvrNJ1ytJtv7buBFJV9H7Zy6uhI/w640-h66/Screen+Shot+2021-03-16+at+1.19.50+PM.png" width="640" /></a></div><div style="text-align: left;">Here is the final plan from UI(for comparison later):</div><pre class="brush:sql; toolbar: false; auto-links: false">== Physical Plan ==
* Sort (7)
+- Exchange (6)
+- * HashAggregate (5)
+- Exchange (4)
+- * HashAggregate (3)
+- * ColumnarToRow (2)
+- Scan parquet aqe_demo_db.sales (1)</pre><h3 style="text-align: left;">2. Default settings with AQE on<br /></h3>
<pre class="brush:sql; toolbar: false; auto-links: false">set spark.sql.adaptive.enabled = true;<br />set spark.sql.adaptive.coalescePartitions.minPartitionNum = 1;</pre>
<p>Explain plan:</p>
<pre class="brush:sql; toolbar: false; auto-links: false;highlight: 2">== Physical Plan ==<br />AdaptiveSparkPlan isFinalPlan=false<br />+- Sort [q#34L DESC NULLS LAST], true, 0<br /> +- Exchange rangepartitioning(q#34L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#119]<br /> +- HashAggregate(keys=[s_date#23], functions=[sum(cast(s_quantity#22 as bigint))], output=[s_date#23, q#34L])<br /> +- Exchange hashpartitioning(s_date#23, 200), ENSURE_REQUIREMENTS, [id=#116]<br /> +- HashAggregate(keys=[s_date#23], functions=[partial_sum(cast(s_quantity#22 as bigint))], output=[s_date#23, sum#43L])<br /> +- FileScan parquet aqe_demo_db.sales[s_quantity#22,s_date#23] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/home/xxx/data/warehouse/aqe_demo_db.db/sales], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<s_quantity:int,s_date:date></pre>
<p>Notice the keyword "<span style="color: red;">AdaptiveSparkPlan</span>"; but as it indicates, this is not the final plan yet.<br /></p><p>Let's focus on the 1st pair of HashAggregate and Exchange in which we
can examine the shuffle read and shuffle write size for each task. </p><p>As per UI:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhuY_fxFLfR3zJY7MXUTR7cfRtpX2Igu-rqmPIR5_0l9l3p2jobgYUhhYLZ7tFqwSFhZWbLTV8ZQiToAmqvsB04FBRVxf3zTJVdLjvAbGT8qI9g4iRs1aOFQDXfrroRaDJXoAkCSJQ5xc8/s956/Screen+Shot+2021-03-16+at+2.38.52+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="956" data-original-width="930" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhuY_fxFLfR3zJY7MXUTR7cfRtpX2Igu-rqmPIR5_0l9l3p2jobgYUhhYLZ7tFqwSFhZWbLTV8ZQiToAmqvsB04FBRVxf3zTJVdLjvAbGT8qI9g4iRs1aOFQDXfrroRaDJXoAkCSJQ5xc8/w622-h640/Screen+Shot+2021-03-16+at+2.38.52+PM.png" width="622" /></a></div>Now there is an extra "<span style="color: red;">CustomShuffleReader</span>" operator which coalesces the partitions to only 1 because the total partition data size is only 400KB.<p></p><p>Let's look at stage level metrics for stage 0 and stage 2 as per above UI.</p><p>Stage 0's Shuffle Write Size: Avg 12.9KB , 30 tasks(no change)</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigE3qKX2ttH6cUbUX9cS4OQcDS5QtO45NrrqC5O9Eqh8NtID5LvVS3bbbUYyqgLZ1FnQfZOC1loVH7y8v-PA9mWPBmIDv24_n_RIWAaudYjyB9fLXPnw0kKj8zVm0LcnQIRkXfo1_mXCE/s3486/Screen+Shot+2021-03-16+at+2.43.04+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="424" data-original-width="3486" height="78" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigE3qKX2ttH6cUbUX9cS4OQcDS5QtO45NrrqC5O9Eqh8NtID5LvVS3bbbUYyqgLZ1FnQfZOC1loVH7y8v-PA9mWPBmIDv24_n_RIWAaudYjyB9fLXPnw0kKj8zVm0LcnQIRkXfo1_mXCE/w640-h78/Screen+Shot+2021-03-16+at+2.43.04+PM.png" width="640" /></a></div> Stage 2's Shuffle Read Size: 386.6KB, 1 task<br /> <div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjw8PZFxTLwiaa6z7i2tcBl3pFVOu7xdaRWzj4eOfaec7lV26NPQRbi_k8Cd4bFTBdtsAJXFlF27-bxN5xYsAM2j4kzOx-EWk8EBadG-Xzl8itaT_4wx2mloo-j8ZfLFB-jDDjlCBRJPHs/s3490/Screen+Shot+2021-03-16+at+2.44.35+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="366" data-original-width="3490" height="68" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjw8PZFxTLwiaa6z7i2tcBl3pFVOu7xdaRWzj4eOfaec7lV26NPQRbi_k8Cd4bFTBdtsAJXFlF27-bxN5xYsAM2j4kzOx-EWk8EBadG-Xzl8itaT_4wx2mloo-j8ZfLFB-jDDjlCBRJPHs/w640-h68/Screen+Shot+2021-03-16+at+2.44.35+PM.png" width="640" /></a></div><p>So basically AQE combines all of the 200 partitions into 1.</p><p>Here is the final plan from UI which shows as below which you can find "<span style="color: red;">CustomShuffleReader</span>" keywords.<br /></p><pre class="brush:sql; toolbar: false; auto-links: false;highlight: [2,5,9]">== Physical Plan ==<br />AdaptiveSparkPlan (12)<br />+- == Final Plan ==<br /> * Sort (11)<br /> +- CustomShuffleReader (10)<br /> +- ShuffleQueryStage (9)<br /> +- Exchange (8)<br /> +- * HashAggregate (7)<br /> +- CustomShuffleReader (6)<br /> +- ShuffleQueryStage (5)<br /> +- Exchange (4)<br /> +- * HashAggregate (3)<br /> +- * ColumnarToRow (2)<br /> +- Scan parquet aqe_demo_db.sales (1)</pre><p></p><h3 style="text-align: left;">3. Modified settings with AQE on</h3>
<pre class="brush:sql; toolbar: false; auto-links: false;highlight: 3">set spark.sql.adaptive.enabled = true;<br />set spark.sql.adaptive.coalescePartitions.minPartitionNum = 1;<br />set spark.sql.adaptive.advisoryPartitionSizeInBytes = 65536;</pre>
<p>Here we just changed <b><i>spark.sql.adaptive.advisoryPartitionSizeInBytes</i></b> from default 64MB to 64KB, so that we can tune the target # of partitions.<br /></p><p>The explain plan is the same as #2. </p><p>The only difference is the # of partitions becomes 7 in Stage 2 now:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieAP3B2bl76XxZO8p4Sox4uua5kgPG-KsVsugUmNwZByZahnBs6c3GgIzuW8KaEKN2Z4Q2L2s53lkE-pmN-Td9GV_eLNZj4r5YFLg-z2eAVOhhsiIVcwVtpeNA7VpHH2FZzQp4kOT3MVE/s3464/Screen+Shot+2021-03-16+at+2.52.58+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="364" data-original-width="3464" height="68" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieAP3B2bl76XxZO8p4Sox4uua5kgPG-KsVsugUmNwZByZahnBs6c3GgIzuW8KaEKN2Z4Q2L2s53lkE-pmN-Td9GV_eLNZj4r5YFLg-z2eAVOhhsiIVcwVtpeNA7VpHH2FZzQp4kOT3MVE/w640-h68/Screen+Shot+2021-03-16+at+2.52.58+PM.png" width="640" /></a></div><h3 style="text-align: left;">4. GPU Mode with AQE on(default settings)<br /></h3><p>Now let's try the same minimum query using <a href="https://nvidia.github.io/spark-rapids/" rel="nofollow" target="_blank">Rapids for Spark Accelerator</a>(current release 0.3) + Spark to see what is the query plan under GPU:</p><p>The explain plan may look as normal CPU plan because AQE is on, but actually if you run it, it will show you the correct final plan.</p><p>Explain plan:</p>
<pre class="brush:sql; toolbar: false; auto-links: false">== Physical Plan ==<br />AdaptiveSparkPlan isFinalPlan=false<br />+- Sort [q#20L DESC NULLS LAST], true, 0<br /> +- Exchange rangepartitioning(q#20L DESC NULLS LAST, 2), ENSURE_REQUIREMENTS, [id=#39]<br /> +- HashAggregate(keys=[s_date#28], functions=[sum(cast(s_quantity#27 as bigint))], output=[s_date#28, q#20L])<br /> +- Exchange hashpartitioning(s_date#28, 2), ENSURE_REQUIREMENTS, [id=#36]<br /> +- HashAggregate(keys=[s_date#28], functions=[partial_sum(cast(s_quantity#27 as bigint))], output=[s_date#28, sum#32L])<br /> +- FileScan parquet aqe_demo_db.sales[s_quantity#27,s_date#28] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/home/xxx/data/warehouse/aqe_demo_db.db/sales], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<s_quantity:int,s_date:date></pre>
<p>Final Plan from UI:<br /></p>
<pre class="brush:sql; toolbar: false; auto-links: false;highlight: [2,8,13]">== Physical Plan ==
AdaptiveSparkPlan (15)
+- == Final Plan ==
GpuColumnarToRow (14)
+- GpuSort (13)
+- GpuCoalesceBatches (12)
+- GpuShuffleCoalesce (11)
+- GpuCustomShuffleReader (10)
+- ShuffleQueryStage (9)
+- GpuColumnarExchange (8)
+- GpuHashAggregate (7)
+- GpuShuffleCoalesce (6)
+- GpuCustomShuffleReader (5)
+- ShuffleQueryStage (4)
+- GpuColumnarExchange (3)
+- GpuHashAggregate (2)
+- GpuScan parquet aqe_demo_db.sales (1)</pre>
<p>Stage 0's Shuffle Write Size: Avg 3.2KB , 30 tasks(huge decrease due to columnar storage processing)</p><p> </p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-cMzqJpecJb7gW4UyGeBGn3HK6HmBrQaLYL3hXbULBEpfc7N2s4nwUO9VN1hfCyBpnPHjDyX01cibrkAbxeYHnVkBhUKZTJraHllavMbShWBw2aPkNSBpfbrPchMS0NWquGd8lYZE9oA/s3510/Screen+Shot+2021-03-16+at+3.08.43+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="422" data-original-width="3510" height="76" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-cMzqJpecJb7gW4UyGeBGn3HK6HmBrQaLYL3hXbULBEpfc7N2s4nwUO9VN1hfCyBpnPHjDyX01cibrkAbxeYHnVkBhUKZTJraHllavMbShWBw2aPkNSBpfbrPchMS0NWquGd8lYZE9oA/w640-h76/Screen+Shot+2021-03-16+at+3.08.43+PM.png" width="640" /></a></div> Stage 2's Shuffle Read Size: 97.5KB, 1 task<p></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrnnJCcF6MN3o9gR5tzi6J_pXhgB95hRa47C5p4rlMOjDHXlasEbd4a2vDpV9mmDjUJOPkEOsE5xGLK9s14Gips1Ojm4cXTiWQLmchCDOcBC6pbX_iTfnxUTh-OsiubBGrT7X4OgYuHy4/s3508/Screen+Shot+2021-03-16+at+3.10.12+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="370" data-original-width="3508" height="68" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrnnJCcF6MN3o9gR5tzi6J_pXhgB95hRa47C5p4rlMOjDHXlasEbd4a2vDpV9mmDjUJOPkEOsE5xGLK9s14Gips1Ojm4cXTiWQLmchCDOcBC6pbX_iTfnxUTh-OsiubBGrT7X4OgYuHy4/w640-h68/Screen+Shot+2021-03-16+at+3.10.12+PM.png" width="640" /></a></div>Basically GPU mode can produce much less shuffle files which result in much less shuffle writes and reads.<br /><p></p><h1 style="text-align: left;">References:</h1><ul style="text-align: left;"><li><a href="https://docs.databricks.com/_static/notebooks/aqe-demo.html?_ga=2.133851022.1405204434.1615827502-183867879.1614812672" rel="nofollow" target="_blank">https://docs.databricks.com/_static/notebooks/aqe-demo.html?_ga=2.133851022.1405204434.1615827502-183867879.1614812672</a> </li><li><a href="https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html" rel="nofollow" target="_blank">https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html</a> </li><li><a href="https://spark.apache.org/docs/latest/sql-performance-tuning.html" rel="nofollow" target="_blank">https://spark.apache.org/docs/latest/sql-performance-tuning.html</a><br /></li></ul><p> </p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-61697148615293038112021-03-15T17:36:00.009-07:002021-03-15T22:11:34.866-07:00Spark Tuning -- Dynamic Partition Pruning<h1 style="text-align: left;">Goal:</h1><p>This article explains Dynamic Partition Pruning (DPP) feature introduced in Spark 3.0.<span></span></p><a name='more'></a><p></p><h1 style="text-align: left;">Env:</h1><p>Spark 3.0.2<br /></p><h1 style="text-align: left;">Concept:</h1><p>Dynamic Partition Pruning feature is introduced by <a href="https://issues.apache.org/jira/browse/SPARK-11150" rel="nofollow" target="_blank">SPARK-11150</a> .</p><p>This JIRA also provides a minimal query and its design for example:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a 
href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0NAhY-hZry4cJ4KdMX55Ema09OR1h3YXba7IV0cjMubKsqF-1jz9Huh3ZkuatyUe0mvDtrB-cOwmU3SweS6YXM4xn4jjAKdKhkk_g88-xHbhSZxf1mGhwZC9YpOJhYQSe4wd48Og7TLI/s1080/Screen+Shot+2021-03-15+at+3.47.58+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="498" data-original-width="1080" height="296" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0NAhY-hZry4cJ4KdMX55Ema09OR1h3YXba7IV0cjMubKsqF-1jz9Huh3ZkuatyUe0mvDtrB-cOwmU3SweS6YXM4xn4jjAKdKhkk_g88-xHbhSZxf1mGhwZC9YpOJhYQSe4wd48Og7TLI/w640-h296/Screen+Shot+2021-03-15+at+3.47.58+PM.png" width="640" /></a></div>Here let's assume: "t1" is a very large fact table with partition key column "pKey", and "t2" is a small dimension table. <p></p><p>Since there is a filter on "t2" -- "t2.id < 2", internally DPP can create a subquery: <br /></p>
<pre class="brush:sql; toolbar: false; auto-links: false">SELECT t2.pKey FROM t2 WHERE t2.id < 2;</pre>
<p>and then broadcast this sub-query result, so that we can use it to prune partitions of "t1". </p><p>In the meantime, the sub-query result is re-used. See the graph below from these <a href="https://www.slideshare.net/databricks/dynamic-partition-pruning-in-apache-spark" rel="nofollow" target="_blank">slides from Databricks</a>:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhptcchW8meTpL0UcN_9LFm6ymnz5WaS5KsZZYcO-ZRM5fv7BRk7mH3Lo_UMuD6o3kH3RptxDgcrbeaI7jPa2rm4GwQmcicKLPxaI2BfC1cCagek5eb8GUO6i84cZNFxKa7y_9n6zQOmNs/s2048/Screen+Shot+2021-03-15+at+4.09.04+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1442" data-original-width="2048" height="450" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhptcchW8meTpL0UcN_9LFm6ymnz5WaS5KsZZYcO-ZRM5fv7BRk7mH3Lo_UMuD6o3kH3RptxDgcrbeaI7jPa2rm4GwQmcicKLPxaI2BfC1cCagek5eb8GUO6i84cZNFxKa7y_9n6zQOmNs/w640-h450/Screen+Shot+2021-03-15+at+4.09.04+PM.png" width="640" /></a></div><p>As a result, we can avoid a lot of table scanning on the fact table side, which brings a huge performance gain.</p><p>The parameter to enable or disable DPP is:</p><ul style="text-align: left;"><li><b><i>spark.sql.optimizer.dynamicPartitionPruning.enabled </i></b>(true by default)<br /></li></ul><p>Spark is not the only product using DPP; some other query engines such as <a href="https://docs.cloudera.com/runtime/7.2.7/impala-reference/topics/impala-partition-pruning.html" rel="nofollow" target="_blank">Impala</a> and <a href="https://issues.apache.org/jira/browse/HIVE-7826" rel="nofollow" target="_blank">Hive on Tez</a> also have this feature.</p><h1 style="text-align: left;">Solution:</h1><h3 style="text-align: left;">1. CPU mode <br /></h3><p>Here is a simple example (run in spark-shell) which can help us check whether DPP is used or not:<br /></p>
<pre class="brush:sql; toolbar: false; auto-links: false">spark.range(1000).select(col("id"), col("id").as("k")).write.partitionBy("k").format("parquet").mode("overwrite").save("/tmp/myfact")<br />spark.range(100).select(col("id"), col("id").as("k")).write.format("parquet").mode("overwrite").save("/tmp/mydim")<br />spark.read.parquet("/tmp/myfact").createOrReplaceTempView("fact")<br />spark.read.parquet("/tmp/mydim").createOrReplaceTempView("dim")<br />sql("SELECT fact.id, fact.k FROM fact JOIN dim ON fact.k = dim.k AND dim.id < 2").explain</pre>
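<p>Besides eyeballing the plan output, a quick programmatic sanity check (just a sketch based on the same query) is to search the executed plan text for the DPP expression:</p><pre class="brush:scala; toolbar: false; auto-links: false">// Sketch: the executed plan text should contain "dynamicpruningexpression" when DPP kicks in<br />val planText = sql("SELECT fact.id, fact.k FROM fact JOIN dim ON fact.k = dim.k AND dim.id < 2").queryExecution.executedPlan.toString<br />println(planText.contains("dynamicpruningexpression"))  // expected: true with DPP enabled</pre>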
<p>The physical plan is:<br /></p><pre class="brush:sql; toolbar: false; auto-links: false;highlight: [6,7,8]">scala> sql("SELECT fact.id, fact.k FROM fact JOIN dim ON fact.k = dim.k AND dim.id < 2").explain<br />== Physical Plan ==<br />*(2) Project [id#14L, k#15]<br />+- *(2) BroadcastHashJoin [cast(k#15 as bigint)], [k#19L], Inner, BuildRight<br /> :- *(2) ColumnarToRow<br /> : +- FileScan parquet [id#14L,k#15] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[hdfs://nm:port/tmp/myfact], PartitionFilters: [isnotnull(k#15), dynamicpruningexpression(cast(k#15 as bigint) IN dynamicpruning#24)], PushedFilters: [], ReadSchema: struct<id:bigint><br /> : +- SubqueryBroadcast dynamicpruning#24, 0, [k#19L], [id=#118]<br /> : +- ReusedExchange [k#19L], BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true])), [id=#96]<br /> +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true])), [id=#96]<br /> +- *(1) Project [k#19L]<br /> +- *(1) Filter ((isnotnull(id#18L) AND (id#18L < 2)) AND isnotnull(k#19L))<br /> +- *(1) ColumnarToRow<br /> +- FileScan parquet [id#18L,k#19L] Batched: true, DataFilters: [isnotnull(id#18L), (id#18L < 2), isnotnull(k#19L)], Format: Parquet, Location: InMemoryFileIndex[hdfs://nm:port/tmp/mydim], PartitionFilters: [], PushedFilters: [IsNotNull(id), LessThan(id,2), IsNotNull(k)], ReadSchema: struct<id:bigint,k:bigint></pre>
<p> Let's compare it to a plan with DPP disabled:<br /></p>
<pre class="brush:sql; toolbar: false; auto-links: false">scala> sql("set spark.sql.optimizer.dynamicPartitionPruning.enabled=false")<br />res14: org.apache.spark.sql.DataFrame = [key: string, value: string]<br /><br />scala> sql("SELECT fact.id, fact.k FROM fact JOIN dim ON fact.k = dim.k AND dim.id < 2").explain<br />== Physical Plan ==<br />*(2) Project [id#35L, k#36]<br />+- *(2) BroadcastHashJoin [cast(k#36 as bigint)], [k#40L], Inner, BuildRight<br /> :- *(2) ColumnarToRow<br /> : +- FileScan parquet [id#35L,k#36] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[hdfs://nm:port/tmp/myfact], PartitionFilters: [isnotnull(k#36)], PushedFilters: [], ReadSchema: struct<id:bigint><br /> +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true])), [id=#288]<br /> +- *(1) Project [k#40L]<br /> +- *(1) Filter ((isnotnull(id#39L) AND (id#39L < 2)) AND isnotnull(k#40L))<br /> +- *(1) ColumnarToRow<br /> +- FileScan parquet [id#39L,k#40L] Batched: true, DataFilters: [isnotnull(id#39L), (id#39L < 2), isnotnull(k#40L)], Format: Parquet, Location: InMemoryFileIndex[hdfs://nm:port/tmp/mydim], PartitionFilters: [], PushedFilters: [IsNotNull(id), LessThan(id,2), IsNotNull(k)], ReadSchema: struct<id:bigint,k:bigint></pre>
<p>As you can see, when DPP is enabled, we can see the keywords "<span style="color: red;">ReusedExchange</span>" and "<span style="color: red;">SubqueryBroadcast</span>" before scanning the fact table. </p><p>In the fact table scan phase, there is the keyword "<span style="color: red;">dynamicpruningexpression</span>".<br /></p><p>If we let the query run with DPP enabled, then we can check the runtime query plan from the UI:<br /></p><p> </p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1HaQNhtFyaTz5cT8KYQuveWWnDj-IGurqS645q2uuHjuZv0C11XahE4ocjT9kCk0btYC2o92yH-4AbrEy3vjvsDG6HxaT-qZmz4aYA5pyKGsjm5XkpFdAI8s2Ior3PtjJJsJn52lzWdA/s1636/Screen+Shot+2021-03-15+at+4.28.51+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1636" data-original-width="702" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1HaQNhtFyaTz5cT8KYQuveWWnDj-IGurqS645q2uuHjuZv0C11XahE4ocjT9kCk0btYC2o92yH-4AbrEy3vjvsDG6HxaT-qZmz4aYA5pyKGsjm5XkpFdAI8s2Ior3PtjJJsJn52lzWdA/w274-h640/Screen+Shot+2021-03-15+at+4.28.51+PM.png" width="274" /></a></div><br /><p></p><p>Here you should notice the "<span style="color: red;">dynamic partition pruning time: 41 ms</span>" and also the "<span style="color: red;">number of partitions read: 2</span>", which means DPP is taking effect.<br /></p><p>Now let's take a look at a more complex example, q98 in TPCDS:<br /></p>
<pre class="brush:sql; toolbar: false; auto-links: false">select i_item_desc, i_category, i_class, i_current_price<br /> ,sum(ss_ext_sales_price) as itemrevenue<br /> ,sum(ss_ext_sales_price)*100/sum(sum(ss_ext_sales_price)) over<br /> (partition by i_class) as revenueratio<br />from<br /> store_sales, item, date_dim<br />where<br /> ss_item_sk = i_item_sk<br /> and i_category in ('Sports', 'Books', 'Home')<br /> and ss_sold_date_sk = d_date_sk<br /> and d_date between cast('1999-02-22' as date)<br /> and (cast('1999-02-22' as date) + interval '30' day)<br />group by<br /> i_item_id, i_item_desc, i_category, i_class, i_current_price<br />order by<br /> i_category, i_class, i_item_id, i_item_desc, revenueratio;</pre>
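<p>Before digging into the plan, one quick way to confirm the partition key of the fact table (a sketch, assuming "store_sales" is registered as a partitioned table in the "tpcds" database as shown in the plan below) is:</p><pre class="brush:scala; toolbar: false; auto-links: false">// Sketch: look for the "# Partition Information" section in the output<br />sql("DESCRIBE EXTENDED tpcds.store_sales").show(100, false)<br />// For a partitioned table this lists partition values such as ss_sold_date_sk=...<br />sql("SHOW PARTITIONS tpcds.store_sales").show(5, false)</pre>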
<p>We just need to focus on the fact table "store_sales" joining the dimension table "date_dim" on the join key "ss_sold_date_sk = d_date_sk". <br /></p><p>The column "ss_sold_date_sk" is also the partition key for "store_sales".</p><p>"date_dim" has a filter on column "d_date" to fetch only 30 days' worth of data.</p><p>Now the query plan is:</p>
<pre class="brush:sql; toolbar: false; auto-links: false;highlight: [19,20,21]">== Physical Plan ==<br />*(7) Project [i_item_desc#97, i_category#105, i_class#103, i_current_price#98, itemrevenue#159, revenueratio#160]<br />+- *(7) Sort [i_category#105 ASC NULLS FIRST, i_class#103 ASC NULLS FIRST, i_item_id#94 ASC NULLS FIRST, i_item_desc#97 ASC NULLS FIRST, revenueratio#160 ASC NULLS FIRST], true, 0<br /> +- Exchange rangepartitioning(i_category#105 ASC NULLS FIRST, i_class#103 ASC NULLS FIRST, i_item_id#94 ASC NULLS FIRST, i_item_desc#97 ASC NULLS FIRST, revenueratio#160 ASC NULLS FIRST, 20), true, [id=#490]<br /> +- *(6) Project [i_item_desc#97, i_category#105, i_class#103, i_current_price#98, itemrevenue#159, ((_w0#170 * 100.0) / _we0#172) AS revenueratio#160, i_item_id#94]<br /> +- Window [sum(_w1#171) windowspecdefinition(i_class#103, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS _we0#172], [i_class#103]<br /> +- *(5) Sort [i_class#103 ASC NULLS FIRST], false, 0<br /> +- Exchange hashpartitioning(i_class#103, 20), true, [id=#482]<br /> +- *(4) HashAggregate(keys=[i_item_id#94, i_item_desc#97, i_category#105, i_class#103, i_current_price#98], functions=[sum(ss_ext_sales_price#84)], output=[i_item_desc#97, i_category#105, i_class#103, i_current_price#98, itemrevenue#159, _w0#170, _w1#171, i_item_id#94])<br /> +- Exchange hashpartitioning(i_item_id#94, i_item_desc#97, i_category#105, i_class#103, i_current_price#98, 20), true, [id=#478]<br /> +- *(3) HashAggregate(keys=[i_item_id#94, i_item_desc#97, i_category#105, i_class#103, knownfloatingpointnormalized(normalizenanandzero(i_current_price#98)) AS i_current_price#98], functions=[partial_sum(ss_ext_sales_price#84)], output=[i_item_id#94, i_item_desc#97, i_category#105, i_class#103, i_current_price#98, sum#175])<br /> +- *(3) Project [ss_ext_sales_price#84, i_item_id#94, i_item_desc#97, i_current_price#98, i_class#103, i_category#105]<br /> +- *(3) BroadcastHashJoin [ss_sold_date_sk#92], [d_date_sk#115], Inner, BuildRight<br /> :- *(3) Project [ss_ext_sales_price#84, ss_sold_date_sk#92, i_item_id#94, i_item_desc#97, i_current_price#98, i_class#103, i_category#105]<br /> : +- *(3) BroadcastHashJoin [ss_item_sk#71], [i_item_sk#93], Inner, BuildRight<br /> : :- *(3) Project [ss_item_sk#71, ss_ext_sales_price#84, ss_sold_date_sk#92]<br /> : : +- *(3) Filter isnotnull(ss_item_sk#71)<br /> : : +- *(3) ColumnarToRow<br /> : : +- FileScan parquet tpcds.store_sales[ss_item_sk#71,ss_ext_sales_price#84,ss_sold_date_sk#92] Batched: true, DataFilters: [isnotnull(ss_item_sk#71)], Format: Parquet, Location: InMemoryFileIndex[hdfs://nm:port/data/tpcds_100G_parquet/store_sales/ss_sold_date_sk=24..., PartitionFilters: [isnotnull(ss_sold_date_sk#92), dynamicpruningexpression(ss_sold_date_sk#92 IN dynamicpruning#173)], PushedFilters: [IsNotNull(ss_item_sk)], ReadSchema: struct<ss_item_sk:int,ss_ext_sales_price:double><br /> : : +- SubqueryBroadcast dynamicpruning#173, 0, [d_date_sk#115], [id=#466]<br /> : : +- ReusedExchange [d_date_sk#115], BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#426]<br /> : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#417]<br /> : +- *(1) Project [i_item_sk#93, i_item_id#94, i_item_desc#97, i_current_price#98, i_class#103, i_category#105]<br /> : +- *(1) Filter (i_category#105 IN (Sports,Books,Home) AND isnotnull(i_item_sk#93))<br /> : +- *(1) ColumnarToRow<br /> : +- FileScan parquet 
tpcds.item[i_item_sk#93,i_item_id#94,i_item_desc#97,i_current_price#98,i_class#103,i_category#105] Batched: true, DataFilters: [i_category#105 IN (Sports,Books,Home), isnotnull(i_item_sk#93)], Format: Parquet, Location: InMemoryFileIndex[hdfs://nm:port/data/tpcds_100G_parquet/item], PartitionFilters: [], PushedFilters: [In(i_category, [Sports,Books,Home]), IsNotNull(i_item_sk)], ReadSchema: struct<i_item_sk:int,i_item_id:string,i_item_desc:string,i_current_price:double,i_class:string,i_...<br /> +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#426]<br /> +- *(2) Project [d_date_sk#115]<br /> +- *(2) Filter (((isnotnull(d_date#117) AND (d_date#117 >= 10644)) AND (d_date#117 <= 10674)) AND isnotnull(d_date_sk#115))<br /> +- *(2) ColumnarToRow<br /> +- FileScan parquet tpcds.date_dim[d_date_sk#115,d_date#117] Batched: true, DataFilters: [isnotnull(d_date#117), (d_date#117 >= 10644), (d_date#117 <= 10674), isnotnull(d_date_sk#115)], Format: Parquet, Location: InMemoryFileIndex[hdfs://nm:port/data/tpcds_100G_parquet/date_dim], PartitionFilters: [], PushedFilters: [IsNotNull(d_date), GreaterThanOrEqual(d_date,1999-02-22), LessThanOrEqual(d_date,1999-03-24), Is..., ReadSchema: struct<d_date_sk:int,d_date:date></pre>
<p>The key point is:<br /></p>
<pre class="brush:sql; toolbar: false; auto-links: false">: : +- FileScan parquet tpcds.store_sales[ss_item_sk#71,ss_ext_sales_price#84,ss_sold_date_sk#92] Batched: true, DataFilters: [isnotnull(ss_item_sk#71)], Format: Parquet, Location: InMemoryFileIndex[hdfs://nm:port/data/tpcds_100G_parquet/store_sales/ss_sold_date_sk=24..., PartitionFilters: [isnotnull(ss_sold_date_sk#92), dynamicpruningexpression(ss_sold_date_sk#92 IN dynamicpruning#173)], PushedFilters: [IsNotNull(ss_item_sk)], ReadSchema: struct<ss_item_sk:int,ss_ext_sales_price:double><br />: : +- SubqueryBroadcast dynamicpruning#173, 0, [d_date_sk#115], [id=#466]<br />: : +- ReusedExchange [d_date_sk#115], BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#426]</pre>
<p>The fact table scan has DPP enabled in "PartitionFilters: [isnotnull(ss_sold_date_sk#92), dynamicpruningexpression(ss_sold_date_sk#92 IN <span style="color: red;">dynamicpruning#173</span>)]".</p><p>"dynamicpruning#173" basically comes from the broadcasted sub-query.<br /></p><h3 style="text-align: left;">2. GPU mode <br /></h3><p>Now let's try the same minimal query using the <a href="https://nvidia.github.io/spark-rapids/" rel="nofollow" target="_blank">RAPIDS Accelerator for Apache Spark</a> (current release 0.3) + Spark to see what the query plan looks like under GPU:</p>
<pre class="brush:sql; toolbar: false; auto-links: false;highlight: [6,7]">scala> sql("SELECT fact.id, fact.k FROM fact JOIN dim ON fact.k = dim.k AND dim.id < 2").explain<br />== Physical Plan ==<br />GpuColumnarToRow false<br />+- GpuProject [id#0L, k#1]<br /> +- GpuBroadcastHashJoin [cast(k#1 as bigint)], [k#5L], Inner, GpuBuildRight<br /> :- GpuFileGpuScan parquet [id#0L,k#1] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[hdfs://nm:port/tmp/myfact], PartitionFilters: [isnotnull(k#1), dynamicpruningexpression(cast(k#1 as bigint) IN dynamicpruning#10)], PushedFilters: [], ReadSchema: struct<id:bigint><br /> : +- SubqueryBroadcast dynamicpruning#10, 0, [k#5L], [id=#51]<br /> : +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true])), [id=#50]<br /> : +- GpuColumnarToRow false<br /> : +- GpuProject [k#5L]<br /> : +- GpuCoalesceBatches TargetSize(2147483647)<br /> : +- GpuFilter ((gpuisnotnull(id#4L) AND (id#4L < 2)) AND gpuisnotnull(k#5L))<br /> : +- GpuFileGpuScan parquet [id#4L,k#5L] Batched: true, DataFilters: [isnotnull(id#4L), (id#4L < 2), isnotnull(k#5L)], Format: Parquet, Location: InMemoryFileIndex[hdfs://nm:port/tmp/mydim], PartitionFilters: [], PushedFilters: [IsNotNull(id), LessThan(id,2), IsNotNull(k)], ReadSchema: struct<id:bigint,k:bigint><br /> +- GpuBroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true])), [id=#70]<br /> +- GpuProject [k#5L]<br /> +- GpuCoalesceBatches TargetSize(2147483647)<br /> +- GpuFilter ((gpuisnotnull(id#4L) AND (id#4L < 2)) AND gpuisnotnull(k#5L))<br /> +- GpuFileGpuScan parquet [id#4L,k#5L] Batched: true, DataFilters: [isnotnull(id#4L), (id#4L < 2), isnotnull(k#5L)], Format: Parquet, Location: InMemoryFileIndex[hdfs://nm:port/tmp/mydim], PartitionFilters: [], PushedFilters: [IsNotNull(id), LessThan(id,2), IsNotNull(k)], ReadSchema: struct<id:bigint,k:bigint></pre>
<p>As you can see, DPP is also happening, because when scanning the fact table we get:<br /></p><p>PartitionFilters: [isnotnull(k#1), <span style="color: red;">dynamicpruningexpression</span>(cast(k#1 as bigint) IN dynamicpruning#10)]<br /></p><p>However, here we see that the sub-query on the dimension table is executed twice. </p><p>This performance overhead should be minimal since normally the "broadcast side" sub-query is very lightweight. </p><p>The ongoing improvement for DPP is tracked under <a href="https://github.com/NVIDIA/spark-rapids/issues/386" rel="nofollow" target="_blank">this issue</a>.<br /></p><p>This is why it is also mentioned in the current version of the <a href="https://nvidia.github.io/spark-rapids/docs/FAQ.html" rel="nofollow" target="_blank">FAQ</a>:<br /></p><p> "Is Dynamic Partition Pruning (DPP) Supported?<br />Yes, DPP still works. It might not be as efficient as it could be, and we are working to improve it."</p><h1 style="text-align: left;">Key Takeaways:</h1><p>DPP is a good feature for star-schema queries.</p><p>It uses <a href="http://www.openkb.info/2021/02/spark-tuning-use-partition-discovery.html" target="_blank">partition pruning</a> and <a href="http://www.openkb.info/2021/02/spark-tuning-explaining-spark-sql-join.html" rel="nofollow" target="_blank">broadcast hash join</a> together. </p><p>It currently only supports equi-joins.</p><p>The table to prune (the fact table) should be partitioned by the join key.<br /></p><h1 style="text-align: left;">References:</h1><ul style="text-align: left;"><li><a href="https://dzone.com/articles/dynamic-partition-pruning-in-spark-30" rel="nofollow" target="_blank">https://dzone.com/articles/dynamic-partition-pruning-in-spark-30</a> </li><li><a href="https://medium.com/@prabhakaran.electric/spark-3-0-feature-dynamic-partition-pruning-dpp-to-avoid-scanning-irrelevant-data-1a7bbd006a89" rel="nofollow" target="_blank">https://medium.com/@prabhakaran.electric/spark-3-0-feature-dynamic-partition-pruning-dpp-to-avoid-scanning-irrelevant-data-1a7bbd006a89</a> </li><li><a href="https://databricks.com/session_eu19/dynamic-partition-pruning-in-apache-spark" rel="nofollow" target="_blank">https://databricks.com/session_eu19/dynamic-partition-pruning-in-apache-spark</a> </li><li><a href="https://www.slideshare.net/databricks/dynamic-partition-pruning-in-apache-spark" rel="nofollow" target="_blank">https://www.slideshare.net/databricks/dynamic-partition-pruning-in-apache-spark</a> </li><li><a href="https://www.waitingforcode.com/apache-spark-sql/whats-new-apache-spark-3-dynamic-partition-pruning/read#configuration " rel="nofollow" target="_blank">https://www.waitingforcode.com/apache-spark-sql/whats-new-apache-spark-3-dynamic-partition-pruning/read#configuration </a><br /></li></ul><p> </p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-43410482598540328572021-03-11T14:33:00.004-08:002021-03-11T14:37:34.798-08:00How to use NVIDIA GPUs in docker container<h1 style="text-align: left;">Goal:</h1><p>This is a quick note on how to use NVIDIA GPUs in a docker container.<br /></p><h1 style="text-align: left;">Env:</h1><p>Ubuntu 18.04</p><p><span>Docker 20.10.5<br /></span></p><a name='more'></a> <p></p><h1 style="text-align: left;">Solution:</h1><p>The key is to install the <b><i>NVIDIA Container Toolkit</i></b>, which is why this note is quick:) <br /></p><h3 style="text-align: left;">1. 
Install Docker on the host machine where the NVIDIA driver is already installed.<br /></h3><p><a href="https://docs.docker.com/engine/install/ubuntu/">https://docs.docker.com/engine/install/ubuntu/</a></p><div style="text-align: left;">Note: Refer to this post on <a href="http://www.openkb.info/2021/03/how-to-intall-cuda-toolkit-and-nvidia.html" rel="" target="_blank">how to install CUDA Toolkit and NVIDIA Driver on ubuntu</a>. <br /></div><h3 style="text-align: left;">2. Install <i>NVIDIA Container Toolkit</i> on the host machine<br /></h3><p><a href="https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker " rel="nofollow" target="_blank">https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker </a><br /></p><h3 style="text-align: left;">3. Test<br /></h3><pre class="brush:bash; toolbar: false; auto-links: false">sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi</pre><p>Or only expose the first GPU (with device=0) instead of all GPUs to the docker container: <br /></p><pre class="brush:bash; toolbar: false; auto-links: false">docker run --rm --gpus device=0 nvidia/cuda:11.0-base nvidia-smi</pre>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0tag:blogger.com,1999:blog-929270410515568702.post-1093777288831671792021-03-10T12:21:00.003-08:002021-03-10T12:21:24.241-08:00Understanding RAPIDS Accelerator For Apache Spark parameter -- spark.rapids.memory.gpu.allocFraction and GPU pool related ones.<h1 style="text-align: left;">Goal:</h1><p>This article explains the RAPIDS Accelerator For Apache Spark parameter -- <b><i>spark.rapids.memory.gpu.allocFraction</i></b> and other GPU memory pool related ones: <b><i>spark.rapids.memory.gpu.maxAllocFraction</i></b>, <b><i>spark.rapids.memory.gpu.reserve</i></b>, <b><i>spark.rapids.memory.gpu.debug</i></b> and <i><b>spark.rapids.memory.gpu.pool</b></i>. <span></span></p><a name='more'></a><p></p><h1 style="text-align: left;">Env:</h1><p>Spark 3.1.1</p><p>RAPIDS Accelerator For Apache Spark 0.4</p><p>Quadro RTX 6000 with 24G memory<br /></p><h1 style="text-align: left;">Solution:</h1><h3 style="text-align: left;">1. Concept <br /></h3><p>As per the <a href="https://github.com/NVIDIA/spark-rapids/blob/main/docs/configs.md" rel="nofollow" target="_blank">configuration guide</a>, <b><i>spark.rapids.memory.gpu.pooling.enabled</i></b> is DEPRECATED and we should use <b><i>spark.rapids.memory.gpu.pool</i></b> to switch on or off the GPU memory pooling feature, and also to choose which RMM (RAPIDS Memory Manager) pooling allocator to use. </p><ul style="text-align: left;"><li>ARENA: rmm::mr::arena_memory_resource</li><li>DEFAULT: rmm::mr::pool_memory_resource</li><li>NONE: Turn off pooling, and RMM just passes through to CUDA memory allocation directly<br /></li></ul><p>Even though the value "DEFAULT" could be confusing, as of now we would recommend "ARENA". 
</p><p>To learn more about RMM, the blog post "<a href="https://developer.nvidia.com/blog/fast-flexible-allocation-for-cuda-with-rapids-memory-manager/" rel="nofollow" target="_blank">Fast, Flexible Allocation for NVIDIA CUDA with RAPIDS Memory Manager</a>" is a helpful read.</p><p>If you want to dig into the source code of RMM, here it is: <a href="https://github.com/rapidsai/rmm" rel="nofollow" target="_blank">https://github.com/rapidsai/rmm</a>.<br /></p><p>In this article, I will use ARENA for all the tests below.<br /></p><p>After GPU memory pooling is enabled, the 3 parameters below control how much memory will be pooled:</p><ul style="text-align: left;"><li><b><i>spark.rapids.memory.gpu.allocFraction</i></b>: The fraction of total GPU memory that should be initially allocated for pooled memory. Default 0.9.</li><li><b><i>spark.rapids.memory.gpu.maxAllocFraction</i></b>: The fraction of total GPU memory that limits the maximum size of the RMM pool. Default 1.0.<br /></li><li><b><i>spark.rapids.memory.gpu.reserve</i></b>: The amount of GPU memory that should remain unallocated by RMM and left for system use such as memory needed for kernels, kernel launches or JIT compilation. Default 1g.<br /></li></ul><p>Simply put, the default setting means 90% of the GPU memory will be pooled, but the maximum cannot exceed 100% - 1g.</p><p>Finally, there is another parameter <b><i>spark.rapids.memory.gpu.debug</i></b> which can be used to enable debug logging to STDOUT or STDERR. Default is NONE.<br /></p><h3 style="text-align: left;">2. Test<br /></h3><p>In the tests below, I keep <b><i>spark.rapids.memory.gpu.maxAllocFraction</i></b> at the default 1, change <b><i>spark.rapids.memory.gpu.allocFraction</i></b> and <b><i>spark.rapids.memory.gpu.reserve</i></b>, and in the meantime monitor the logs and <a href="http://www.openkb.info/2021/03/how-to-monitor-nvidia-gpu-performance.html" rel="nofollow" target="_blank">nvidia-smi</a> output after "spark-shell" is launched with only 1 executor on a single node.<br /></p><h4 style="text-align: left;"><b>a. Default </b></h4>
<pre class="brush:sql; toolbar: false; auto-links: false">spark.rapids.memory.gpu.allocFraction 0.9 (default)<br />spark.rapids.memory.gpu.reserve 1073741824 (default)</pre>
<p>GPU memory utilization:</p><pre class="brush:bash; toolbar: false; auto-links: false">utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]<br />0 %, 0 %, 24220 MiB, 24209 MiB, 11 MiB<br />4 %, 0 %, 24220 MiB, 23801 MiB, 419 MiB<br />3 %, 0 %, 24220 MiB, 1719 MiB, 22501 MiB<br />0 %, 0 %, 24220 MiB, 1719 MiB, 22501 MiB<br />0 %, 0 %, 24220 MiB, 1693 MiB, 22527 MiB</pre><p>Executor Log: <br /></p><pre class="brush:bash; toolbar: false; auto-links: false">21/03/10 10:42:25 INFO RapidsExecutorPlugin: Initializing memory from Executor Plugin<br />21/03/10 10:42:30 INFO GpuDeviceManager: Initializing RMM ARENA initial size = 21798.28125 MB, max size = 23196.3125 MB on gpuId 0</pre>
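<p>A quick back-of-the-envelope check of the numbers above (just a sketch, assuming the initial pool is total * allocFraction and the maximum is total - reserve, which matches what the log reports):</p><pre class="brush:scala; toolbar: false; auto-links: false">// Sketch: rough sanity check of the pool sizes reported in the executor log above<br />val totalMiB   = 24220.0           // memory.total reported by nvidia-smi<br />val allocFrac  = 0.9               // spark.rapids.memory.gpu.allocFraction<br />val reserveMiB = 1024.0            // spark.rapids.memory.gpu.reserve (1g)<br />println(totalMiB * allocFrac)      // ~21798, matches "initial size = 21798.28125 MB"<br />println(totalMiB - reserveMiB)     // 23196, matches "max size = 23196.3125 MB"</pre>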
<p style="text-align: left;"></p><h4 style="text-align: left;">b. Increased spark.rapids.memory.gpu.allocFraction from 0.9 to 0.99</h4>
<pre class="brush:bash; toolbar: false; auto-links: false">spark.rapids.memory.gpu.allocFraction=0.99<br />spark.rapids.memory.gpu.reserve 1073741824 (default)</pre>
<p>GPU memory utilization:</p><pre class="brush:bash; toolbar: false; auto-links: false">utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]<br />0 %, 0 %, 24220 MiB, 24209 MiB, 11 MiB<br />0 %, 0 %, 24220 MiB, 24161 MiB, 59 MiB<br />3 %, 0 %, 24220 MiB, 23723 MiB, 497 MiB<br />0 %, 0 %, 24220 MiB, 321 MiB, 23899 MiB<br />0 %, 0 %, 24220 MiB, 321 MiB, 23899 MiB<br />0 %, 0 %, 24220 MiB, 321 MiB, 23899 MiB<br />0 %, 0 %, 24220 MiB, 297 MiB, 23923 MiB</pre><p>Executor Log: <br /></p><pre class="brush:bash; toolbar: false; auto-links: false">21/03/10 10:46:54 INFO RapidsExecutorPlugin: Initializing memory from Executor Plugin<br />21/03/10 10:46:59 WARN GpuDeviceManager: Initial RMM allocation (23978.109375 MB) is larger than free memory (23519.3125 MB)<br />21/03/10 10:46:59 WARN GpuDeviceManager: Initial RMM allocation (23978.109375 MB) is larger than the adjusted maximum allocation (23196.3125 MB), lowering initial allocation to the adjusted maximum allocation.<br />21/03/10 10:46:59 INFO GpuDeviceManager: Initializing RMM ARENA initial size = 23196.3125 MB, max size = 23196.3125 MB on gpuId 0</pre>
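<p>The warnings make sense with the same back-of-the-envelope math (again a sketch, not the exact internal calculation): the requested initial pool now exceeds the adjusted maximum, so RMM lowers the initial allocation to that maximum:</p><pre class="brush:scala; toolbar: false; auto-links: false">// Sketch: with allocFraction=0.99 the requested initial pool is larger than the adjusted maximum (total - 1g reserve)<br />val totalMiB    = 24220.0<br />val requested   = totalMiB * 0.99     // ~23978, matches "Initial RMM allocation (23978.109375 MB)"<br />val adjustedMax = totalMiB - 1024.0   // 23196, matches "adjusted maximum allocation (23196.3125 MB)"<br />println(requested > adjustedMax)      // true -> initial allocation lowered to the adjusted maximum</pre>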
<p style="text-align: left;"></p><h4 style="text-align: left;">c. Increased spark.rapids.memory.gpu.allocFraction from 0.9 to 0.99 and also spark.rapids.memory.gpu.reserve from 1g to 2g</h4>
<pre class="brush:bash; toolbar: false; auto-links: false">spark.rapids.memory.gpu.allocFraction=0.99<br />spark.rapids.memory.gpu.reserve 2147483648</pre>
<p>GPU memory utilization:</p><pre class="brush:bash; toolbar: false; auto-links: false">utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]<br />0 %, 0 %, 24220 MiB, 24209 MiB, 11 MiB<br />4 %, 0 %, 24220 MiB, 24041 MiB, 179 MiB<br />5 %, 0 %, 24220 MiB, 23711 MiB, 509 MiB<br />0 %, 0 %, 24220 MiB, 1345 MiB, 22875 MiB<br />0 %, 0 %, 24220 MiB, 1345 MiB, 22875 MiB<br />0 %, 0 %, 24220 MiB, 1345 MiB, 22875 MiB<br />0 %, 0 %, 24220 MiB, 1321 MiB, 22899 MiB </pre><p>Executor Log: <br /></p><pre class="brush:bash; toolbar: false; auto-links: false">21/03/10 10:49:49 INFO RapidsExecutorPlugin: Initializing memory from Executor Plugin<br />21/03/10 10:49:54 WARN GpuDeviceManager: Initial RMM allocation (23978.109375 MB) is larger than free memory (23519.3125 MB)<br />21/03/10 10:49:54 WARN GpuDeviceManager: Initial RMM allocation (23978.109375 MB) is larger than the adjusted maximum allocation (22172.3125 MB), lowering initial allocation to the adjusted maximum allocation.<br />21/03/10 10:49:54 INFO GpuDeviceManager: Initializing RMM ARENA initial size = 22172.3125 MB, max size = 22172.3125 MB on gpuId 0</pre>
<p style="text-align: left;"></p><h4 style="text-align: left;"> d. Disable GPU memory pool</h4><pre class="brush:bash; toolbar: false; auto-links: false">spark.rapids.memory.gpu.pool NONE</pre><p>GPU memory utilization:</p><pre class="brush:bash; toolbar: false; auto-links: false">utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]<br />0 %, 0 %, 24220 MiB, 24209 MiB, 11 MiB<br />3 %, 0 %, 24220 MiB, 23891 MiB, 329 MiB<br />5 %, 0 %, 24220 MiB, 23567 MiB, 653 MiB<br />0 %, 0 %, 24220 MiB, 23519 MiB, 701 MiB<br />0 %, 0 %, 24220 MiB, 23519 MiB, 701 MiB<br />1 %, 0 %, 24220 MiB, 23495 MiB, 725 MiB</pre>
<p>Executor Log: <br /></p><pre class="brush:bash; toolbar: false; auto-links: false">21/03/10 12:03:07 INFO RapidsExecutorPlugin: Initializing memory from Executor Plugin<br />21/03/10 12:03:12 INFO GpuDeviceManager: Initializing RMM initial size = 21798.28125 MB, max size = 0.0 MB on gpuId 0</pre>
<h4 style="text-align: left;">e. Enable DEBUG <br /></h4><pre class="brush:bash; toolbar: false; auto-links: false">spark.rapids.memory.gpu.debug STDOUT</pre>
<p>stdout:<br /></p><pre class="brush:bash; toolbar: false; auto-links: false">$ tail -100f stdout<br />Thread,Time,Action,Pointer,Size,Stream<br />15129,11:04:56:292725,allocate,0x7f7192600000,18480,0x0<br />15129,11:04:56:293529,allocate,0x7f7140000000,50686648,0x0<br />15129,11:04:56:317040,allocate,0x7f7143200000,14174424,0x0<br />15129,11:04:56:319691,allocate,0x7f7192800000,13951936,0x0<br />15129,11:04:56:321843,allocate,0x7f713e000000,13936328,0x0<br />15129,11:04:56:323874,allocate,0x7f713ee00000,13929272,0x0<br />15129,11:04:56:325937,allocate,0x7f7192604a00,26432,0x0<br />15129,11:04:56:326309,allocate,0x7f7134000000,139910792,0x0<br />15129,11:04:56:326346,allocate,0x7f719260b200,13216,0x0<br />15129,11:04:56:326371,allocate,0x7f719260e600,6608,0x0<br />15129,11:04:56:370310,free,0x7f719260e600,6608,0x0<br />15129,11:04:56:370327,free,0x7f719260b200,13216,0x0<br />15129,11:04:56:370335,free,0x7f7140000000,50686648,0x0<br />15129,11:04:56:370490,free,0x7f7143200000,14174424,0x0<br />15129,11:04:56:371885,free,0x7f7192800000,13951936,0x0</pre><h3 style="text-align: left;">3. Key takeaways</h3><p>Allocating memory on a GPU can be an expensive operation, so it is recommended to use the GPU memory pool feature. </p><p>The DEBUG log is useful because it shows each allocate/free action.<br /></p><p><br /></p>OpenKBhttp://www.blogger.com/profile/02892129494774761942noreply@blogger.com0