Thursday, March 4, 2021

Error java.lang.NoSuchMethodException when running spark-sql-perf with Hive Metastore 3.x

Symptom:

When running spark-sql-perf against Hive Metastore 3.x, it fails with the following error:

java.lang.NoSuchMethodException: org.apache.hadoop.hive.ql.metadata.Hive.alterTable(java.lang.String, org.apache.hadoop.hive.ql.metadata.Table, org.apache.hadoop.hive.metastore.api.EnvironmentContext)

Env:

Hive 3.x

Wednesday, March 3, 2021

Docker for Mac: Could not find ~/Library/Containers/com.docker.docker/Data/vms/0/tty to access volume location

Symptom:

To access a volume's data on the host machine, we normally inspect the volume to find its mount point:

$ docker volume inspect todo-db|grep Mountpoint
"Mountpoint": "/var/lib/docker/volumes/todo-db/_data",

On a Linux host, this is straightforward.
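When the mount point is needed in a script rather than at the prompt, parsing the JSON that `docker volume inspect` prints is sturdier than grepping for a line. The following is a minimal sketch: the helper name `volume_mountpoints` is hypothetical, and the sample text is trimmed and illustrative rather than captured from a real run.

```python
import json


def volume_mountpoints(inspect_output: str) -> dict:
    """Map each volume name to its Mountpoint.

    `inspect_output` is the JSON array printed by `docker volume inspect`.
    """
    volumes = json.loads(inspect_output)
    return {volume["Name"]: volume["Mountpoint"] for volume in volumes}


# Illustrative sample, reduced to the fields used above.
sample = """
[
    {
        "Driver": "local",
        "Mountpoint": "/var/lib/docker/volumes/todo-db/_data",
        "Name": "todo-db"
    }
]
"""

print(volume_mountpoints(sample))
```

In real use, the input would come from running `docker volume inspect todo-db` (for example through `subprocess.run` with `capture_output=True`). For a single value, `docker volume inspect --format '{{ .Mountpoint }}' todo-db` avoids the parsing step altogether.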

On a Mac, however, the Docker daemon runs inside a small VM, so we first need to log in to that VM, normally with the command below:

screen ~/Library/Containers/com.docker.docker/Data/vms/0/tty

However, in the latest Docker Desktop for Mac, the "tty" file does not exist, so the login fails.

Env:

Docker Desktop for Mac 20.10.3

macOS Catalina

Thursday, February 25, 2021

Spark Code -- Unified Memory Manager

Goal:

This article digs into the Unified Memory Manager, which has been the default memory management framework in Spark since 1.6.

We will explain why the executor memory size shown in the Spark UI differs slightly from the one in the executor log.
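As a rough illustration of the arithmetic involved, the sketch below computes the size of the unified memory pool from a heap size. It assumes the 300 MiB reserved system memory and the `spark.memory.fraction` default of 0.6 used by recent Spark releases; the function name and the 4 GiB heap are hypothetical. The last two lines print the same byte count in binary and decimal units, because mixed unit conventions are one common source of small discrepancies between two displays. That is an assumption for illustration, not a summary of the article's conclusion.

```python
# Assumption: fixed reserve used by the unified memory manager (300 MiB).
RESERVED_SYSTEM_MEMORY_BYTES = 300 * 1024 * 1024


def unified_memory_bytes(heap_bytes: int, memory_fraction: float = 0.6) -> int:
    """Return the size of the unified (execution + storage) pool in bytes.

    `heap_bytes` stands for the maximum heap the JVM reports, which is
    typically somewhat smaller than the configured executor memory.
    `memory_fraction` mirrors spark.memory.fraction (0.6 assumed as default).
    """
    usable = heap_bytes - RESERVED_SYSTEM_MEMORY_BYTES
    return int(usable * memory_fraction)


heap = 4 * 1024**3  # hypothetical 4 GiB heap
pool = unified_memory_bytes(heap)

print(f"{pool / 1024**2:.1f} MiB")  # binary units (1 MiB = 1024 * 1024 bytes)
print(f"{pool / 1000**2:.1f} MB")   # decimal units (1 MB = 1000 * 1000 bytes)
```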

Wednesday, February 24, 2021

Spark on GPU -- Hands on GCP Dataproc to test Spark on GPU using RAPIDS

Goal:

This article walks through a quick hands-on test of Spark on GPU using RAPIDS on GCP Dataproc.

We will provide step-by-step instructions on how to create a single-node Dataproc cluster to test Spark on GPU.

Wednesday, February 17, 2021

Spark Tuning -- Understanding the Spill from a Cartesian Product

Goal:

This article explains the spilling caused by a Cartesian product.

We will explain the meaning of the two parameters below, as well as the "Shuffle Spill (Memory)" and "Shuffle Spill (Disk)" metrics shown in the web UI.

  • spark.sql.cartesianProductExec.buffer.in.memory.threshold
  • spark.sql.cartesianProductExec.buffer.spill.threshold
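For experimenting with the two parameters above, one option is to pass them to `spark-submit` as `--conf key=value` pairs. The sketch below only builds that argument list; the helper name `to_conf_args` is hypothetical, and the values are placeholders chosen for illustration, not recommendations or documented defaults.

```python
def to_conf_args(conf: dict) -> list:
    """Turn a {key: value} mapping into spark-submit style --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Placeholder values for illustration only.
cartesian_conf = {
    "spark.sql.cartesianProductExec.buffer.in.memory.threshold": 4096,
    "spark.sql.cartesianProductExec.buffer.spill.threshold": 100000,
}

print(" ".join(to_conf_args(cartesian_conf)))
```

Varying one value at a time while watching the two spill metrics in the web UI makes it easier to attribute any change to a single parameter.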

Tuesday, February 16, 2021

Spark Tuning -- explaining Spark SQL Join Types

Goal:

This article explains the different types of joins in Spark SQL using sample queries and explain plans.

We will cover the use case for each join type, the code logic for join selection, and join hints.

Wednesday, February 10, 2021

Spark Code -- Which Spark SQL data type isOrderable?

Goal:

This article analyzes the source code to determine which Spark SQL data types are orderable (sortable).

We will look into the source code logic of the method "isOrderable" of object org.apache.spark.sql.catalyst.expressions.RowOrdering.

We are interested in "isOrderable" because SparkStrategies.scala uses it when choosing join types, which we will dig into in another post.
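The recursive shape of such a check can be modeled in a few lines. The sketch below is a simplified Python model, not the Spark source: the classes are toy stand-ins for Catalyst types, the user-defined type case is left out, and the rules reflect the Spark 3.x logic as best recalled (null and atomic types are orderable, structs and arrays are orderable when their contents are, everything else such as maps is not). It should be checked against the Spark version in use.

```python
from dataclasses import dataclass
from typing import List


class DataType:
    """Toy stand-in for Catalyst's DataType."""


class NullType(DataType):
    pass


class AtomicType(DataType):
    """Stands in for types such as IntegerType or StringType."""


@dataclass
class ArrayType(DataType):
    element_type: DataType


@dataclass
class StructType(DataType):
    field_types: List[DataType]


@dataclass
class MapType(DataType):
    key_type: DataType
    value_type: DataType


def is_orderable(data_type: DataType) -> bool:
    """Simplified model of an isOrderable-style recursive check."""
    if isinstance(data_type, (NullType, AtomicType)):
        return True
    if isinstance(data_type, StructType):
        return all(is_orderable(field) for field in data_type.field_types)
    if isinstance(data_type, ArrayType):
        return is_orderable(data_type.element_type)
    return False  # anything else, including maps, in this model


nested = StructType([AtomicType(), ArrayType(AtomicType())])
with_map = StructType([AtomicType(), MapType(AtomicType(), AtomicType())])
print(is_orderable(nested), is_orderable(with_map))
```

The point of the model is that a single non-orderable leaf makes the whole enclosing struct or array non-orderable.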

Monday, February 8, 2021

Spark Tuning -- Understand Cost Based Optimizer in Spark

Goal:

This article explains the Spark CBO (Cost-Based Optimizer) with examples and shows how to check table statistics.

Friday, February 5, 2021
