Wednesday, February 10, 2021

Spark Code -- Which Spark SQL data type isOrderable?

Goal:

This article analyzes the Spark source code to find out which Spark SQL data types are order-able (sort-able).

We will look into the logic of the method "isOrderable" of the object org.apache.spark.sql.catalyst.expressions.RowOrdering.

We are interested in "isOrderable" because SparkStrategies.scala uses it when choosing a join strategy, which we will dig into in another post.

Env:

Spark 2.4 source code

Solution:

The source code for method "isOrderable" is:

/**
 * Returns true iff the data type can be ordered (i.e. can be sorted).
 */
def isOrderable(dataType: DataType): Boolean = dataType match {
  case NullType => true
  case dt: AtomicType => true
  case struct: StructType => struct.fields.forall(f => isOrderable(f.dataType))
  case array: ArrayType => isOrderable(array.elementType)
  case udt: UserDefinedType[_] => isOrderable(udt.sqlType)
  case _ => false
}

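So any NullType or AtomicType is order-able. A StructType, ArrayType or UserDefinedType is order-able only if the types it is built from are order-able. Every other type falls through to the default case and is not order-able.

Let's now take a look at all of the Spark SQL data types in org.apache.spark.sql.types.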

1. NullType

import org.apache.spark.sql.catalyst.expressions.RowOrdering
import org.apache.spark.sql.types._

scala> RowOrdering.isOrderable(NullType)
res0: Boolean = true

2. Classes which extend AtomicType

scala> RowOrdering.isOrderable(BinaryType)
res29: Boolean = true

scala> RowOrdering.isOrderable(BooleanType)
res6: Boolean = true

scala> RowOrdering.isOrderable(DateType)
res31: Boolean = true

scala> RowOrdering.isOrderable(StringType)
res1: Boolean = true

scala> RowOrdering.isOrderable(TimestampType)
res19: Boolean = true

HiveStringType is another abstract class which extends AtomicType. It is not tested above because, as per the source comment below, any instance of it should be replaced by a StringType before analysis. It was removed in Spark 3.1.

/**
 * A hive string type for compatibility. These datatypes should only used for parsing,
 * and should NOT be used anywhere else. Any instance of these data types should be
 * replaced by a [[StringType]] before analysis.
 */
sealed abstract class HiveStringType extends AtomicType

3. Classes which extend IntegralType

AbstractDataType.scala declares "abstract class IntegralType extends NumericType" and "abstract class NumericType extends AtomicType".

So any class which extends IntegralType is also an AtomicType and therefore order-able:

scala> RowOrdering.isOrderable(ByteType)
res11: Boolean = true

scala> RowOrdering.isOrderable(IntegerType)
res3: Boolean = true

scala> RowOrdering.isOrderable(LongType)
res13: Boolean = true

scala> RowOrdering.isOrderable(ShortType)
res14: Boolean = true

4. Classes which extend FractionalType

AbstractDataType.scala declares "abstract class FractionalType extends NumericType" and "abstract class NumericType extends AtomicType".

So any class which extends FractionalType is also an AtomicType and therefore order-able:

scala> RowOrdering.isOrderable(DecimalType(10,5))
res17: Boolean = true

scala> RowOrdering.isOrderable(DoubleType)
res2: Boolean = true

scala> RowOrdering.isOrderable(FloatType)
res13: Boolean = true
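
The individual checks in sections 2 to 4 can also be run in one pass. The following is a minimal sketch, assuming a Spark 2.4 spark-shell; the names atomicSamples and numericSamples are made up for illustration. As far as I can tell, AtomicType, IntegralType and FractionalType are not accessible from user code in Spark 2.4, so the sketch filters on the public NumericType parent class instead.

```scala
import org.apache.spark.sql.catalyst.expressions.RowOrdering
import org.apache.spark.sql.types._

// Hypothetical sample list: the atomic types exercised in sections 2 to 4.
val atomicSamples: Seq[DataType] = Seq(
  BinaryType, BooleanType, DateType, StringType, TimestampType,
  ByteType, ShortType, IntegerType, LongType,
  DecimalType(10, 5), DoubleType, FloatType)

// Every entry matches `case dt: AtomicType => true`.
val allOrderable: Boolean = atomicSamples.forall(t => RowOrdering.isOrderable(t))

// The integral and fractional types, selected through NumericType.
val numericSamples: Seq[DataType] = atomicSamples.filter(_.isInstanceOf[NumericType])
```

Here allOrderable is true, and numericSamples holds the seven integral and fractional types from sections 3 and 4.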

5. Spark SQL data types which are not order-able

scala> RowOrdering.isOrderable(CalendarIntervalType)
res26: Boolean = false

scala> RowOrdering.isOrderable(DataTypes.createMapType(StringType,StringType))
res9: Boolean = false

scala> RowOrdering.isOrderable(ObjectType(classOf[java.lang.Integer]))
res23: Boolean = false
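
In practice these types usually show up as columns of a schema. The following is a minimal sketch, assuming a Spark 2.4 spark-shell, of a helper that lists the top-level columns which are not order-able. The helper name nonOrderableColumns and the sample schema are hypothetical and not part of Spark.

```scala
import org.apache.spark.sql.catalyst.expressions.RowOrdering
import org.apache.spark.sql.types._

// Hypothetical helper: names of top-level columns whose type is not order-able.
def nonOrderableColumns(schema: StructType): Seq[String] =
  schema.fields.toSeq
    .filterNot(f => RowOrdering.isOrderable(f.dataType))
    .map(_.name)

// Sample schema with one order-able and two non-order-able columns.
val schema = new StructType()
  .add("id", LongType)
  .add("props", MapType(StringType, StringType))
  .add("gap", CalendarIntervalType)

val blocked = nonOrderableColumns(schema)  // Seq("props", "gap")
```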

6. Complex Spark SQL data types

An ArrayType is order-able if and only if its element type is order-able.

scala> RowOrdering.isOrderable(ArrayType(IntegerType))
res22: Boolean = true

scala> RowOrdering.isOrderable(ArrayType(CalendarIntervalType))
res27: Boolean = false

A StructType is order-able if and only if all of its field types are order-able.

scala> RowOrdering.isOrderable(new StructType().add("a", IntegerType).add("b", StringType))
res6: Boolean = true

scala> RowOrdering.isOrderable(new StructType().add("a", IntegerType).add("b", CalendarIntervalType))
res7: Boolean = false
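
Because isOrderable calls itself on element and field types, the same rule applies at any nesting depth. The following is a minimal sketch, assuming a Spark 2.4 spark-shell; the value names are made up for illustration.

```scala
import org.apache.spark.sql.catalyst.expressions.RowOrdering
import org.apache.spark.sql.types._

// Array of structs whose fields are all atomic: every level of the recursion returns true.
val arrayOfStructs = ArrayType(
  new StructType().add("id", LongType).add("name", StringType))

// Same shape, but one struct field is a MapType, which hits `case _ => false`.
val arrayOfStructsWithMap = ArrayType(
  new StructType().add("id", LongType).add("tags", MapType(StringType, StringType)))

val nestedOk = RowOrdering.isOrderable(arrayOfStructs)          // true
val nestedBad = RowOrdering.isOrderable(arrayOfStructsWithMap)  // false
```

A single non-order-able type anywhere inside the nesting is enough to make the whole type non-order-able.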

7. UserDefinedType

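For a UserDefinedType, isOrderable delegates to the underlying udt.sqlType. As per the source comment below, UserDefinedType was made private in Spark 2.0, so it is not meant to be extended in user code.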

 * Note: This was previously a developer API in Spark 1.x. We are making this private in Spark 2.0
 * because we will very likely create a new version of this that works better with Datasets.
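
Although user code cannot define its own UserDefinedType in Spark 2.4, Spark ships a few built-in ones. The following is a sketch, assuming spark-mllib is on the classpath (as it is in spark-shell), that checks the ML vector type exposed as SQLDataTypes.VectorType. To my understanding its sqlType is a struct of a byte, an int and two arrays of numbers, so the `case udt: UserDefinedType[_]` branch should resolve to true, but treat that as an assumption to verify on your own Spark version.

```scala
import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.catalyst.expressions.RowOrdering

// VectorType is a built-in UDT; isOrderable inspects its underlying sqlType.
val vectorOrderable: Boolean = RowOrdering.isOrderable(SQLDataTypes.VectorType)
```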

