Goal:
This article analyzes the Spark SQL source code to determine which data types are order-able (i.e. sort-able).
We will look into the source code logic for method "isOrderable" of object org.apache.spark.sql.catalyst.expressions.RowOrdering.
Env:
Spark 2.4 source code
Solution:
The source code for method "isOrderable" is:
/**
 * Returns true iff the data type can be ordered (i.e. can be sorted).
 */
def isOrderable(dataType: DataType): Boolean = dataType match {
  case NullType => true
  case dt: AtomicType => true
  case struct: StructType => struct.fields.forall(f => isOrderable(f.dataType))
  case array: ArrayType => isOrderable(array.elementType)
  case udt: UserDefinedType[_] => isOrderable(udt.sqlType)
  case _ => false
}
So NullType and any AtomicType are order-able. For the complex types (StructType, ArrayType, UserDefinedType), order-ability depends recursively on their field or element types.
Let's now take a look at all of the Spark SQL data types in org.apache.spark.sql.types.
1. NullType
import org.apache.spark.sql.catalyst.expressions.RowOrdering
import org.apache.spark.sql.types._

scala> RowOrdering.isOrderable(NullType)
res0: Boolean = true
2. Classes which extend AtomicType
scala> RowOrdering.isOrderable(BinaryType)
res29: Boolean = true

scala> RowOrdering.isOrderable(BooleanType)
res6: Boolean = true

scala> RowOrdering.isOrderable(DateType)
res31: Boolean = true

scala> RowOrdering.isOrderable(StringType)
res1: Boolean = true

scala> RowOrdering.isOrderable(TimestampType)
res19: Boolean = true
There is another abstract class, HiveStringType, which also extends AtomicType.
But as per the comment below, it should be replaced by a StringType before analysis, and it was removed entirely in Spark 3.1.
/**
 * A hive string type for compatibility. These datatypes should only used for parsing,
 * and should NOT be used anywhere else. Any instance of these data types should be
 * replaced by a [[StringType]] before analysis.
 */
sealed abstract class HiveStringType extends AtomicType
3. Classes which extend IntegralType
Both "abstract class IntegralType extends NumericType" and "abstract class NumericType extends AtomicType" are defined in AbstractDataType.scala.
So any class which extends IntegralType is also order-able:
scala> RowOrdering.isOrderable(ByteType)
res11: Boolean = true

scala> RowOrdering.isOrderable(IntegerType)
res3: Boolean = true

scala> RowOrdering.isOrderable(LongType)
res13: Boolean = true

scala> RowOrdering.isOrderable(ShortType)
res14: Boolean = true
4. Classes which extend FractionalType
Similarly, "abstract class FractionalType extends NumericType" is also defined in AbstractDataType.scala.
So any class which extends FractionalType is also order-able:
scala> RowOrdering.isOrderable(DecimalType(10, 5))
res17: Boolean = true

scala> RowOrdering.isOrderable(DoubleType)
res2: Boolean = true

scala> RowOrdering.isOrderable(FloatType)
res13: Boolean = true
5. Spark SQL data types which are not order-able
scala> RowOrdering.isOrderable(CalendarIntervalType)
res26: Boolean = false

scala> RowOrdering.isOrderable(DataTypes.createMapType(StringType, StringType))
res9: Boolean = false

scala> RowOrdering.isOrderable(ObjectType(classOf[java.lang.Integer]))
res23: Boolean = false
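In practice, isOrderable is what makes sorting on such a column fail at analysis time rather than at run time. A minimal spark-shell sketch (the DataFrame and column names here are made up for illustration, and the exact error message varies by Spark version):

```scala
// A DataFrame with a map column, which is not order-able
scala> val df = Seq((1, Map("k" -> "v"))).toDF("id", "m")

// Sorting by the map column is rejected by the analyzer,
// e.g. with an AnalysisException like:
//   cannot resolve 'm ASC NULLS FIRST' due to data type mismatch:
//   cannot sort data type map<string,string>
scala> df.orderBy("m").show()
```

Sorting by the order-able "id" column, by contrast, works as usual.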
6. Complex Spark SQL data types
If the ArrayType's element type is order-able, then the ArrayType is order-able; otherwise it is not.
scala> RowOrdering.isOrderable(ArrayType(IntegerType))
res22: Boolean = true

scala> RowOrdering.isOrderable(ArrayType(CalendarIntervalType))
res27: Boolean = false
If all of the field types of a StructType are order-able, then the StructType is order-able; otherwise it is not.
scala> RowOrdering.isOrderable(new StructType().add("a", IntegerType).add("b", StringType))
res6: Boolean = true

scala> RowOrdering.isOrderable(new StructType().add("a", IntegerType).add("b", CalendarIntervalType))
res7: Boolean = false
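Because the StructType and ArrayType cases call isOrderable recursively, the checks compose: a nested type is order-able only if every leaf type is. A quick spark-shell sketch of this (the results follow directly from the pattern match shown earlier):

```scala
// Array of structs whose only field is an AtomicType: order-able
scala> RowOrdering.isOrderable(ArrayType(new StructType().add("a", IntegerType)))
res0: Boolean = true

// Array of structs containing a MapType: the map hits "case _ => false",
// so the whole nested type is not order-able
scala> RowOrdering.isOrderable(ArrayType(new StructType().add("a", DataTypes.createMapType(StringType, StringType))))
res1: Boolean = false
```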
7. UserDefinedType
As per the comment below, UserDefinedType was made private in Spark 2.0, so it should not be used directly.
/**
 * Note: This was previously a developer API in Spark 1.x. We are making this private in Spark 2.0
 * because we will very likely create a new version of this that works better with Datasets.
 */