Thursday, January 28, 2021

Spark Code -- How to drop Null values in DataFrame/Dataset

Goal:

This article shares Scala example code to explain how to drop NULL values in a DataFrame/Dataset.

Solution:

DataFrameNaFunctions, accessed through a DataFrame's na field (df.na), provides several overloaded "drop" methods to drop NULL values under different scenarios.
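
For reference, the most commonly used overloads look roughly like this (paraphrased from the DataFrameNaFunctions Scaladoc; there are also Array[String] variants of the column-list versions, so check your Spark version's documentation for the exact signatures):

// Paraphrased signatures of DataFrameNaFunctions.drop, called as df.na.drop(...)
def drop(): DataFrame                                     // NULL (or NaN) in any column
def drop(how: String): DataFrame                          // how = "any" or "all"
def drop(cols: Seq[String]): DataFrame                    // consider only the listed columns
def drop(how: String, cols: Seq[String]): DataFrame
def drop(minNonNulls: Int): DataFrame                     // keep rows with at least minNonNulls non-null values
def drop(minNonNulls: Int, cols: Seq[String]): DataFrame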

Let's first create a sample DataFrame as the data source:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DoubleType, LongType, BooleanType}

val simpleData = Seq(
  Row("Jim", "", "Green", 33333, 3000.12, 19605466456L, true),
  Row("Tom", "A", "Smith", 44444, 4000.45, 19886546456L, null),
  Row("Jerry ", null, "Brown", null, 5000.67, null, false),
  Row("Henry ", "B", "Jones", 66666, null, 20015464564L, true)
)

val simpleSchema = StructType(Array(
  StructField("firstname", StringType, true),
  StructField("middlename", StringType, true),
  StructField("lastname", StringType, true),
  StructField("zipcode", IntegerType, true),
  StructField("salary", DoubleType, true),
  StructField("account", LongType, true),
  StructField("isAlive", BooleanType, true)
))

val df = spark.createDataFrame(spark.sparkContext.parallelize(simpleData),simpleSchema)

The data source and its schema look as follows:

scala> df.printSchema()
root
|-- firstname: string (nullable = true)
|-- middlename: string (nullable = true)
|-- lastname: string (nullable = true)
|-- zipcode: integer (nullable = true)
|-- salary: double (nullable = true)
|-- account: long (nullable = true)
|-- isAlive: boolean (nullable = true)


scala> df.show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary|    account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
|      Jim|          |   Green|  33333|3000.12|19605466456|   true|
|      Tom|         A|   Smith|  44444|4000.45|19886546456|   null|
|   Jerry |      null|   Brown|   null|5000.67|       null|  false|
|   Henry |         B|   Jones|  66666|   null|20015464564|   true|
+---------+----------+--------+-------+-------+-----------+-------+

1. Drop rows containing NULL in any column (version 1).

Here only one row has no NULL in any column. (Jim's empty-string middlename counts as a value, not as NULL.)

scala> df.na.drop().show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary|    account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
|      Jim|          |   Green|  33333|3000.12|19605466456|   true|
+---------+----------+--------+-------+-------+-----------+-------+
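
For intuition, below is a roughly equivalent way to express this with a plain filter. This is only a sketch: the built-in na.drop() also treats NaN values in double/float columns as missing, which this simple version ignores.

import org.apache.spark.sql.functions.col

// Keep only the rows where every column is non-null
val noNullRows = df.filter(df.columns.map(c => col(c).isNotNull).reduce(_ && _))
noNullRows.show()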

2. Drop rows containing NULL in any column (version 2).

Same result as above; this is just another way to write it, with the "any" mode passed explicitly.

scala> df.na.drop("any").show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary|    account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
|      Jim|          |   Green|  33333|3000.12|19605466456|   true|
+---------+----------+--------+-------+-------+-----------+-------+

3. Drop rows containing NULL in all columns.

Here all rows are returned because no row has NULL in every column.

scala> df.na.drop("all").show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary|    account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
|      Jim|          |   Green|  33333|3000.12|19605466456|   true|
|      Tom|         A|   Smith|  44444|4000.45|19886546456|   null|
|   Jerry |      null|   Brown|   null|5000.67|       null|  false|
|   Henry |         B|   Jones|  66666|   null|20015464564|   true|
+---------+----------+--------+-------+-------+-----------+-------+
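
As a sketch of what "all" means here, this is roughly equivalent to keeping every row that has at least one non-null column (again ignoring the NaN handling that na.drop performs on double/float columns):

import org.apache.spark.sql.functions.col

// Keep rows where at least one column is non-null
df.filter(df.columns.map(c => col(c).isNotNull).reduce(_ || _)).show()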

4. Drop rows containing NULL in any of the specified column(s).

The Jerry and Henry rows are dropped here because each of them has a NULL in either salary or account.

scala> df.na.drop(Seq("salary","account")).show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary|    account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
|      Jim|          |   Green|  33333|3000.12|19605466456|   true|
|      Tom|         A|   Smith|  44444|4000.45|19886546456|   null|
+---------+----------+--------+-------+-------+-----------+-------+
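
A roughly equivalent filter on just those two columns would look like this (a sketch only, ignoring NaN handling):

import org.apache.spark.sql.functions.col

// Keep rows where both salary and account are non-null
df.filter(col("salary").isNotNull && col("account").isNotNull).show()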

5. Drop rows containing NULL in all of the specified column(s).

All rows are kept because no row has NULL in both salary and account at the same time.

scala> df.na.drop("all",Seq("salary","account")).show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary|    account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
|      Jim|          |   Green|  33333|3000.12|19605466456|   true|
|      Tom|         A|   Smith|  44444|4000.45|19886546456|   null|
|   Jerry |      null|   Brown|   null|5000.67|       null|  false|
|   Henry |         B|   Jones|  66666|   null|20015464564|   true|
+---------+----------+--------+-------+-------+-----------+-------+
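
This is roughly equivalent to dropping only the rows where salary and account are NULL at the same time (a sketch only, ignoring NaN handling):

import org.apache.spark.sql.functions.col

// Keep rows where salary and account are not both NULL
df.filter(col("salary").isNotNull || col("account").isNotNull).show()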

6. Drop rows containing fewer than minNonNulls non-null values.

In other words, we keep only the rows with at least minNonNulls non-null values.

Here I want to keep only the rows in which all 7 columns are non-null.

scala> df.na.drop(7).show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary|    account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
|      Jim|          |   Green|  33333|3000.12|19605466456|   true|
+---------+----------+--------+-------+-------+-----------+-------+
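
Conceptually, this counts the non-null values in each row and keeps the rows whose count is at least minNonNulls. A rough filter-based sketch of the same idea (ignoring NaN handling):

import org.apache.spark.sql.functions.{col, when}

// Count the non-null columns per row, then keep rows with at least 7
val nonNullCount = df.columns.map(c => when(col(c).isNotNull, 1).otherwise(0)).reduce(_ + _)
df.filter(nonNullCount >= 7).show()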

Note: Here is the Complete Sample Code.
