Thursday, January 28, 2021

Spark Code -- How to replace Null values in DataFrame/Dataset

Goal:

This article shares some Scala example codes to explain how to replace Null values in DataFrame/Dataset.

Solution:

Note: As per the the code and API for org.apache.spark.sql, DataFrame is basically Dataset[Row].

So in the future, we are always checking the code or API for Dataset when researching on DataFrame/Dataset.

Dataset has an Untyped transformations named "na" which is DataFrameNaFunctions:

def na: DataFrameNaFunctions 

DataFrameNaFunctions has methods named "fill" with different signatures to replace NULL values for different datatype columns.

Let's create a sample Dataframe firstly as the data source:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DoubleType, LongType, BooleanType}

val simpleData = Seq(Row("Jim","","Green",33333,3000.12,19605466456L,true),
Row("Tom","A","Smith",44444,4000.45,19886546456L,null),
Row("Jerry ",null,"Brown",null,5000.67,null,false),
Row("Henry ","B","Jones",66666,null,20015464564L,true)
)

val simpleSchema = StructType(Array(
StructField("firstname",StringType,true),
StructField("middlename",StringType,true),
StructField("lastname",StringType,true),
StructField("zipcode", IntegerType, true),
StructField("salary", DoubleType, true),
StructField("account", LongType, true),
StructField("isAlive", BooleanType, true)
))

val df = spark.createDataFrame(spark.sparkContext.parallelize(simpleData),simpleSchema)

Data source and its schema look as below:

scala> df.printSchema()
root
|-- firstname: string (nullable = true)
|-- middlename: string (nullable = true)
|-- lastname: string (nullable = true)
|-- zipcode: integer (nullable = true)
|-- salary: double (nullable = true)
|-- account: long (nullable = true)
|-- isAlive: boolean (nullable = true)


scala> df.show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary| account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
| Jim| | Green| 33333|3000.12|19605466456| true|
| Tom| A| Smith| 44444|4000.45|19886546456| null|
| Jerry | null| Brown| null|5000.67| null| false|
| Henry | B| Jones| 66666| null|20015464564| true|
+---------+----------+--------+-------+-------+-----------+-------+

1. Replace Null in ALL numeric columns.

Here it includes ALL IntegerType, DoubleType and LongType columns.

scala> df.na.fill(0).show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary| account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
| Jim| | Green| 33333|3000.12|19605466456| true|
| Tom| A| Smith| 44444|4000.45|19886546456| null|
| Jerry | null| Brown| 0|5000.67| 0| false|
| Henry | B| Jones| 66666| 0.0|20015464564| true|
+---------+----------+--------+-------+-------+-----------+-------+

2. Replaces Null in specified numeric columns.

For example, include only the numeric column named "account".

scala> df.na.fill(0,Array("account")).show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary| account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
| Jim| | Green| 33333|3000.12|19605466456| true|
| Tom| A| Smith| 44444|4000.45|19886546456| null|
| Jerry | null| Brown| null|5000.67| 0| false|
| Henry | B| Jones| 66666| null|20015464564| true|
+---------+----------+--------+-------+-------+-----------+-------+

3. Replace Null in ALL string columns.

scala> df.na.fill("").show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary| account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
| Jim| | Green| 33333|3000.12|19605466456| true|
| Tom| A| Smith| 44444|4000.45|19886546456| null|
| Jerry | | Brown| null|5000.67| null| false|
| Henry | B| Jones| 66666| null|20015464564| true|
+---------+----------+--------+-------+-------+-----------+-------+

4. Replace Null in ALL boolean columns.

scala> df.na.fill(true).show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary| account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
| Jim| | Green| 33333|3000.12|19605466456| true|
| Tom| A| Smith| 44444|4000.45|19886546456| true|
| Jerry | null| Brown| null|5000.67| null| false|
| Henry | B| Jones| 66666| null|20015464564| true|
+---------+----------+--------+-------+-------+-----------+-------+

 

Note: Here is the Complete Sample Code.

No comments:

Post a Comment

Popular Posts