Thursday, January 28, 2021

Spark Code -- How to replace Null values in DataFrame/Dataset

Goal:

This article shares some Scala example codes to explain how to replace Null values in DataFrame/Dataset.

Solution:

Note: As per the the code and API for org.apache.spark.sql, DataFrame is basically Dataset[Row].

So in the future, we are always checking the code or API for Dataset when researching on DataFrame/Dataset.

Dataset has an Untyped transformations named "na" which is DataFrameNaFunctions:

def na: DataFrameNaFunctions 

DataFrameNaFunctions has methods named "fill" with different signatures to replace NULL values for different datatype columns.

Let's create a sample Dataframe firstly as the data source:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DoubleType, LongType, BooleanType}

val simpleData = Seq(Row("Jim","","Green",33333,3000.12,19605466456L,true),
Row("Tom","A","Smith",44444,4000.45,19886546456L,null),
Row("Jerry ",null,"Brown",null,5000.67,null,false),
Row("Henry ","B","Jones",66666,null,20015464564L,true)
)

val simpleSchema = StructType(Array(
StructField("firstname",StringType,true),
StructField("middlename",StringType,true),
StructField("lastname",StringType,true),
StructField("zipcode", IntegerType, true),
StructField("salary", DoubleType, true),
StructField("account", LongType, true),
StructField("isAlive", BooleanType, true)
))

val df = spark.createDataFrame(spark.sparkContext.parallelize(simpleData),simpleSchema)

Data source and its schema look as below:

scala> df.printSchema()
root
|-- firstname: string (nullable = true)
|-- middlename: string (nullable = true)
|-- lastname: string (nullable = true)
|-- zipcode: integer (nullable = true)
|-- salary: double (nullable = true)
|-- account: long (nullable = true)
|-- isAlive: boolean (nullable = true)


scala> df.show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary| account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
| Jim| | Green| 33333|3000.12|19605466456| true|
| Tom| A| Smith| 44444|4000.45|19886546456| null|
| Jerry | null| Brown| null|5000.67| null| false|
| Henry | B| Jones| 66666| null|20015464564| true|
+---------+----------+--------+-------+-------+-----------+-------+

1. Replace Null in ALL numeric columns.

Here it includes ALL IntegerType, DoubleType and LongType columns.

scala> df.na.fill(0).show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary| account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
| Jim| | Green| 33333|3000.12|19605466456| true|
| Tom| A| Smith| 44444|4000.45|19886546456| null|
| Jerry | null| Brown| 0|5000.67| 0| false|
| Henry | B| Jones| 66666| 0.0|20015464564| true|
+---------+----------+--------+-------+-------+-----------+-------+

2. Replaces Null in specified numeric columns.

For example, include only the numeric column named "account".

scala> df.na.fill(0,Array("account")).show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary| account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
| Jim| | Green| 33333|3000.12|19605466456| true|
| Tom| A| Smith| 44444|4000.45|19886546456| null|
| Jerry | null| Brown| null|5000.67| 0| false|
| Henry | B| Jones| 66666| null|20015464564| true|
+---------+----------+--------+-------+-------+-----------+-------+

3. Replace Null in ALL string columns.

scala> df.na.fill("").show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary| account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
| Jim| | Green| 33333|3000.12|19605466456| true|
| Tom| A| Smith| 44444|4000.45|19886546456| null|
| Jerry | | Brown| null|5000.67| null| false|
| Henry | B| Jones| 66666| null|20015464564| true|
+---------+----------+--------+-------+-------+-----------+-------+

4. Replace Null in ALL boolean columns.

scala> df.na.fill(true).show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary| account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
| Jim| | Green| 33333|3000.12|19605466456| true|
| Tom| A| Smith| 44444|4000.45|19886546456| true|
| Jerry | null| Brown| null|5000.67| null| false|
| Henry | B| Jones| 66666| null|20015464564| true|
+---------+----------+--------+-------+-------+-----------+-------+

 

Note: Here is the Complete Sample Code.

1 comment:

  1. But it will make it very straightforward for a player to identify the worth he’s paying. There is a cumulative nature to a slot machine’s maintain, according to writer Natasha Dow Schüll. A 90% payback share doesn’t mean a player will only lose $10 occasion that they} bet 카지노사이트 $100. It means they may lose 10% every bet and progressively lose all of it. Choose amongst 40 exciting table video games, with pleasant sellers and an environment of unparalleled consolation. Try your hand at favorites such as Blackjack or Roulette, or take a shot at Craps, Head’s Up Hold’em, and Flop Poker.

    ReplyDelete

Popular Posts