Goal:
This article shares some Scala example codes to explain how to replace Null values in DataFrame/Dataset.
Solution:
Note: As per the the code and API for org.apache.spark.sql, DataFrame is basically Dataset[Row].
So in the future, we are always checking the code or API for Dataset when researching on DataFrame/Dataset.
Dataset has an Untyped transformations named "na" which is DataFrameNaFunctions:
def na: DataFrameNaFunctions
DataFrameNaFunctions has methods named "fill" with different signatures to replace NULL values for different datatype columns.
Let's create a sample Dataframe firstly as the data source:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DoubleType, LongType, BooleanType}
val simpleData = Seq(Row("Jim","","Green",33333,3000.12,19605466456L,true),
Row("Tom","A","Smith",44444,4000.45,19886546456L,null),
Row("Jerry ",null,"Brown",null,5000.67,null,false),
Row("Henry ","B","Jones",66666,null,20015464564L,true)
)
val simpleSchema = StructType(Array(
StructField("firstname",StringType,true),
StructField("middlename",StringType,true),
StructField("lastname",StringType,true),
StructField("zipcode", IntegerType, true),
StructField("salary", DoubleType, true),
StructField("account", LongType, true),
StructField("isAlive", BooleanType, true)
))
val df = spark.createDataFrame(spark.sparkContext.parallelize(simpleData),simpleSchema)
Data source and its schema look as below:
scala> df.printSchema()
root
|-- firstname: string (nullable = true)
|-- middlename: string (nullable = true)
|-- lastname: string (nullable = true)
|-- zipcode: integer (nullable = true)
|-- salary: double (nullable = true)
|-- account: long (nullable = true)
|-- isAlive: boolean (nullable = true)
scala> df.show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary| account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
| Jim| | Green| 33333|3000.12|19605466456| true|
| Tom| A| Smith| 44444|4000.45|19886546456| null|
| Jerry | null| Brown| null|5000.67| null| false|
| Henry | B| Jones| 66666| null|20015464564| true|
+---------+----------+--------+-------+-------+-----------+-------+
1. Replace Null in ALL numeric columns.
Here it includes ALL IntegerType, DoubleType and LongType columns.
scala> df.na.fill(0).show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary| account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
| Jim| | Green| 33333|3000.12|19605466456| true|
| Tom| A| Smith| 44444|4000.45|19886546456| null|
| Jerry | null| Brown| 0|5000.67| 0| false|
| Henry | B| Jones| 66666| 0.0|20015464564| true|
+---------+----------+--------+-------+-------+-----------+-------+
2. Replaces Null in specified numeric columns.
For example, include only the numeric column named "account".
scala> df.na.fill(0,Array("account")).show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary| account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
| Jim| | Green| 33333|3000.12|19605466456| true|
| Tom| A| Smith| 44444|4000.45|19886546456| null|
| Jerry | null| Brown| null|5000.67| 0| false|
| Henry | B| Jones| 66666| null|20015464564| true|
+---------+----------+--------+-------+-------+-----------+-------+
3. Replace Null in ALL string columns.
scala> df.na.fill("").show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary| account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
| Jim| | Green| 33333|3000.12|19605466456| true|
| Tom| A| Smith| 44444|4000.45|19886546456| null|
| Jerry | | Brown| null|5000.67| null| false|
| Henry | B| Jones| 66666| null|20015464564| true|
+---------+----------+--------+-------+-------+-----------+-------+
4. Replace Null in ALL boolean columns.
scala> df.na.fill(true).show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary| account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
| Jim| | Green| 33333|3000.12|19605466456| true|
| Tom| A| Smith| 44444|4000.45|19886546456| true|
| Jerry | null| Brown| null|5000.67| null| false|
| Henry | B| Jones| 66666| null|20015464564| true|
+---------+----------+--------+-------+-------+-----------+-------+
Note: Here is the Complete Sample Code.
No comments:
Post a Comment