Goal:
This article shares some Scala example codes to explain how to drop Null values in DataFrame/Dataset.
Solution:
DataFrameNaFunctions has methods named "drop" with different signatures to drop NULL values under different scenarios.
Let's create a sample Dataframe firstly as the data source:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DoubleType, LongType, BooleanType}
val simpleData = Seq(Row("Jim","","Green",33333,3000.12,19605466456L,true),
Row("Tom","A","Smith",44444,4000.45,19886546456L,null),
Row("Jerry ",null,"Brown",null,5000.67,null,false),
Row("Henry ","B","Jones",66666,null,20015464564L,true)
)
val simpleSchema = StructType(Array(
StructField("firstname",StringType,true),
StructField("middlename",StringType,true),
StructField("lastname",StringType,true),
StructField("zipcode", IntegerType, true),
StructField("salary", DoubleType, true),
StructField("account", LongType, true),
StructField("isAlive", BooleanType, true)
))
val df = spark.createDataFrame(spark.sparkContext.parallelize(simpleData),simpleSchema)
Data source and its schema look as below:
scala> df.printSchema()
root
|-- firstname: string (nullable = true)
|-- middlename: string (nullable = true)
|-- lastname: string (nullable = true)
|-- zipcode: integer (nullable = true)
|-- salary: double (nullable = true)
|-- account: long (nullable = true)
|-- isAlive: boolean (nullable = true)
scala> df.show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary| account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
| Jim| | Green| 33333|3000.12|19605466456| true|
| Tom| A| Smith| 44444|4000.45|19886546456| null|
| Jerry | null| Brown| null|5000.67| null| false|
| Henry | B| Jones| 66666| null|20015464564| true|
+---------+----------+--------+-------+-------+-----------+-------+
1. Drop rows containing NULL in any columns.(version 1)
Here only one row does not have NULL in any columns.
scala> df.na.drop().show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary| account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
| Jim| | Green| 33333|3000.12|19605466456| true|
+---------+----------+--------+-------+-------+-----------+-------+
2. Drop rows containing NULL in any columns. (version 2)
Same as above. This is just another version.
scala> df.na.drop("any").show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary| account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
| Jim| | Green| 33333|3000.12|19605466456| true|
+---------+----------+--------+-------+-------+-----------+-------+
3. Drop rows containing NULL in all columns.
Here it shows all rows because there is no such all-NULL rows.
scala> df.na.drop("all").show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary| account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
| Jim| | Green| 33333|3000.12|19605466456| true|
| Tom| A| Smith| 44444|4000.45|19886546456| null|
| Jerry | null| Brown| null|5000.67| null| false|
| Henry | B| Jones| 66666| null|20015464564| true|
+---------+----------+--------+-------+-------+-----------+-------+
4. Drop rows containing NULL in any of specified column(s).
scala> df.na.drop(Seq("salary","account")).show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary| account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
| Jim| | Green| 33333|3000.12|19605466456| true|
| Tom| A| Smith| 44444|4000.45|19886546456| null|
+---------+----------+--------+-------+-------+-----------+-------+
5. Drop rows containing NULL in all of specified column(s).
scala> df.na.drop("all",Seq("salary","account")).show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary| account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
| Jim| | Green| 33333|3000.12|19605466456| true|
| Tom| A| Smith| 44444|4000.45|19886546456| null|
| Jerry | null| Brown| null|5000.67| null| false|
| Henry | B| Jones| 66666| null|20015464564| true|
+---------+----------+--------+-------+-------+-----------+-------+
6. Drop rows containing less than minNonNulls non-null values.
It means we keep the rows with at least minNonNulls non-null values.
Here I want to keep the rows with all 7 non-null values.
scala> df.na.drop(7).show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary| account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
| Jim| | Green| 33333|3000.12|19605466456| true|
+---------+----------+--------+-------+-------+-----------+-------+
Note: Here is the Complete Sample Code.
70726E37CC
ReplyDeletekiralık hacker
hacker arıyorum
kiralık hacker
hacker arıyorum
belek
C96D079DAB
ReplyDeleteTakipçi Satın Al
Whiteout Survival Hediye Kodu
Google Konum Ekleme
Lords Mobile Promosyon Kodu
MFF Kupon Kodu