Goal:
This article introduces the behavior of Hive outer join.Solution:
From Hive Outer Join Behavior, here are the definitions of Preserved Row table, Null Supplying table, During Join predicate and After Join predicate(Where predicate).Take a left outer join for example:
select t1.*,t2.name as name_t2 from t1 left outer join t2 on (t1.id=t2.id and t2.name='aa' ) where t1.name='a';Then
- Preserved Row table: t1.
- Null Supplying table: t2.
- During Join predicate: t2.name='aa'.
- After Join predicate(Where predicate): t1.name='a'.
- During Join predicates cannot be pushed past Preserved Row tables.
- After Join predicates cannot be pushed past Null Supplying tables.
To explorer the behaviors, let's create below 2 Hive tables with sample data:
create table t1 (id int, name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ","; create table t2 (id int, name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";Data for t1:
1,a 1,b 2,a 2,b 3,a 3,b 4,bData for t2:
1,aa 4,ddTo show a consistent plan using distributed join instead of map join, we disables hive.auto.convert.join in below tests.
set hive.auto.convert.join=false;
Case 1: Both tables have After Join predicates.
SQL:select t1.*,t2.name as name_t2
from t1 left outer join t2 on (t1.id=t2.id)
where t1.name='a' and t2.name='aa';
Result:1 a aaSQL plan:
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: t1
Statistics: Num rows: 0 Data size: 28 Basic stats: PARTIAL Column stats: NONE
Filter Operator
predicate: (name = 'a') (type: boolean)
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
Reduce Output Operator
key expressions: id (type: int)
sort order: +
Map-reduce partition columns: id (type: int)
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
value expressions: id (type: int), name (type: string)
TableScan
alias: t2
Statistics: Num rows: 0 Data size: 10 Basic stats: PARTIAL Column stats: NONE
Reduce Output Operator
key expressions: id (type: int)
sort order: +
Map-reduce partition columns: id (type: int)
Statistics: Num rows: 0 Data size: 10 Basic stats: PARTIAL Column stats: NONE
value expressions: name (type: string)
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col1}
1 {VALUE._col1}
outputColumnNames: _col0, _col1, _col5
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
Filter Operator
predicate: (_col5 = 'aa') (type: boolean)
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
Select Operator
expressions: _col0 (type: int), _col1 (type: string), _col5 (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Only After Join Predicate of Preserved Row table "t1" is pushed before the join.After Join Predicate of Null Supplying table "t2" is executed as a filter on the output of the join.
The logic is:
( (t1 where t1.name='a') left outer join t2 ) where t2.name='aa'
Case 2: Null Supplying table has During Join predicate, Preserved Row table has After Join predicate
SQL:select t1.*,t2.name as name_t2 from t1 left outer join t2 on (t1.id=t2.id and t2.name='aa' ) where t1.name='a';Result:
1 a aa 2 a NULL 3 a NULLSQL Plan:
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: t1
Statistics: Num rows: 0 Data size: 28 Basic stats: PARTIAL Column stats: NONE
Filter Operator
predicate: (name = 'a') (type: boolean)
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
Reduce Output Operator
key expressions: id (type: int)
sort order: +
Map-reduce partition columns: id (type: int)
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
value expressions: id (type: int), name (type: string)
TableScan
alias: t2
Statistics: Num rows: 0 Data size: 10 Basic stats: PARTIAL Column stats: NONE
Filter Operator
predicate: (name = 'aa') (type: boolean)
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
Reduce Output Operator
key expressions: id (type: int)
sort order: +
Map-reduce partition columns: id (type: int)
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
value expressions: name (type: string)
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col1}
1 {VALUE._col1}
outputColumnNames: _col0, _col1, _col5
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
Select Operator
expressions: _col0 (type: int), _col1 (type: string), _col5 (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Both predicates are pushed before the join.The logic is:
(t1 where t1.name='a') left outer join (t2 where t2.name='aa')
Case 3: Null Supplying table has After Join predicate, Preserved Row table has During Join predicate
SQL:select t1.*,t2.name as name_t2 from t1 left outer join t2 on (t1.id=t2.id and t1.name='a' ) where t2.name='aa';Result:
1 a aaSQL Plan:
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: t1
Statistics: Num rows: 0 Data size: 28 Basic stats: PARTIAL Column stats: NONE
Reduce Output Operator
key expressions: id (type: int)
sort order: +
Map-reduce partition columns: id (type: int)
Statistics: Num rows: 0 Data size: 28 Basic stats: PARTIAL Column stats: NONE
value expressions: id (type: int), name (type: string)
TableScan
alias: t2
Statistics: Num rows: 0 Data size: 10 Basic stats: PARTIAL Column stats: NONE
Reduce Output Operator
key expressions: id (type: int)
sort order: +
Map-reduce partition columns: id (type: int)
Statistics: Num rows: 0 Data size: 10 Basic stats: PARTIAL Column stats: NONE
value expressions: name (type: string)
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col1}
1 {VALUE._col1}
filter predicates:
0 {(VALUE._col1 = 'a')}
1
outputColumnNames: _col0, _col1, _col5
Statistics: Num rows: 0 Data size: 30 Basic stats: PARTIAL Column stats: NONE
Filter Operator
predicate: (_col5 = 'aa') (type: boolean)
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
Select Operator
expressions: _col0 (type: int), _col1 (type: string), _col5 (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Neither of the two predicates are pushed before the join.The logic is:
(t1 left outer join t2 ) where t1.name='a' and t2.name='aa'
Case 4: Both tables have During Join predicates.
SQL:select t1.*,t2.name as name_t2
from t1 left outer join t2 on
(t1.id=t2.id and t1.name='a' and t2.name='aa');
Result:1 a aa 1 b NULL 2 a NULL 2 b NULL 3 a NULL 3 b NULL 4 b NULLSQL Plan:
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: t1
Statistics: Num rows: 0 Data size: 28 Basic stats: PARTIAL Column stats: NONE
Reduce Output Operator
key expressions: id (type: int)
sort order: +
Map-reduce partition columns: id (type: int)
Statistics: Num rows: 0 Data size: 28 Basic stats: PARTIAL Column stats: NONE
value expressions: id (type: int), name (type: string)
TableScan
alias: t2
Statistics: Num rows: 0 Data size: 10 Basic stats: PARTIAL Column stats: NONE
Filter Operator
predicate: (name = 'aa') (type: boolean)
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
Reduce Output Operator
key expressions: id (type: int)
sort order: +
Map-reduce partition columns: id (type: int)
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
value expressions: name (type: string)
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col1}
1 {VALUE._col1}
filter predicates:
0 {(VALUE._col1 = 'a')}
1
outputColumnNames: _col0, _col1, _col5
Statistics: Num rows: 0 Data size: 30 Basic stats: PARTIAL Column stats: NONE
Select Operator
expressions: _col0 (type: int), _col1 (type: string), _col5 (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 0 Data size: 30 Basic stats: PARTIAL Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 0 Data size: 30 Basic stats: PARTIAL Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Only During Join Predicate of Null Supplying table "t2" is pushed before the join.During Join Predicate of Preserved Row table "t1" is ignored.
The logic is:
( t1 left outer join (t2 where t2.name='aa') )
Key takeaways:
- The SQL result depends on how you write the SQL.
- During Join predicates of Null Supplying tables can be pushed.
- After Join predicates of Preserved Row tables can be pushed.
- During Join Predicate of Preserved Row table "t1" is ignored.
- Sometimes if the predicates of both sides need to be pushed, please refer to case 2.
For example, imagine both "t1" and "t2" are partition tables based on partition key "name", you may want partition pruning to happen before join.
947F5AB51A
ReplyDeleteTakipçi Satın Al
Footer Link Satın Al
Telegram Coin Botları
İdle Office Tycoon Hediye Kodu
MFF Kupon Kodu