Goal:
How to improve the performance of Drill query with many in-lists.Env:
Drill 1.1Root Cause:
For Drill query like: select * from table where col in (1,2,3,...);The in-list evaluation is done in sequential. It means if you increase the number of in-list, the performance of that query could degrade linearly.
However if the number of in-list elements reaches 20(including 20), Drill can optimize this query to use a in-memory hash table to store the in-list elements, and then do a table join instead. This optimization can increase the performance a lot.
See the differences in below 2 plans.
a. No optimization
explain plan for
select count(1) from h1_passwords where cast(col2 as int) in (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19);
+------+------+
| text | json |
+------+------+
| 00-00 Screen
00-01 StreamAgg(group=[{}], EXPR$0=[COUNT()])
00-02 Project($f0=[1])
00-03 SelectionVectorRemover
00-04 Filter(condition=[OR(=(CAST($0):INTEGER, 1), =(CAST($0):INTEGER, 2), =(CAST($0):INTEGER, 3), =(CAST($0):INTEGER, 4), =(CAST($0):INTEGER, 5), =(CAST($0):INTEGER, 6), =(CAST($0):INTEGER, 7), =(CAST($0):INTEGER, 8), =(CAST($0):INTEGER, 9), =(CAST($0):INTEGER, 10), =(CAST($0):INTEGER, 11), =(CAST($0):INTEGER, 12), =(CAST($0):INTEGER, 13), =(CAST($0):INTEGER, 14), =(CAST($0):INTEGER, 15), =(CAST($0):INTEGER, 16), =(CAST($0):INTEGER, 17), =(CAST($0):INTEGER, 18), =(CAST($0):INTEGER, 19))])
00-05 Scan(groupscan=[HiveScan [table=Table(dbName:default, tableName:h1_passwords), inputSplits=[maprfs:///user/hive/warehouse/h1_passwords/passwd:0+1680], columns=[`col2`], partitions= null]])
b. With optimizationexplain plan for
select count(1) from h1_passwords where cast(col2 as int) in (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20);
+------+------+
| text | json |
+------+------+
| 00-00 Screen
00-01 StreamAgg(group=[{}], EXPR$0=[COUNT()])
00-02 Project($f0=[1])
00-03 Project(f6=[$1], ROW_VALUE=[$0])
00-04 MergeJoin(condition=[=($1, $0)], joinType=[inner])
00-06 SelectionVectorRemover
00-08 Sort(sort0=[$0], dir0=[ASC])
00-10 HashAgg(group=[{0}])
00-12 Values
00-05 SelectionVectorRemover
00-07 Sort(sort0=[$0], dir0=[ASC])
00-09 Project(f6=[CAST($0):INTEGER])
00-11 Scan(groupscan=[HiveScan [table=Table(dbName:default, tableName:h1_passwords), inputSplits=[maprfs:///user/hive/warehouse/h1_passwords/passwd:0+1680], columns=[`col2`], partitions= null]])
Solution:
Add duplicate in-list columns to make the number of in-list reach 20 to take advantage of this optimization.For example, change the where condition from:
"where col1 in (1,2,3)"
To:
"where col1 in (1,2,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)"
Improvement JIRA DRILL-3710 is filed to make number "20" configurable.
4245A78A14
ReplyDeletesteroid fiyatları
Görüntülü Şov
steroid al