Goal:How does Drill 1.4 improve the performance of "limit 0" queries?
1. BackgroundBI tools like Tableau normally generates queries like "select * from sometable limit 0" to get the column names and types.
Before Drill 1.4, Drill has to scan the underline data physically to get this information. If there are too many files on HDFS, or too many partitions in Hive tables, the performance of "limit 0" queries will become much worse.
Starting from Drill 1.4, Drill can utilize the schema information of certain views or Hive tables to quickly return the column names and types, bypassing reading the underline data files physically.
2. ExamplesFor Hive tables, it is easy to understand that Drill only needs to check Hive metastore to fetch the column names and types;
For Views with proper "cast" for each returning column, Drill just needs to check the view definition file to fetch the column names and types. Here is one example:
create view dfs.myview as SELECT CAST(`col1` AS INTEGER) AS `col1`, CAST(`col2` AS VARCHAR) AS `col2` FROM `dfs`.`example/mydirectory`;Then below "limit 0" queries can benefit:
select * from dfs.myview limit 0; SELECT * FROM (SELECT SUM(1) AS `COL` FROM dfs.myview HAVING COUNT(1)>0) T LIMIT 0;
3. Explain planTake query "select * from dfs.myview limit 0" for example:
Before Drill 1.4, the visualized plan is:
Starting from Drill 1.4, the visualized plan is:
Note: In Drill 1.6, the visualized plan changed to:
If the "limit 0" query benefits from this feature, the keyword in the plan is "org.apache.drill.exec.planner.sql.handlers.FindLimit0Visitor$RelDataTypeReader".
4. Performance differenceThis is not an official benchmark result and it is just based on my lab test. So below performance number may not apply to other environments.
The test query "select * from dfs.myview limit 0" is on a view on HDFS based on 20K 1MB parquet file.
Before Drill 1.4, it takes several minutes to finish, and the time will grow with the number of files;
With Drill 1.4, it only takes 1 second no matter how many files are on HDFS.