How does Drill 1.4 improve the performance of "limit 0" queries

Tuesday, January 12, 2016

How does Drill 1.4 improve the performance of "limit 0" queries

Env:

Drill 1.4

Goal:

How does Drill 1.4 improve the performance of "limit 0" queries?

Solution:

1. Background

BI tools like Tableau normally generates queries like "select * from sometable limit 0" to get the column names and types.
Before Drill 1.4, Drill has to scan the underline data physically to get this information. If there are too many files on HDFS, or too many partitions in Hive tables, the performance of "limit 0" queries will become much worse.
Starting from Drill 1.4, Drill can utilize the schema information of certain views or Hive tables to quickly return the column names and types, bypassing reading the underline data files physically.

2. Examples

For Hive tables, it is easy to understand that Drill only needs to check Hive metastore to fetch the column names and types;

For Views with proper "cast" for each returning column, Drill just needs to check the view definition file to fetch the column names and types. Here is one example:

create view dfs.myview as
SELECT CAST(`col1` AS INTEGER) AS `col1`, CAST(`col2` AS VARCHAR) AS `col2`
FROM `dfs`.`example/mydirectory`;

Then below "limit 0" queries can benefit:

select * from dfs.myview limit 0;

SELECT * FROM 
(SELECT SUM(1) AS `COL` FROM dfs.myview HAVING COUNT(1)>0) 
T LIMIT 0;

3. Explain plan

Take query "select * from dfs.myview limit 0" for example:
Before Drill 1.4, the visualized plan is:

Basically it has the Scan and Limit operator for "limit 0" query.
Starting from Drill 1.4, the visualized plan is:

If the "limit 0" query benefits from this feature, the keyword in the plan is "DrillDirectScanRel".
Note: In Drill 1.6, the visualized plan changed to:

If the "limit 0" query benefits from this feature, the keyword in the plan is "org.apache.drill.exec.planner.sql.handlers.FindLimit0Visitor$RelDataTypeReader".

4. Performance difference

This is not an official benchmark result and it is just based on my lab test. So below performance number may not apply to other environments.
The test query "select * from dfs.myview limit 0" is on a view on HDFS based on 20K 1MB parquet file.
Before Drill 1.4, it takes several minutes to finish, and the time will grow with the number of files;
With Drill 1.4, it only takes 1 second no matter how many files are on HDFS.

Tuesday, January 12, 2016