Wednesday, May 4, 2016

Drill Direct Scan optimization for count(*) queries on parquet files


When running count(*) queries on parquet files, the physical plan may show a direct scan in place of the usual parquet scan.
This article explains what that means.

Root Cause:

For select count(*) or count(not-nullable-expr) queries on parquet files, Drill may optimize the query to read the counts from Parquet metadata instead of reading the whole parquet file.
This logic is in the code under exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/, where the rule is described as follows:
 * This rule will convert
 *   " select count(*)  as mycount from table "
 * or " select count( not-nullable-expr) as mycount from table "
 *   into
 *    Project(mycount)
 *         \
 *    DirectGroupScan ( PojoRecordReader ( rowCount ))
 * or
 *    " select count(column) as mycount from table "
 *    into
 *      Project(mycount)
 *           \
 *            DirectGroupScan (PojoRecordReader (columnValueCount))
 * Currently, only parquet group scan has the exact row count and column value count,
 * obtained from parquet row group info. This will save the cost to
 * scan the whole parquet files.
For example:
create table dfs.tmp.`prune2` as select 1 as id from hive.bigtable;
refresh table metadata dfs.tmp.`prune2`;
select count(*) from  dfs.tmp.`prune2` ;
The physical plan contains the line below:
Scan(groupscan=[[columns = null, isStarQuery = false, isSkipQuery = false]])
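The idea behind the rule can be illustrated with a small standalone sketch. This is not Drill's code: the class DirectCountSketch, the type RowGroupStats, and the methods countStar and countColumn are hypothetical names made up for illustration, and the statistics are hard-coded rather than read from a parquet footer. It only shows why the counts can be answered by summing per-row-group numbers without touching the data pages.

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch only; none of these names come from Drill.
public class DirectCountSketch {

    /** Hypothetical stand-in for the statistics kept per parquet row group. */
    public static final class RowGroupStats {
        final long rowCount;                        // total rows in the row group
        final Map<String, Long> columnValueCounts;  // non-null values per column

        public RowGroupStats(long rowCount, Map<String, Long> columnValueCounts) {
            this.rowCount = rowCount;
            this.columnValueCounts = columnValueCounts;
        }
    }

    /** count(*): sum the row counts of all row groups. */
    public static long countStar(List<RowGroupStats> groups) {
        long total = 0;
        for (RowGroupStats g : groups) {
            total += g.rowCount;
        }
        return total;
    }

    /** count(column): sum the per-column value counts of all row groups. */
    public static long countColumn(List<RowGroupStats> groups, String column) {
        long total = 0;
        for (RowGroupStats g : groups) {
            Long n = g.columnValueCounts.get(column);
            if (n == null) {
                // Assumption: without a value count the shortcut cannot be used.
                throw new IllegalStateException("no value count for " + column);
            }
            total += n;
        }
        return total;
    }

    public static void main(String[] args) {
        List<RowGroupStats> groups = List.of(
            new RowGroupStats(1000, Map.of("id", 1000L)),
            new RowGroupStats(500, Map.of("id", 480L)));
        System.out.println(countStar(groups));         // 1500
        System.out.println(countColumn(groups, "id")); // 1480
    }
}
```

In this sketch the cost depends on the number of row groups, not the number of rows, which is the saving the rule's comment refers to.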

