Wednesday, May 4, 2016

Drill Direct Scan optimization for count(*) queries on parquet files

Goal:

When running count(*) queries on parquet files, the physical plan may show below:
Scan(groupscan=[org.apache.drill.exec.store.pojo.PojoRecordReader@331f5e71...
This article is to explain what does "org.apache.drill.exec.store.pojo.PojoRecordReader" mean.

Root Cause:

For select count(*) or count( not-nullable-expr) queries on parquet files, Drill may do an optimization to read from Parquet metadata instead of reading the whole parquet file.
This logic is in code exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/ConvertCountToDirectScan.java:
/**
 * This rule will convert
 *   " select count(*)  as mycount from table "
 * or " select count( not-nullable-expr) as mycount from table "
 *   into
 *
 *    Project(mycount)
 *         \
 *    DirectGroupScan ( PojoRecordReader ( rowCount ))
 *
 * or
 *    " select count(column) as mycount from table "
 *    into
 *      Project(mycount)
 *           \
 *            DirectGroupScan (PojoRecordReader (columnValueCount))
 *
 * Currently, only parquet group scan has the exact row count and column value count,
 * obtained from parquet row group info. This will save the cost to
 * scan the whole parquet files.
 */
For example:
create table dfs.tmp.`prune2` as select 1 as id from hive.bigtable;
refresh table metadata dfs.tmp.`prune2`;
select count(*) from  dfs.tmp.`prune2` ;
The physical plan contains below line:
Scan(groupscan=[org.apache.drill.exec.store.pojo.PojoRecordReader@4b7000eb[columns = null, isStarQuery = false, isSkipQuery = false]])

1 comment:

  1. The major reason as to why people get into business is to profit and expand their economic status. In the field of SEO business, the rates of returns are promising. Niche

    ReplyDelete

Popular Posts