Thursday, July 19, 2018

Drill query fails with "AGGR OOM at First Phase" when doing Hash Aggregate

Symptom:

Drill query fails with "AGGR OOM at First Phase" when doing Hash Aggregate.
Sample error message or stacktrace is:
2018-01-01 11:11:11,111 [xxx:frag:5:6] INFO  o.a.d.e.w.fragment.FragmentExecutor - User Error Occurred: One or more nodes ran out of memory while executing the query. (AGGR OOM at First Phase. Partitions: 1. Estimated batch size: 57655296. values size: 65536. Output alloc size: 65536 Memory limit: 41943040 so far allocated: 262144. )
org.apache.drill.common.exceptions.UserException: RESOURCE ERROR: One or more nodes ran out of memory while executing the query.

AGGR OOM at First Phase. Partitions: 1. Estimated batch size: 57655296. values size: 65536. Output alloc size: 65536 Memory limit: 41943040 so far allocated: 262144.

[Error Id: yyy ]
        at org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:633) ~[drill-common-1.13.0-mapr.jar:1.13.0-mapr]
        at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:243) [drill-java-exec-1.13.0-mapr.jar:1.13.0-mapr]
        at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) [drill-common-1.13.0-mapr.jar:1.13.0-mapr]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_171]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_171]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_171]
Caused by: org.apache.drill.exec.exception.OutOfMemoryException: AGGR OOM at First Phase. Partitions: 1. Estimated batch size: 57655296. values size: 65536. Output alloc size: 65536 Memory limit: 41943040 so far allocated: 262144.
        at org.apache.drill.exec.test.generated.HashAggregatorGen5.spillIfNeeded(HashAggTemplate.java:1419) ~[na:na]
        at org.apache.drill.exec.test.generated.HashAggregatorGen5.doSpill(HashAggTemplate.java:1381) ~[na:na]
        at org.apache.drill.exec.test.generated.HashAggregatorGen5.checkGroupAndAggrValues(HashAggTemplate.java:1281) ~[na:na]
        at org.apache.drill.exec.test.generated.HashAggregatorGen5.doWork(HashAggTemplate.java:592) ~[na:na]
        at org.apache.drill.exec.physical.impl.aggregate.HashAggBatch.innerNext(HashAggBatch.java:176) ~[drill-java-exec-1.13.0-mapr.jar:1.13.0-mapr]
        at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164) ~[drill-java-exec-1.13.0-mapr.jar:1.13.0-mapr]
        at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119) ~[drill-java-exec-1.13.0-mapr.jar:1.13.0-mapr]
        at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109) ~[drill-java-exec-1.13.0-mapr.jar:1.13.0-mapr]
        at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51) ~[drill-java-exec-1.13.0-mapr.jar:1.13.0-mapr]
        at org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:134) ~[drill-java-exec-1.13.0-mapr.jar:1.13.0-mapr]
        at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164) ~[drill-java-exec-1.13.0-mapr.jar:1.13.0-mapr]
        at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:105) ~[drill-java-exec-1.13.0-mapr.jar:1.13.0-mapr]
        at org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext(SingleSenderCreator.java:93) ~[drill-java-exec-1.13.0-mapr.jar:1.13.0-mapr]
        at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:95) ~[drill-java-exec-1.13.0-mapr.jar:1.13.0-mapr]
        at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:233) ~[drill-java-exec-1.13.0-mapr.jar:1.13.0-mapr]
        at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:226) ~[drill-java-exec-1.13.0-mapr.jar:1.13.0-mapr]
        at java.security.AccessController.doPrivileged(Native Method) ~[na:1.8.0_171]
        at javax.security.auth.Subject.doAs(Subject.java:422) ~[na:1.8.0_171]
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1633) ~[hadoop-common-2.7.0-mapr-1710.jar:na]
        at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:226) [drill-java-exec-1.13.0-mapr.jar:1.13.0-mapr]

Env:

Drill 1.13

Root Cause:

The parameter "planner.memory.min_memory_per_buffered_op" was introduced in Drill 1.11 as part of the spill-to-disk feature for Hash Join and Hash Aggregate, per https://issues.apache.org/jira/browse/DRILL-5669.
It enforces a minimum memory allocation for each buffered operator. The default is 40MB.
In the error message above, the estimated batch size is 57655296 bytes (about 55MB), which is larger than the memory limit of 41943040 bytes, i.e. the default 40MB setting of "planner.memory.min_memory_per_buffered_op".
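
To check the same condition on other occurrences of this error, the two numbers can be pulled out of the message and compared. The following is a minimal sketch, not part of Drill: the helper names `parse_aggr_oom` and `to_mib` are hypothetical, and the regular expressions assume the message keeps the exact wording shown in the sample above.

```python
import re

# Error text copied from the sample log above.
MESSAGE = (
    "AGGR OOM at First Phase. Partitions: 1. Estimated batch size: 57655296. "
    "values size: 65536. Output alloc size: 65536 Memory limit: 41943040 "
    "so far allocated: 262144."
)


def parse_aggr_oom(message):
    """Return (estimated_batch_size, memory_limit) in bytes, or None if not found.

    Hypothetical helper: it relies only on the wording of the sample message.
    """
    estimated = re.search(r"Estimated batch size:\s*(\d+)", message)
    limit = re.search(r"Memory limit:\s*(\d+)", message)
    if not (estimated and limit):
        return None
    return int(estimated.group(1)), int(limit.group(1))


def to_mib(num_bytes):
    """Convert bytes to mebibytes (1 MiB = 1048576 bytes)."""
    return num_bytes / (1024 * 1024)


estimated, limit = parse_aggr_oom(MESSAGE)
print("estimated batch size: %.1f MiB" % to_mib(estimated))
print("memory limit:         %.1f MiB" % to_mib(limit))
print("batch exceeds limit:  %s" % (estimated > limit))
```

If the last line reports True, a single estimated batch does not fit into the operator's memory limit, which is the situation described in this post.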

Solution:

Increase planner.memory.min_memory_per_buffered_op to a value larger than the estimated batch size, for example 67108864 (64MB).
You may also need to increase the Java direct memory of each drillbit, because this query may use a large amount of memory.
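
One way to pick the new value is to round the estimated batch size from the error message up to the next power of two, which gives 67108864 for this case. The sketch below builds the corresponding option statement as a string. The rounding rule and the helper names `next_power_of_two` and `build_alter_statement` are assumptions made for illustration, and it assumes the option is changed with Drill's ALTER SYSTEM SET syntax; the statement still has to be run from a Drill client such as sqlline.

```python
OPTION = "planner.memory.min_memory_per_buffered_op"


def next_power_of_two(n):
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def build_alter_statement(estimated_batch_size, scope="SYSTEM"):
    """Build an ALTER ... SET statement sized for the given batch estimate.

    Hypothetical helper: rounding up to a power of two is only a heuristic,
    not a rule taken from the Drill documentation.
    """
    value = next_power_of_two(estimated_batch_size)
    return "ALTER {} SET `{}` = {}".format(scope, OPTION, value)


# 57655296 is the "Estimated batch size" from the sample error message.
print(build_alter_statement(57655296))
```

Passing scope="SESSION" produces a statement that limits the change to the current session instead of the whole cluster, assuming the option can be set at session level in your Drill version.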

In the upcoming Drill 1.14 release, many operators will ensure that the batches they emit do not exceed 16MB (the current default) in size, packing fewer rows into a batch when it would otherwise go over that limit.
As a result, upgrading to Drill 1.14 can help alleviate this issue.
