Thursday, November 27, 2014

Hive query runs out of heap memory during in-memory shuffle


A Hive query fails with an out-of-memory error in "shuffleInMemory":
Error: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutputFromFile( 
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput( 
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$

Root Cause:

When a Hive query reaches the shuffle phase of its MapReduce job, the map outputs are copied to the reducers.
The memory used for this copy is:
mapred.job.shuffle.input.buffer.percent(default is 0.70) * Max heap size(-Xmx in
The code currently does not check whether there is enough free heap memory for the shuffle phase, so the reducer may run out of heap space.
One case where this happens is a Hive query that selects many columns.
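To make the formula above concrete, here is a worked example. The 1024 MB maximum heap size is only an assumed figure for illustration; substitute the -Xmx value your reduce tasks actually run with.

```latex
\begin{align*}
\text{shuffle buffer} &= \texttt{mapred.job.shuffle.input.buffer.percent} \times \text{max heap size} \\
0.70 \times 1024\,\text{MB} &= 716.8\,\text{MB} \quad \text{(default setting, assumed 1024 MB heap)} \\
0.20 \times 1024\,\text{MB} &= 204.8\,\text{MB} \quad \text{(lowered setting, same assumed heap)}
\end{align*}
```

With the default, roughly 70% of the heap can be taken by shuffle data alone, which leaves little room for everything else the reducer needs.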


Solution:

1. Increase the maximum heap size (-Xmx) after fully understanding the memory usage on the whole cluster.

Please refer to the article "Five Steps to Avoiding Java Heap Space Errors" for details.
Do not blindly increase this memory setting, since doing so may cause other services or jobs to run out of memory.
In hive shell:

2. Decrease mapred.job.shuffle.input.buffer.percent from the default 0.70 to, for example, 0.20.

In hive shell:
set mapred.job.shuffle.input.buffer.percent=0.20;
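It can help to confirm what the session is actually using before and after the change. A minimal sketch, assuming the Hive CLI, where issuing set with only the property name prints its current value; the comment lines are for explanation and can be left out:

```sql
-- Show the value currently in effect for this session
set mapred.job.shuffle.input.buffer.percent;

-- Lower the shuffle buffer fraction for this session only
set mapred.job.shuffle.input.buffer.percent=0.20;

-- Confirm that the new value took effect
set mapred.job.shuffle.input.buffer.percent;
```

A value set this way applies only to the current session, so other jobs on the cluster keep their configured value.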
