Friday, May 8, 2015

How to avoid reducer skew for "Group-By" queries in Hive

Env:

Hive 0.13

Symptom:

A "Group-By" query has heavy skew on one reducer.
For example, even if the number of reducers is set to 100 using the commands below, one reducer takes hours to finish while the other reducers take only seconds or minutes.
MRv1:
set mapred.reduce.tasks=100;
MRv2:
set mapreduce.job.reduces=100;

In the MR job statistics on the JobTracker or ResourceManager web UI, the "REDUCE_INPUT_RECORDS" counter is much higher for that reducer than for the others.

Root Cause:

By default, Hive sends all rows with the same group-by key to the same reducer.
If the values of the group-by columns are skewed, that is, a few distinct values account for most of the rows, one reducer may receive most of the shuffled data.
That reducer then takes much longer to finish than the others.
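
To make the root cause concrete, here is a small Python simulation (not Hive code). It assumes a made-up dataset in which one key, "hot", holds 90% of the rows, and it uses a CRC32 hash modulo the reducer count as a stand-in for Hive's partitioning. The names rows, reducer_for and NUM_REDUCERS are hypothetical.

```python
import zlib
from collections import Counter

NUM_REDUCERS = 100

# Hypothetical skewed input: the key "hot" accounts for 90% of all rows.
rows = ["hot"] * 90_000 + [f"key{i}" for i in range(10_000)]

def reducer_for(key: str, num_reducers: int = NUM_REDUCERS) -> int:
    """Stand-in for hash partitioning: the same key always maps to the same reducer."""
    return zlib.crc32(key.encode()) % num_reducers

# Count how many input records each reducer would receive.
load = Counter(reducer_for(key) for key in rows)
busiest, busiest_rows = load.most_common(1)[0]

print(f"busiest reducer {busiest} receives {busiest_rows} of {len(rows)} rows")
```

However many reducers are configured, every "hot" row lands on the same one, which is why raising the reducer count alone does not help.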

Solution:

Set the configuration below so that Hive triggers an additional MapReduce job whose map output is distributed randomly across the reducers to avoid data skew.
set hive.groupby.skewindata=true;
After setting it, the reducers' statistics should show that the data is evenly distributed across the reducers.
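
The idea behind the extra job can be sketched as a two-stage aggregation. The Python below is only a simulation under assumptions, not Hive's implementation: it assumes the first stage spreads rows randomly and computes partial counts, and the second stage partitions those partial results by key and merges them. The dataset and all variable names are hypothetical.

```python
import random
import zlib
from collections import Counter

NUM_REDUCERS = 100

# Hypothetical skewed input: the key "hot" accounts for 90% of all rows.
rows = ["hot"] * 90_000 + [f"key{i}" for i in range(10_000)]
rng = random.Random(0)  # fixed seed so the run is repeatable

# Stage 1: send each row to a random reducer, which keeps partial counts per key.
stage1_partials = [Counter() for _ in range(NUM_REDUCERS)]
for key in rows:
    stage1_partials[rng.randrange(NUM_REDUCERS)][key] += 1
stage1_load = [sum(partial.values()) for partial in stage1_partials]

# Stage 2: partition the (key, partial count) records by key hash and merge them.
stage2_input = [[] for _ in range(NUM_REDUCERS)]
for partial in stage1_partials:
    for key, count in partial.items():
        stage2_input[zlib.crc32(key.encode()) % NUM_REDUCERS].append((key, count))
stage2_load = [len(bucket) for bucket in stage2_input]

final_counts = Counter()
for bucket in stage2_input:
    for key, count in bucket:
        final_counts[key] += count

print("max stage-1 reducer input:", max(stage1_load))
print("max stage-2 reducer input:", max(stage2_load))
```

In this sketch the hot key reaches the second stage as at most one partial record per first-stage reducer, so no single reducer has to process all of its rows. The price is one more pass over the shuffled data.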
