Comments on Open Knowledge Base: Difference between hive.mapjoin.smalltable.filesize and hive.auto.convert.join.noconditionaltask.size

2024-06-25T17:38:58.476-07:00

Great post, thanks.

2016-03-08T17:40:11.887-08:00

Check my latest reply below.

2016-03-07T18:02:39.440-08:00

If you do not want a "conditional task", the only way is to make sure hive.auto.convert.join.noconditionaltask.size is large enough.
See the code in ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java:

// If sizes of at least n-1 tables in a n-way join is known, and their sum is smaller than
// the threshold size, convert the join into map-join and don't create a conditional task
boolean convertJoinMapJoin = HiveConf.getBoolVar(conf,
    HiveConf.ConfVars.HIVECONVERTJOINNOCONDITIONALTASK);

Then:
MapRedTask newTask = convertTaskToMapJoinTask(currTask.getWork(), bigTablePosition);

newTask.setTaskTag(Task.MAPJOIN_ONLY_NOBACKUP);
replaceTask(currTask, newTask, physicalContext);

After that, the code checks each individual table's size against hive.mapjoin.smalltable.filesize and adds any possible map join task to the "conditional task".
The reason they did not remove the "common join" is probably that the statistics may not be accurate at planning time. Hive wants to make the final decision at runtime according to the size of each table (or intermediate result).
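
The two checks described above can be put into a toy model. This is only a sketch of the decision order as described in this reply, not the actual CommonJoinTaskDispatcher code: the class JoinPlanSketch, the method choosePlan, the Plan enum, and all byte sizes are made up for illustration. It assumes hive.auto.convert.join and hive.auto.convert.join.noconditionaltask are both true, and that a table counts as a map join candidate when its own size is within hive.mapjoin.smalltable.filesize; the real dispatcher's size accounting and boundary comparisons may differ.

```java
import java.util.List;

// Hypothetical toy model; none of these names exist in Hive.
class JoinPlanSketch {
    enum Plan { MAPJOIN_ONLY_NOBACKUP, CONDITIONAL_TASK, COMMON_JOIN_ONLY }

    // tableSizes: known input sizes in bytes, one per table in the join.
    // noConditionalTaskSize: hive.auto.convert.join.noconditionaltask.size
    // smallTableFileSize:    hive.mapjoin.smalltable.filesize
    static Plan choosePlan(List<Long> tableSizes,
                           long noConditionalTaskSize,
                           long smallTableFileSize) {
        long total = 0;
        long largest = 0;
        for (long size : tableSizes) {
            total += size;
            largest = Math.max(largest, size);
        }
        // Step 1: if the n-1 smaller tables together fit under the threshold,
        // convert straight to a map join and create no conditional task.
        if (total - largest < noConditionalTaskSize) {
            return Plan.MAPJOIN_ONLY_NOBACKUP;
        }
        // Step 2 (assumed per-table check): if some table is small enough,
        // keep a map join candidate next to the common join backup.
        for (long size : tableSizes) {
            if (size <= smallTableFileSize) {
                return Plan.CONDITIONAL_TASK;
            }
        }
        // Otherwise only the common join is planned.
        return Plan.COMMON_JOIN_ONLY;
    }

    public static void main(String[] args) {
        // Illustrative sizes only: a 30,000,000-byte table joined to a 500,000,000-byte table.
        List<Long> sizes = List.of(30_000_000L, 500_000_000L);
        System.out.println(choosePlan(sizes, 10_000_000L, 31_000_000L)); // CONDITIONAL_TASK
        System.out.println(choosePlan(sizes, 10_000_000L, 25_000_000L)); // COMMON_JOIN_ONLY
        System.out.println(choosePlan(sizes, 40_000_000L, 31_000_000L)); // MAPJOIN_ONLY_NOBACKUP
    }
}
```

In this model the first call is the case where the explain plan shows both the common join stage and the map join stage, and raising the noconditionaltask threshold (third call) is what removes the conditional task.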

2016-03-06T12:08:44.905-08:00

Also, when we talk of the "conditional task", do we mean checking the size of the small table or checking the size of the n-1 tables/partitions? The reason I'm confused by these 2 parameters is that I do not understand the need for hive.auto.convert.join.noconditionaltask.size when hive.mapjoin.smalltable.filesize already serves the purpose in an n-way join.

~Abhilash

2016-03-06T12:02:24.490-08:00

Thanks for your prompt reply.

So why is it that in the 2-way join example (with hive.auto.convert.join.noconditionaltask.size=10000000 and hive.mapjoin.smalltable.filesize=31000000) the query plan output shows both stage 1 and stage 5? Shouldn't the plan show only stage 5?

Thanks,
Abhilash

2016-03-05T23:39:54.088-08:00

If you think like a query planner, of course, all plans should be considered.
However, the explain plan output may not print all possible plans, depending on the values of parameters such as hive.auto.convert.join.noconditionaltask.size and hive.mapjoin.smalltable.filesize.

Look at Plan a in the 2-way join: the map join plan is not printed simply because the smaller table's size (30MB) > hive.mapjoin.smalltable.filesize.

2016-03-05T23:31:15.924-08:00

Shouldn't all the plans created be the same for a particular query? In other words, doesn't the Hive driver create a permutation of all possible ways in which a query could be run, like the plan for the second query, which draws out plans for both a common join and a map join? Of course, only during execution will a certain path in the query plan (map join or common join) be executed. The actual selection of the path could be based on hive.auto.convert.join.noconditionaltask.size and hive.mapjoin.smalltable.filesize.

Also, I'm a little confused about the differences between the above properties. Could you elaborate a little? I couldn't find any help online; it's the same copy-pasted thing everywhere.

Thanks
Abhilash