Open Knowledge Base: How to control the file numbers of hive table after inserting data on MapR-FS.

Hive table contains files in HDFS, if one table or one partition has too many small files, the HiveQL performance may be impacted.
Sometimes, it may take lots of time to prepare a MapReduce job before submitting it, since Hive needs to get the metadata from each file.
This article explains how to control the file numbers of hive table after inserting data on MapRFS; or simply saying, it explains how many files will be generated for "target" table by below HiveQL:

INSERT OVERWRITE TABLE target SELECT * FROM source;

Above HiveQL may have below 2 major steps:

1. MapReduce(In this example, Map only) job to read the data from "source" table.

The number of Mappers determines the number of intermediate files, and the number of Mappers is determined by below 3 factors:

a. hive.input.format

Different input formats may start different number of Mappers in this step.
Default value in Hive 0.13 is org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.
It will combine all files together and then try to split, so that it can improve the performance if the table has too many small files.
One old default value is org.apache.hadoop.hive.ql.io.HiveInputFormat which will split each file separately. Eg, If you have 10 small files and each file only has 1 row, Hive may spawn 10 mappers to read the whole table.
This article is using the default CombineHiveInputFormat as example.

b. File split size

mapred.max.split.size and mapred.min.split.size control the "target" file split size.
(In latest Hadoop 2.x, those 2 parameters are deprecated. New ones are mapreduce.input.fileinputformat.split.maxsize and mapreduce.input.fileinputformat.split.minsize).
For example, if one Hive table have one 1GB file, and the target split size is set to 100MB, 10 mappers MAY be spawned in this step. The reason of "MAY" is because of below factor c.

c. MapR-FS chunk size

Files in MapR-FS are split into chunks (similar to Hadoop blocks) that are normally 256 MB by default. Any multiple of 65,536 bytes is a valid chunk size.
The actual split size is max(target split size, chunk size).
Take above 1GB file with 100MB "target" split size for example, if the chunk size is 200MB, then the actual split size is 200MB, 5 mappers spawned; if the chunk size is 50MB, then the actual split size is 100MB, 10 mappers spawned.

Lab time:

Imagine here we have prepared 3 hive tables with the same size -- 644MB and only 1 file for each table.
The only difference is the chunk size of the 3 hive tables.
source -- chunk size=8GB.
source2 -- chunk size=256MB(default in mfs).
source3 -- chunk size=64k(Minimum).

# hadoop mfs -ls /user/hive/warehouse/|grep -i source
drwxr-xr-x Z U   - root root          1 2014-12-04 11:22 8589934592 /user/hive/warehouse/source
drwxr-xr-x Z U   - root root          1 2014-12-04 11:31  268435456 /user/hive/warehouse/source2
drwxr-xr-x Z U   - root root          1 2014-12-04 12:24      65536 /user/hive/warehouse/source3

Then the question is how many mappers will be spawned for below INSERT, after setting target split size to 100MB?

set mapred.max.split.size=104857600;
set mapred.min.split.size=104857600;
INSERT OVERWRITE TABLE target SELECT * FROM <source table>;

Results:
1. Table "source"
The whole table 644MB is in 1 chunk(8G each), so only 1 mapper.
2. Table "source2"
The whole table 644MB is in 3 chunks(256MB each), so 3 mappers.
3. Table "source3"
The whole table 644MB is in more than 10000 chunks(64KB each), and target split size(100MB) is larger than each chunk size 100MB, so 7 mappers.

Thinking accordingly, if the target split size is 10GB, for all above 3 tables, only 1 mapper will be spawned in the 1st step; If the target split size is 1MB, the mappers counts are : source(1), source2(3), source3(645).
After figuring out the 1st step, let's move to the 2nd step.

2. Small file merge MapReduce job

After the 1st MapReduce job finishes, Hive will decide if it needs to start another MapReduce job to merge the intermediate files. If small file merge is disabled, the number of target table files is the same as the number of mappers from 1st MapReduce job. Below 4 parameters determine if and how Hive does small file merge.

hive.merge.mapfiles -- Merge small files at the end of a map-only job.
hive.merge.mapredfiles -- Merge small files at the end of a map-reduce job.
hive.merge.size.per.task -- Size of merged files at the end of the job.
hive.merge.smallfiles.avgsize -- When the average output file size of a job is less than this number, Hive will start an additional map-reduce job to merge the output files into bigger files. This is only done for map-only jobs if hive.merge.mapfiles is true, and for map-reduce jobs if hive.merge.mapredfiles is true.

By default hive.merge.smallfiles.avgsize=16000000 and hive.merge.size.per.task=256000000, so if the average file size is about 17MB, the merge job will not be triggered. Sometimes if we really want only 1 file being generated in the end, we need to increase hive.merge.smallfiles.avgsize to large enough to trigger the merge; and also you need to increase hive.merge.size.per.task to the get the needed number of files in the end.

Quiz time:

In hive 0.13 on MapR-FS with all default configurations, how many files will be generated in the end for below HiveQL? Have a guess on the file size in the end?
Reminder: chunk size for "source3" is 64KB.
1.

set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=104857600;
set hive.merge.size.per.task=209715200;
set mapred.max.split.size=68157440;
set mapred.min.split.size=68157440;
INSERT OVERWRITE TABLE target SELECT * FROM source3;

set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=283115520;
set hive.merge.size.per.task=209715200;
set mapred.max.split.size=68157440;
set mapred.min.split.size=68157440;
INSERT OVERWRITE TABLE target SELECT * FROM source3;

Answers:
1. Target split size is 65MB and chunk size is only 64KB, so 1st job will spawn 10 mappers and each mapper will generate one 65MB intermediate file.
Merge job will be triggered because average file size from previous job is less than 100MB(hive.merge.smallfiles.avgsize).
For each task, to achieve file size 200MB(hive.merge.size.per.task), 4 x 65MB files will be merged into one 260MB file.
So in the end, 3 files will be generated for target table -- 644MB = 260MB+260MB+124MB.

[root@n2a warehouse]# ls -altr target
total 659725
-rwxr-xr-x  1 root root 130296036 Dec  5 17:26 000002_0
-rwxr-xr-x  1 root root 272629772 Dec  5 17:26 000001_0
drwxr-xr-x  2 root root         3 Dec  5 17:26 .
-rwxr-xr-x  1 root root 272629780 Dec  5 17:26 000000_0
drwxr-xr-x 38 mapr mapr        37 Dec  5 17:26 ..

2. Target split size is 65MB and chunk size is only 64KB, so 1st job will spawn 10 mappers and each mapper will generate one 65MB intermediate file.
Merge job will be triggered because average file size from previous job is less than 270MB(hive.merge.smallfiles.avgsize).
For each task, to achieve file size 200MB(hive.merge.size.per.task), 4 x 65MB files *should* be merged into one 260MB file. However if so, the average file size is still less than 270MB(hive.merge.smallfiles.avgsize), so they are still considered as "small files".
In this case, 5 x 65MB files are merged into one 325MB file.
So in the end, 2 files will be generated for table table -- 644MB = 325MB+319MB.

[root@n1a warehouse]# ls -altr target
total 659724
-rwxr-xr-x  1 root root 334768396 Dec  8 10:46 000001_0
drwxr-xr-x  2 root root         2 Dec  8 10:46 .
-rwxr-xr-x  1 root root 340787192 Dec  8 10:46 000000_0
drwxr-xr-x 38 mapr mapr        37 Dec  8 10:46 ..

Key takeaways:

1. MapR-FS chunk size and target split size determine the number of mappers and the number of intermediate files.
2. Small file merge job controls the final number of files for target table.
3. Too many small files for one table may introduce job performance overhead.

Tuesday, December 9, 2014

How to control the file numbers of hive table after inserting data on MapR-FS.