Friday, September 2, 2016

How to override the compression codec set in Hive scripts

Goal:

Hive users may set a custom "mapred.output.compression.codec" (the deprecated name for "mapreduce.output.fileoutputformat.compress.codec") in the Hive CLI, Beeline, hive-site.xml, or even in Hive script files. There could be thousands of such script files.
This article explains how to override "mapred.output.compression.codec" globally without modifying the script files one by one.

Env:

Hive 1.2
Hadoop 2.7

Solution:

Suppose all Hive scripts currently use the LZO compression codec:
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
The Hadoop admin may want to override that codec. The three cases below cover the possible targets:

1. org.apache.hadoop.io.compress.SnappyCodec

Set the property with a <final> tag in mapred-site.xml on all nodes:
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  <final>true</final>
</property>
After that, the YARN job container log should show output like the following:
INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: New Final Path: FS maprfs:/user/mapr/tmp/hive/mapr/593a64b1-16a4-4e96-9e32-a0066e10309d/hive_2016-09-02_16-56-10_136_8088252876833149660-1/-mr-10000/.hive-staging_hive_2016-09-02_16-56-10_136_8088252876833149660-1/_tmp.-ext-10001/000000_1.snappy
INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.snappy]
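
Before rolling the change out to all nodes, it can help to confirm that the file really carries the <final> flag. The following is a minimal sketch in Python, not part of Hive or Hadoop; the function name lookup_property and the sample string are illustrative assumptions, and a real check would read the node's own mapred-site.xml instead of the inline sample.

```python
import xml.etree.ElementTree as ET

# Sample fragment mirroring the property above. A real mapred-site.xml
# wraps its properties in a <configuration> root element.
SAMPLE_MAPRED_SITE = """
<configuration>
  <property>
    <name>mapred.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
    <final>true</final>
  </property>
</configuration>
"""

def lookup_property(xml_text, name):
    """Return (value, is_final) for `name`, or None if it is absent."""
    root = ET.fromstring(xml_text.strip())
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == name:
            value = (prop.findtext("value") or "").strip()
            is_final = (prop.findtext("final") or "").strip().lower() == "true"
            return value, is_final
    return None

result = lookup_property(SAMPLE_MAPRED_SITE, "mapred.output.compression.codec")
print(result)
```

A property without a <final> element comes back with is_final set to False, which is the case where a script-level SET would still win.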

2. org.apache.hadoop.io.compress.DefaultCodec

Leave the value empty and add the <final> tag in mapred-site.xml on all nodes:
<property>
  <name>mapred.output.compression.codec</name>
  <value></value>
  <final>true</final>
</property>
After that, the YARN job container log should show output like the following:
INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: New Final Path: FS maprfs:/user/mapr/tmp/hive/mapr/94d62385-73ae-4770-b050-6f2f6b4f77ba/hive_2016-09-02_19-19-36_670_8404792977770566381-1/-mr-10000/.hive-staging_hive_2016-09-02_19-19-36_670_8404792977770566381-1/_tmp.-ext-10001/000000_0.deflate
INFO [main] org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.deflate]
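
In both cases so far, the suffix of the "New Final Path" file is the quickest sign of which codec was applied. A small sketch of that check in Python follows; the helper codec_from_final_path and the short paths in the usage lines are hypothetical, and the mapping only contains the two suffixes that appear in the logs above.

```python
import posixpath

# Suffixes taken from the "New Final Path" log lines shown above.
SUFFIX_TO_CODEC = {
    ".snappy": "org.apache.hadoop.io.compress.SnappyCodec",
    ".deflate": "org.apache.hadoop.io.compress.DefaultCodec",
}

def codec_from_final_path(path):
    """Guess the codec from the output file suffix; None if unrecognized."""
    _, ext = posixpath.splitext(posixpath.basename(path))
    return SUFFIX_TO_CODEC.get(ext)

# Hypothetical, shortened paths used only for illustration.
print(codec_from_final_path("/tmp/example/_tmp.-ext-10001/000000_1.snappy"))
print(codec_from_final_path("/tmp/example/_tmp.-ext-10001/000000_0.deflate"))
print(codec_from_final_path("/tmp/example/_tmp.-ext-10001/000000_0"))
```

A file name without a recognized suffix yields None, which is what an uncompressed output file would produce.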

3. No compression at all

We have to remove "SET hive.exec.compress.output=true;" from all Hive scripts.
After that, the YARN job container log should show output like the following (note that the output file has no compression suffix):
INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: New Final Path: FS maprfs:/user/mapr/tmp/hive/mapr/94d62385-73ae-4770-b050-6f2f6b4f77ba/hive_2016-09-02_19-23-58_054_204837380219085966-1/-mr-10000/.hive-staging_hive_2016-09-02_19-23-58_054_204837380219085966-1/_tmp.-ext-10001/000000_0
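
With thousands of scripts, the removal in this case is easier to do with a small tool than by hand. Below is a rough Python sketch that drops the statement from a script's text. It assumes the SET statement sits on its own line, and the function name strip_compress_output is made up for illustration; scripts that put several statements on one line would need a more careful approach.

```python
import re

# Matches a line that only enables Hive output compression, e.g.
#   SET hive.exec.compress.output=true;
COMPRESS_OUTPUT_LINE = re.compile(
    r"^\s*set\s+hive\.exec\.compress\.output\s*=\s*true\s*;\s*$",
    re.IGNORECASE,
)

def strip_compress_output(script_text):
    """Return the script text without the compress-output SET lines."""
    kept = [
        line
        for line in script_text.splitlines(keepends=True)
        if not COMPRESS_OUTPUT_LINE.match(line)
    ]
    return "".join(kept)

sample = (
    "SET hive.exec.compress.output=true;\n"
    "SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;\n"
    "SELECT 1;\n"
)
cleaned = strip_compress_output(sample)
print(cleaned)
```

To apply it to real files, one could read each script, pass its text through the function, and write the result back after taking a backup.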

Note: Even after mapred.output.compression.codec is overridden in mapred-site.xml, the Hive CLI or Beeline still shows the LZO codec. That is fine and can be ignored:
hive> set mapred.output.compression.codec;
mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec
This is because the effective value is the one marked final in mapred-site.xml.
The YARN job container log confirms this:
WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.output.fileoutputformat.compress.codec;  Ignoring.
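
The warning above can be pictured with a toy model of how a final parameter behaves. The Python sketch below is a simplified illustration and not Hadoop's actual Configuration code: the class ToyConfiguration is hypothetical, and it assumes that mapred-site.xml is loaded before job.xml, which is what the warning suggests.

```python
# Deprecated name -> current name, as stated earlier in this article.
ALIASES = {
    "mapred.output.compression.codec":
        "mapreduce.output.fileoutputformat.compress.codec",
}

class ToyConfiguration:
    """Simplified model of final-parameter handling; not Hadoop's code."""

    def __init__(self):
        self.values = {}
        self.final_keys = set()
        self.warnings = []

    def load(self, resource_name, properties):
        """Apply (name, value, is_final) triples from one resource in order."""
        for name, value, is_final in properties:
            key = ALIASES.get(name, name)
            if key in self.final_keys:
                # A later resource cannot change a parameter marked final.
                self.warnings.append(
                    f"{resource_name}:an attempt to override final parameter: {key};  Ignoring."
                )
                continue
            self.values[key] = value
            if is_final:
                self.final_keys.add(key)

conf = ToyConfiguration()
conf.load("mapred-site.xml", [
    ("mapred.output.compression.codec",
     "org.apache.hadoop.io.compress.SnappyCodec", True),
])
conf.load("job.xml", [
    ("mapred.output.compression.codec",
     "com.hadoop.compression.lzo.LzopCodec", False),
])
print(conf.values)
print(conf.warnings)
```

In this model the value from job.xml, which carries the script's SET, is dropped and only a warning is recorded, matching the behavior seen in the container log.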


