Thursday, April 16, 2015

Pig job fails with ApplicationMaster OutOfMemoryError when writing parquet files.


Pig 0.13 on YARN


  • A Pig job that reads and writes many Parquet files fails with an ApplicationMaster OutOfMemoryError in the final commitJob phase.
  • All mappers and reducers finish successfully.
  • The ApplicationMaster container log shows the stack trace below: Setting job diagnostics to Job commit failed: java.lang.reflect.InvocationTargetException
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.commitJob(
        at java.util.concurrent.ThreadPoolExecutor.runWorker(
        at java.util.concurrent.ThreadPoolExecutor$
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(
        at java.lang.reflect.Method.invoke(
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.commitJob(
        ... 5 more
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.lang.StringCoding$StringEncoder.encode(
        at java.lang.StringCoding.encode(
        at java.lang.String.getBytes(
        at parquet.format.ColumnChunk.write(
        at parquet.format.RowGroup.write(
        at parquet.format.FileMetaData.write(
        at parquet.format.Util.write(
        at parquet.format.Util.writeFileMetaData(
        at parquet.hadoop.ParquetFileWriter.serializeFooter(
        at parquet.hadoop.ParquetFileWriter.writeMetadataFile(
        at parquet.hadoop.ParquetOutputCommitter.commitJob(
        ... 10 more

Root Cause:

Pig uses a Parquet jar to read and write Parquet files; the jar's source code comes from the parquet-mr project on GitHub.
The relevant logic lives in ParquetOutputCommitter: during the commitJob() phase, the ApplicationMaster calls ParquetOutputCommitter.commitJob().
It first reads the footers of all the output Parquet files in parallel:
List<Footer> footers = ParquetFileReader.readAllFootersInParallel(configuration, outputStatus);
and then writes the combined metadata into a file named "_metadata" in the output directory:
ParquetFileWriter.writeMetadataFile(configuration, outputPath, footers);
If the output Parquet files have a large schema and the number of files is huge, the ApplicationMaster needs a large amount of memory during the commitJob phase.
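To see why this scales badly, here is a minimal, stdlib-only Java simulation of the pattern: commitJob holds every file's footer in memory at once before writing _metadata, so retained heap grows with (number of files) x (schema size). FakeFooter and FooterMemoryDemo are hypothetical names for illustration, not parquet-mr classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Parquet footer: each one retains the
// full schema string of its file.
class FakeFooter {
    final String schema;
    FakeFooter(String schema) { this.schema = schema; }
}

public class FooterMemoryDemo {
    // Rough lower bound on heap retained by the footer list:
    // each char in a Java String occupies at least 2 bytes.
    static long estimateRetainedBytes(int numFiles, int schemaChars) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < schemaChars; i++) sb.append('c');
        String schema = sb.toString();

        List<FakeFooter> footers = new ArrayList<>();
        for (int i = 0; i < numFiles; i++) {
            // commitJob keeps every footer alive at once before
            // serializing the merged _metadata file.
            footers.add(new FakeFooter(new String(schema)));
        }
        return (long) footers.size() * schemaChars * 2L;
    }

    public static void main(String[] args) {
        // e.g. 10,000 output files x 100,000-character schema each
        long bytes = estimateRetainedBytes(10_000, 100_000);
        System.out.println(bytes / (1024 * 1024) + " MB"); // prints "1907 MB"
    }
}
```

With the illustrative numbers above, the footer list alone already exceeds the default 1024 MB ApplicationMaster heap.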
By default, the ApplicationMaster's memory configurations are yarn.app.mapreduce.am.resource.mb=1536 and yarn.app.mapreduce.am.command-opts=-Xmx1024m.
If that memory is not enough, the ApplicationMaster fails with the OOM error above.


Solution:

If the _metadata file is needed, just increase yarn.app.mapreduce.am.resource.mb and yarn.app.mapreduce.am.command-opts to large enough values.
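For example, the ApplicationMaster memory can be raised directly from the Pig script (the property names are the standard MapReduce AM settings; the values below are illustrative, not a recommendation):

```pig
-- Illustrative sizes; tune to your footer volume.
set yarn.app.mapreduce.am.resource.mb 4096;
set yarn.app.mapreduce.am.command-opts '-Xmx3276m';
```

Keep -Xmx comfortably below the container size (here roughly 80% of 4096 MB) so the JVM fits inside the YARN container.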

Otherwise, starting from Parquet 1.6.0 (per PARQUET-107), the configuration "parquet.enable.summary-metadata" was introduced to enable or disable metadata generation in the commitJob phase. So just run the command below to disable metadata generation:
set parquet.enable.summary-metadata false;
Note: please make sure parquet-pig-bundle-<version>.jar is compiled from Parquet 1.6.0 source code or above. For example, Twitter publishes compiled parquet-pig-bundle jars here:
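A minimal sketch of the workaround in context (the paths and relation name are hypothetical; ParquetLoader and ParquetStorer come from the parquet-pig bundle):

```pig
-- Skip _metadata generation during commitJob (requires Parquet >= 1.6.0).
set parquet.enable.summary-metadata false;

raw = LOAD '/data/input' USING parquet.pig.ParquetLoader();
STORE raw INTO '/data/output' USING parquet.pig.ParquetStorer();
```

With summary metadata disabled, readers fall back to reading each file's own footer, so downstream jobs still work; only the aggregated _metadata file is skipped.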
