Wednesday, April 22, 2015

Huge Pig job causes the local /tmp directory to run out of disk space




Pig 0.13

Root cause:

Per PIG-1838, Pig keeps the jar file for each job until the Pig script finishes.
This means that if a single Pig script contains many MapReduce jobs, Pig creates many jar files in the /tmp directory on the node where the job is submitted. Pig cleans up these temp jars only after the whole script finishes.
For example, the Pig job below keeps 2 jars in /tmp until the whole job finishes, because it contains 2 MapReduce jobs.
a = load '/dir' using ParquetLoader();
b = order a by price ;
STORE b INTO '/output' USING parquet.pig.ParquetStorer; 
While the job runs, these 2 temp jars are present in /tmp.

If we put two copies of the above job into one Pig script, Pig keeps 4 temp jars in /tmp.
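
To see how many of these jars have piled up while a script runs, a small checker can help. The following is a minimal sketch and not part of Pig: it assumes the temp jars sit directly under the given directory and match the name pattern Job*.jar, which follows from the "Job" prefix and ".jar" suffix passed to createTempFile() in the Pig source quoted further down in this post. The class name TempJarUsage and its summarize() method are hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class TempJarUsage {
    /** Returns {count, totalBytes} for regular files named Job*.jar directly under dir. */
    static long[] summarize(Path dir) throws IOException {
        long count = 0;
        long bytes = 0;
        // The glob mirrors File.createTempFile("Job", ".jar"): prefix "Job", suffix ".jar".
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(dir, "Job*.jar")) {
            for (Path jar : jars) {
                if (Files.isRegularFile(jar)) {
                    count++;
                    bytes += Files.size(jar);
                }
            }
        }
        return new long[] {count, bytes};
    }

    public static void main(String[] args) throws IOException {
        // Defaults to /tmp, the directory discussed in this post.
        Path dir = Paths.get(args.length > 0 ? args[0] : "/tmp");
        long[] s = summarize(dir);
        System.out.printf("%d temp jar(s), %d bytes in %s%n", s[0], s[1], dir);
    }
}
```

Running it repeatedly on the submitting node while a long script executes should show the count growing by one per launched MapReduce job. Other programs may also create files matching this pattern, so treat the numbers as an upper bound.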

Source Code analysis:
The logic is in the Pig source code --, which calls the File.createTempFile() function:
File submitJarFile = File.createTempFile("Job", ".jar");
// followed by a log call with the message: "creating jar file "+submitJarFile.getName()
Per the Java source code --, the directory location is controlled by:
// inside File.createTempFile(prefix, suffix, directory):
File tmpdir = (directory != null) ? directory : TempDirectory.location();

// inside the TempDirectory helper class:
    private TempDirectory() { }

    // temporary directory location
    private static final File tmpdir = new File(fs.normalize(AccessController
        .doPrivileged(new GetPropertyAction(""))));
    static File location() {
        return tmpdir;
    }
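
The JDK documentation for File.createTempFile(prefix, suffix) states that, when no directory is given, the file goes into the default temporary-file directory named by the java.io.tmpdir system property. The sketch below is a small check of that behavior, not Pig code: it creates a temp file with the same prefix and suffix as the Pig line quoted above and reports where it landed. The class name TempDirDemo and its method are hypothetical.

```java
import java.io.File;
import java.io.IOException;

class TempDirDemo {
    /** Creates a temp file without a directory argument and returns the directory it landed in. */
    static File defaultTempParent() throws IOException {
        File f = File.createTempFile("Job", ".jar"); // same prefix/suffix as the quoted Pig line
        File parent = f.getParentFile();
        f.delete(); // clean up immediately; only the location is of interest
        return parent;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("java.io.tmpdir = " + System.getProperty("java.io.tmpdir"));
        System.out.println("temp file dir  = " + defaultTempParent());
    }
}
```

Starting the JVM with a different value, for example java -Djava.io.tmpdir=/data/tmp TempDirDemo (where /data/tmp is a hypothetical existing directory), should make both lines print that directory.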


To keep the /tmp directory from running out of disk space, the available solutions are:
1. Split a huge Pig script into smaller pieces and run each piece separately.
2. Point the temporary directory at a location with enough disk space via HADOOP_OPTS or PIG_OPTS before submitting the Pig job.
For example:
export PIG_OPTS=""
pig test.pig
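
Before pointing the temp jars at another location, it can be worth checking that the candidate directory is writable and has room. The following is a minimal sketch under stated assumptions: the class name TmpSpaceCheck, the hasRoom() method, and the required-bytes threshold are all hypothetical, and the amount of space a given script needs has to be estimated separately (for instance, number of MapReduce jobs times the typical jar size).

```java
import java.io.File;

class TmpSpaceCheck {
    /** True if dir exists, is a writable directory, and reports at least requiredBytes usable. */
    static boolean hasRoom(File dir, long requiredBytes) {
        return dir.isDirectory()
                && dir.canWrite()
                && dir.getUsableSpace() >= requiredBytes;
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("usage: java TmpSpaceCheck <directory> <requiredBytes>");
            System.exit(2);
        }
        File dir = new File(args[0]);
        long required = Long.parseLong(args[1]);
        System.out.println(hasRoom(dir, required) ? "OK" : "NOT ENOUGH SPACE OR NOT WRITABLE");
    }
}
```

Assuming the property involved is java.io.tmpdir, as the JDK documents for File.createTempFile(), the export would then carry a JVM flag of the form -Djava.io.tmpdir=/data/tmp, with /data/tmp standing in for whichever directory passed the check.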
