Tuesday, May 2, 2017

Pig job fails with "org.apache.hadoop.mapreduce.counters.LimitExceededException: Too many counters: 121 max=120"

Symptom:

Pig job fails with "org.apache.hadoop.mapreduce.counters.LimitExceededException: Too many counters: 121 max=120".

Env:

Pig 0.16
Hadoop 2.7.0

Root Cause:

mapreduce.job.counters.max controls the limit on the number of counters allowed per job.
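To illustrate the mechanics, the check behaves roughly like the toy Python model below. This is only a sketch of the logic, not Hadoop's actual org.apache.hadoop.mapreduce.counters.Limits code; the class and method names mirror the real ones for readability, and 120 is the default value of mapreduce.job.counters.max:

```python
# Toy model of Hadoop's per-job counter limit (illustration only,
# not the real org.apache.hadoop.mapreduce.counters.Limits code).

class LimitExceededException(Exception):
    pass

class Limits:
    def __init__(self, max_counters=120):  # default mapreduce.job.counters.max
        self.max_counters = max_counters
        self.total = 0

    def incr_counters(self):
        # Mirrors the real flow: increment the running total, then check it.
        self.total += 1
        if self.total > self.max_counters:
            raise LimitExceededException(
                "Too many counters: %d max=%d" % (self.total, self.max_counters))

limits = Limits()
try:
    for _ in range(121):  # creating the 121st counter trips the check
        limits.incr_counters()
except LimitExceededException as e:
    print(e)  # -> Too many counters: 121 max=120
```

This is why the message always reports one more than the configured maximum: the counter is counted first, and the check fires immediately after.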
There are at least two scenarios that can produce this error, each with a different stack trace.

1. The MapReduce job itself fails because it creates more counters than this limit allows.

Sample stack trace:
2017-05-02 11:40:13,701 [main] ERROR org.apache.pig.tools.pigstats.PigStats - ERROR 0: org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backed error: Error:       org.apache.hadoop.mapreduce.counters.LimitExceededException: Too many counters: 121 max=120
at org.apache.hadoop.mapreduce.counters.Limits.checkCounters(Limits.java:101)
at org.apache.hadoop.mapreduce.counters.Limits.incrCounters(Limits.java:108)
at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounter(AbstractCounterGroup.java:78)
at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounterImpl(AbstractCounterGroup.java:95)
at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.findCounterImpl(AbstractCounterGroup.java:123)
at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.findCounter(AbstractCounterGroup.java:113)
at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.findCounter(AbstractCounterGroup.java:130)
at org.apache.hadoop.mapred.Counters$Group.findCounter(Counters.java:369)
at org.apache.hadoop.mapred.Counters$Group.getCounterForName(Counters.java:314)
at org.apache.hadoop.mapred.Counters.findCounter(Counters.java:479)
at org.apache.hadoop.mapred.Task$TaskReporter.getCounter(Task.java:667)
at org.apache.hadoop.mapreduce.task.reduce.DirectShuffleFetcher.<init>(DirectShuffleFetcher.java:102)
at org.apache.hadoop.mapreduce.task.reduce.DirectShuffle.run(DirectShuffle.java:117)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
How to reproduce?
Add "set mapreduce.job.counters.max 1;" to a sample Pig script and run it.
The MapReduce job will then show as "FAILED" in the ResourceManager UI.
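A minimal reproduction script might look like the following. The input/output paths and the alias name are placeholders; any trivial load/store is enough, since even a simple job creates more than one framework counter:

```pig
-- Force an artificially low counter limit so even a trivial job exceeds it
set mapreduce.job.counters.max 1;

-- Placeholder paths; any small input file works
a = LOAD '/tmp/sample.txt' AS (line:chararray);
STORE a INTO '/tmp/sample_out';
```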

2. The Job History Server fails to parse the counter information from a completed MapReduce job because the job has more counters than the limit.

Sample stack trace:
2017-05-02 14:57:12,419 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
java.io.IOException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.YarnRuntimeException): org.apache.hadoop.mapreduce.counters.LimitExceededException: Too many counters: 121 max=120
 at org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage.getFullJob(CachedHistoryStorage.java:199)
 at org.apache.hadoop.mapreduce.v2.hs.JobHistory.getJob(JobHistory.java:217)
 at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler$1.run(HistoryClientService.java:209)
 at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler$1.run(HistoryClientService.java:205)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
 at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler.verifyAndGetJob(HistoryClientService.java:205)
 at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler.getJobReport(HistoryClientService.java:242)
 at org.apache.hadoop.mapreduce.v2.api.impl.pb.service.MRClientProtocolPBServiceImpl.getJobReport(MRClientProtocolPBServiceImpl.java:122)
 at org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService$2.callBlockingMethod(MRClientProtocol.java:275)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2036)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2032)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2030)
Caused by: org.apache.hadoop.mapreduce.counters.LimitExceededException: Too many counters: 121 max=120
 at org.apache.hadoop.mapreduce.counters.Limits.checkCounters(Limits.java:101)
 at org.apache.hadoop.mapreduce.counters.Limits.incrCounters(Limits.java:108)
 at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounter(AbstractCounterGroup.java:78)
 at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounterImpl(AbstractCounterGroup.java:95)
 at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounter(AbstractCounterGroup.java:87)
 at org.apache.hadoop.mapreduce.jobhistory.EventReader.fromAvro(EventReader.java:197)
 at org.apache.hadoop.mapreduce.jobhistory.ReduceAttemptFinishedEvent.setDatum(ReduceAttemptFinishedEvent.java:168)
 at org.apache.hadoop.mapreduce.jobhistory.EventReader.getNextEvent(EventReader.java:173)
 at org.apache.hadoop.mapreduce.jobhistory.JobHistoryParser.parse(JobHistoryParser.java:112)
 at org.apache.hadoop.mapreduce.jobhistory.JobHistoryParser.parse(JobHistoryParser.java:154)
 at org.apache.hadoop.mapreduce.jobhistory.JobHistoryParser.parse(JobHistoryParser.java:140)
 at org.apache.hadoop.mapreduce.v2.hs.CompletedJob.loadFullHistoryData(CompletedJob.java:348)
 at org.apache.hadoop.mapreduce.v2.hs.CompletedJob.<init>(CompletedJob.java:101)
 at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$HistoryFileInfo.loadJob(HistoryFileManager.java:417)
 at org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage.loadJob(CachedHistoryStorage.java:180)
 at org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage.access$000(CachedHistoryStorage.java:52)
 at org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage$1.load(CachedHistoryStorage.java:103)
 at org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage$1.load(CachedHistoryStorage.java:100)
 at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
 at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
 at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
 at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
 at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
 at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
 at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
 at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4880)
 at org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage.getFullJob(CachedHistoryStorage.java:193)
 ... 18 more

 at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:343)
 at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:428)
 at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:612)
 at org.apache.hadoop.mapreduce.Cluster.getJob(Cluster.java:186)
 at org.apache.pig.backend.hadoop.executionengine.shims.HadoopShims.getTaskReports(HadoopShims.java:269)
 at org.apache.pig.tools.pigstats.mapreduce.MRJobStats.addMapReduceStatistics(MRJobStats.java:352)
 at org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil.addSuccessJobStats(MRPigStatsUtil.java:233)
 at org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil.accumulateStats(MRPigStatsUtil.java:165)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:380)
 at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:308)
 at org.apache.pig.PigServer.launchPlan(PigServer.java:1487)
 at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1472)
 at org.apache.pig.PigServer.execute(PigServer.java:1461)
 at org.apache.pig.PigServer.access$500(PigServer.java:118)
 at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1786)
 at org.apache.pig.PigServer.registerQuery(PigServer.java:720)
 at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1075)
 at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:505)
 at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:231)
 at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:206)
 at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
 at org.apache.pig.Main.run(Main.java:567)
 at org.apache.pig.Main.main(Main.java:178)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
How to reproduce?
Add the following to mapred-site.xml on the node where the Job History Server is running:
  <property>
      <name>mapreduce.job.counters.max</name>
      <value>1</value>
  </property>
Then restart the Job History Server; on a MapR environment, the command is:
maprcli node services -name historyserver -action restart -nodes `hostname -f`
Unlike scenario #1, the MapReduce job itself will show as "SUCCEEDED" in the ResourceManager UI; only the client's attempt to fetch job statistics afterwards fails.

Solution:

Although both scenarios show the same error message (with different stack traces), the fixes go in different places.
For #1, increase mapreduce.job.counters.max on the client side.
For example, put "set mapreduce.job.counters.max 1000;" in the Pig script itself,
or add the following to mapred-site.xml on the client node where the Pig script is launched:
  <property>
      <name>mapreduce.job.counters.max</name>
      <value>1000</value>
  </property>

For #2, increase mapreduce.job.counters.max for the Job History Server.
For example, add the following to mapred-site.xml on the node where the Job History Server is running, then restart the Job History Server so the change takes effect:
  <property>
      <name>mapreduce.job.counters.max</name>
      <value>1000</value>
  </property>

In practice, users typically hit issue #1 first, and after fixing it, issue #2 shows up.
If you know your MapReduce jobs can be heavy in terms of the number of counters, I would suggest increasing mapreduce.job.counters.max everywhere, on both the client and server sides, and then restarting all related daemon processes so the settings stay in sync.
