Wednesday, July 16, 2014

Troubleshoot Oozie MapReduce jobs

This article provides troubleshooting steps for a failed Oozie MapReduce job.
YARN is used in this example.
For example, if the Oozie MapReduce job below fails, which logs should be checked for root cause analysis (RCA)?
[root@admin]# oozie job -info 0000031-140711123346649-oozie-oozi-W
Job ID : 0000031-140711123346649-oozie-oozi-W
------------------------------------------------------------------------------------------------------------------------------------
Workflow Name : map-reduce-wf-pi
App Path      : hdfs://nameservice1/user/root/examples/apps/map-reduce_pi
Status        : KILLED
Run           : 0
User          : root
Group         : -
Created       : 2014-07-16 21:15 GMT
Started       : 2014-07-16 21:15 GMT
Last Modified : 2014-07-16 21:17 GMT
Ended         : 2014-07-16 21:17 GMT
CoordAction ID: -

Actions
------------------------------------------------------------------------------------------------------------------------------------
ID                                                                            Status    Ext ID                 Ext Status Err Code
------------------------------------------------------------------------------------------------------------------------------------
0000031-140711123346649-oozie-oozi-W@:start:                                  OK        -                      OK         -
------------------------------------------------------------------------------------------------------------------------------------
0000031-140711123346649-oozie-oozi-W@mr-node                                  ERROR     job_1404818506021_0063 FAILED/KILLED-
------------------------------------------------------------------------------------------------------------------------------------
0000031-140711123346649-oozie-oozi-W@fail                                     OK        -                      OK         E0729
------------------------------------------------------------------------------------------------------------------------------------

1. Check the Oozie log first.

oozie job -log 0000031-140711123346649-oozie-oozi-W
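The Oozie log can be long, so it may help to narrow it to warnings and errors first. The following is a minimal sketch; the helper name filter_oozie_levels is hypothetical, and it assumes each log line carries its log4j level (WARN, ERROR, FATAL) as a separate word.

```shell
# Hypothetical helper: keep only log lines whose level is WARN, ERROR or FATAL.
# Reads the log on stdin so it can be fed from "oozie job -log".
filter_oozie_levels() {
  grep -Ew 'WARN|ERROR|FATAL'
}

# Usage (requires the oozie client and OOZIE_URL to be set):
#   oozie job -log 0000031-140711123346649-oozie-oozi-W | filter_oozie_levels
```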

2. Check the related MapReduce job log.

mapred job -logs job_1404818506021_0063

3. Check the logs of the related map and reduce attempts.

First, identify the map and reduce attempt IDs.
[root@admin]# mapred job -list-attempt-ids job_1404818506021_0063 map completed
attempt_1404818506021_0063_m_000000_0
[root@admin]# mapred job -list-attempt-ids job_1404818506021_0063 reduce completed
Then check the log of each attempt:
mapred job -logs job_1404818506021_0063 attempt_1404818506021_0063_m_000000_0
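When a job has many attempts, the two commands above can be combined in a loop. This is only a sketch: the function name attempt_logs is hypothetical, and the task states accepted by -list-attempt-ids depend on the Hadoop version, so "completed" is used only because the example above uses it.

```shell
# Hypothetical helper: print the logs of every attempt of one task type.
# Usage: attempt_logs <job-id> <map|reduce> <state>
attempt_logs() {
  job_id=$1
  task_type=$2
  state=$3
  # Keep only the attempt IDs; drop any client log noise on stdout.
  mapred job -list-attempt-ids "$job_id" "$task_type" "$state" |
    grep '^attempt_' |
    while read -r attempt; do
      echo "=== $attempt ==="
      # </dev/null keeps the inner command from consuming the list of IDs.
      mapred job -logs "$job_id" "$attempt" </dev/null
    done
}

# Usage:
#   attempt_logs job_1404818506021_0063 map completed
```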

4. Check YARN container logs.

4.1 First, identify the YARN application ID and all of its child application IDs from the Oozie web GUI.
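For the launcher job, the application ID can also be derived from the Ext ID shown by "oozie job -info": as the outputs in this article show, job_1404818506021_0063 corresponds to application_1404818506021_0063. A small sketch of that conversion follows; the function name job_to_app_id is hypothetical, and the child application IDs still need to be looked up separately.

```shell
# Hypothetical helper: convert a MapReduce job ID to the matching
# YARN application ID by swapping the "job_" prefix.
job_to_app_id() {
  printf 'application_%s\n' "${1#job_}"
}

# Usage:
#   yarn application -status "$(job_to_app_id job_1404818506021_0063)"
```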

4.2 Check the status of each YARN application to see which one failed.

[root@admin]# yarn application -status application_1404818506021_0063
14/07/16 15:56:28 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm148
Application Report :
 Application-Id : application_1404818506021_0063
 Application-Name : oozie:launcher:T=map-reduce:W=map-reduce-wf-pi:A=mr-node:ID=0000031-140711123346649-oozie-oozi-W
 Application-Type : MAPREDUCE
 User : root
 Queue : root.root
 Start-Time : 1405545360622
 Finish-Time : 1405545383113
 Progress : 100%
 State : FINISHED
 Final-State : SUCCEEDED
 Tracking-URL : http://admin.xxx.com:19888/jobhistory/job/job_1404818506021_0063
 RPC Port : 31561
 AM Host : hdw2.xxx.com
 Diagnostics :
[root@admin]# yarn application -status application_1404818506021_0064
14/07/16 16:15:17 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm148
Application Report :
 Application-Id : application_1404818506021_0064
 Application-Name : oozie:action:T=map-reduce:W=map-reduce-wf-pi:A=mr-node:ID=0000031-140711123346649-oozie-oozi-W
 Application-Type : MAPREDUCE
 User : root
 Queue : root.root
 Start-Time : 1405545382007
 Finish-Time : 1405545422206
 Progress : 100%
 State : FINISHED
 Final-State : FAILED
 Tracking-URL : http://admin.xxx.com:19888/jobhistory/job/job_1404818506021_0064
 RPC Port : 8699
 AM Host : hdw1.xxx.com
 Diagnostics : Task failed task_1404818506021_0064_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0
The reports show that YARN application application_1404818506021_0064 (the Oozie action, not the launcher) failed.
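With more than a couple of applications, reading each full report becomes tedious. The sketch below pulls only the Final-State field out of a report; the function name final_state is hypothetical, and it assumes the " Field : value" layout shown in the reports above.

```shell
# Hypothetical helper: read a "yarn application -status" report on stdin
# and print only the value of the Final-State field.
final_state() {
  awk -F' : ' '/^[[:space:]]*Final-State :/ { print $2; exit }'
}

# Usage:
#   for app in application_1404818506021_0063 application_1404818506021_0064; do
#     printf '%s %s\n' "$app" "$(yarn application -status "$app" 2>/dev/null | final_state)"
#   done
```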

4.3 Check the logs of the failed YARN application.

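yarn logs -applicationId application_1404818506021_0064
In this example, the root cause appears in the YARN container logs: the mapper class named in the job configuration cannot be loaded. The package name is likely misspelled, since the Hadoop examples live in org.apache.hadoop.examples.
WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.example.QuasiMonteCarlo$QmcMapper not found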
 at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1895)
 at org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:196)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:722)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
 at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.example.QuasiMonteCarlo$QmcMapper not found
 at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1801)
 at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1893)
 ... 8 more
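The aggregated output of "yarn logs" covers every container, so searching it for exceptions is usually quicker than reading it top to bottom. A minimal sketch, assuming a grep that supports -A (GNU and BSD grep do); the function name find_exceptions and the 5-line context are arbitrary choices.

```shell
# Hypothetical helper: print each line mentioning an exception, plus up to
# 5 following lines so the top of the stack trace is visible.
find_exceptions() {
  grep -E -A 5 'Exception'
}

# Usage:
#   yarn logs -applicationId application_1404818506021_0064 | find_exceptions
```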

5. Check the ResourceManager and NodeManager logs.

The ResourceManager log is on the active ResourceManager host.
Identify the container in which the task attempt failed, then check the NodeManager log on the host that ran it.
For example:
[root@hdw2 ~]# ls /var/log/hadoop-yarn
container  hadoop-cmf-yarn-NODEMANAGER-hdw2.xxx.com.log.out  hadoop-cmf-yarn-RESOURCEMANAGER-xxx.viadea.com.log.out
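To narrow a NodeManager log to one application, one option is to search for its container IDs. The sketch below assumes container IDs embed the application's numeric suffix (container_<cluster-timestamp>_<app-number>_..., optionally with an epoch such as e17 after "container_" on newer Hadoop versions); the function name nm_lines_for_app is hypothetical.

```shell
# Hypothetical helper: print NodeManager log lines that mention any
# container belonging to the given application.
# Usage: nm_lines_for_app <application-id> <nodemanager-log-file>
nm_lines_for_app() {
  suffix=${1#application_}
  grep -E "container_(e[0-9]+_)?${suffix}_" "$2"
}

# Usage (log file name taken from the listing above):
#   nm_lines_for_app application_1404818506021_0064 \
#     /var/log/hadoop-yarn/hadoop-cmf-yarn-NODEMANAGER-hdw2.xxx.com.log.out
```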
