Wednesday, June 29, 2016

How to troubleshoot Yarn job failures

Goal:

This article explains, for newcomers, a methodology for troubleshooting Yarn job failures.

Solution:

Whether it is Hive on Yarn, Spark on Yarn, or anything else that runs on the Yarn execution engine, the troubleshooting methodology is the same.
The key is always the AM (ApplicationMaster) log.
Important things are worth repeating three times: AM, AM, AM :)
Ask the following 5 questions to start troubleshooting:

1. What is the problematic Yarn application ID?

This can be found in the client log, e.g. the Hive log, the Spark log, or a custom application log.
For example, the client log normally contains at least the following information:
application_1111111111111_12345
tracking URL: http://RM:8088/proxy/application_1111111111111_12345/
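
As a minimal sketch, the application ID can be pulled out of a client log with a regular expression. The function name find_application_ids and the sample text are illustrative assumptions; the pattern relies only on the application_<number>_<number> shape shown above.

```python
import re

# Matches IDs shaped like application_1111111111111_12345.
APP_ID_PATTERN = re.compile(r"application_\d+_\d+")


def find_application_ids(log_text):
    """Return the distinct application IDs in log_text, in first-seen order."""
    seen = []
    for app_id in APP_ID_PATTERN.findall(log_text):
        if app_id not in seen:
            seen.append(app_id)
    return seen


# Sample text copied from the example above.
sample = (
    "application_1111111111111_12345\n"
    "tracking URL: http://RM:8088/proxy/application_1111111111111_12345/\n"
)
print(find_application_ids(sample))
```

If a client log contains more than one ID, the job that failed is usually the one whose tracking URL the client printed last, but that is worth confirming in the RM UI.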

2. On which node does the AM container run?

This can be found in the RM (ResourceManager) log or in the RM UI for that Yarn job.
The AM container is most likely the first container of that Yarn job, unless an earlier AM attempt failed and the AM was relaunched.
Just search the RM log for the application ID, "1111111111111_12345":
INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: 
Setting up container Container: 
[ContainerId: container_e87_1111111111111_12345_01_000001, NodeId: node1:43722, NodeHttpAddress: node1:8042, 
Resource: <memory:1024, vCores:1, disks:0.0>, Priority: 0, Token: Token { kind: ContainerToken, service: 192.1.11.111:43722 }, ] 
for AM appattempt_1111111111111_12345_000001
From this entry we know that node1 runs the AM container: container_e87_1111111111111_12345_01_000001
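
The same lookup can be sketched in Python. This assumes the RM log contains AMLauncher entries shaped like the one quoted above; the function name find_am_containers is hypothetical, and the regular expression has only been written against that single sample entry.

```python
import re

# One AMLauncher entry: the container ID, the host part of NodeId, and the
# application attempt the container was set up for. DOTALL lets the pattern
# span line breaks in case the entry is wrapped, as in the excerpt above.
AM_LAUNCH_PATTERN = re.compile(
    r"Setting up container.*?"
    r"ContainerId:\s*(container_\w+),\s*"
    r"NodeId:\s*([^:\s,]+):\d+.*?"
    r"for AM (appattempt_\w+)",
    re.DOTALL,
)


def find_am_containers(rm_log_text, app_id_suffix):
    """Return (attempt, container_id, host) for each AM launch of the application."""
    launches = []
    for match in AM_LAUNCH_PATTERN.finditer(rm_log_text):
        container_id, host, attempt = match.groups()
        if app_id_suffix in attempt:
            launches.append((attempt, container_id, host))
    return launches


# Sample entry copied from the RM log excerpt above.
sample = """INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher:
Setting up container Container:
[ContainerId: container_e87_1111111111111_12345_01_000001, NodeId: node1:43722, NodeHttpAddress: node1:8042,
Resource: <memory:1024, vCores:1, disks:0.0>, Priority: 0, Token: Token { kind: ContainerToken, service: 192.1.11.111:43722 }, ]
for AM appattempt_1111111111111_12345_000001
"""
for attempt, container_id, host in find_am_containers(sample, "1111111111111_12345"):
    print(attempt, container_id, host)
```

More than one result means the AM was launched more than once; the last attempt is normally the one whose log to read first.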

3. Which container(s) failed?


This can be found in the AM container log.
The AM container is just a special container, so it can fail too.
The AM container log is the key: in most cases, it tells you which container failed and why.
For this example, assume that container_e87_1111111111111_12345_01_000123 on node2 failed.
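
A container ID also encodes which application and attempt it belongs to, which helps confirm that a failed container really is part of the job under investigation. The sketch below assumes the container_[e<epoch>_]<clusterTimestamp>_<appSequence>_<attempt>_<containerNumber> layout seen in the IDs in this article; the names ContainerParts and parse_container_id are hypothetical.

```python
import re
from typing import NamedTuple


class ContainerParts(NamedTuple):
    application_id: str
    attempt: int
    container_number: int


# The e<epoch>_ part is optional; the IDs in this article carry e87.
CONTAINER_PATTERN = re.compile(r"container_(?:e\d+_)?(\d+)_(\d+)_(\d+)_(\d+)$")


def parse_container_id(container_id):
    """Split a container ID into application ID, attempt and container number."""
    match = CONTAINER_PATTERN.match(container_id)
    if match is None:
        raise ValueError(f"not a recognised container ID: {container_id!r}")
    cluster_ts, app_seq, attempt, number = match.groups()
    return ContainerParts(f"application_{cluster_ts}_{app_seq}", int(attempt), int(number))


parts = parse_container_id("container_e87_1111111111111_12345_01_000123")
print(parts.application_id, parts.attempt, parts.container_number)
# Container number 1 of an attempt is most likely the AM container itself.
print("probably the AM container:", parts.container_number == 1)
```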

4. What is the error in the failed container log?

Again, the failed container can be the AM container itself.
In this example, we assume it is container_e87_1111111111111_12345_01_000123.
To find out which node the failed container ran on, search the RM log with "grep container_e87_1111111111111_12345_01_000123".

5. What is the error in the NM log for that failed container?

In this example, that is the NM (NodeManager) log on node2.
If the container was killed by the virtual/physical memory checker, the NM log will show it.
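
A rough way to scan an NM log for such kills is sketched below. It assumes the memory checker writes a line containing "containerID=<id>] is running ... beyond ... physical/virtual ... memory limit"; the exact wording differs between Hadoop versions, so the pattern may need adjusting. The sample line is hand-written for illustration and is not real output, and the function name find_memory_kills is hypothetical.

```python
import re

# Assumed wording of the memory checker message; adjust for your Hadoop version.
MEMORY_KILL_PATTERN = re.compile(
    r"containerID=(container_\w+)\]\s+is running\b.*?\bbeyond\b.*?"
    r"\b(physical|virtual)\b.*?memory limit",
    re.IGNORECASE,
)


def find_memory_kills(nm_log_lines, container_id):
    """Return 'physical' or 'virtual' for each memory-limit kill of container_id."""
    kinds = []
    for line in nm_log_lines:
        match = MEMORY_KILL_PATTERN.search(line)
        if match and match.group(1) == container_id:
            kinds.append(match.group(2).lower())
    return kinds


# Hand-written illustrative line, not copied from a real NM log.
sample_lines = [
    "WARN ContainersMonitorImpl: Container [pid=4321,"
    "containerID=container_e87_1111111111111_12345_01_000123] is running beyond "
    "physical memory limits. Current usage: 1.1 GB of 1 GB physical memory used; "
    "2.0 GB of 2.1 GB virtual memory used. Killing container.",
    "INFO some unrelated NM log line",
]
print(find_memory_kills(sample_lines, "container_e87_1111111111111_12345_01_000123"))
```

An empty result only means that no line matched this assumed pattern; it does not rule out a memory-related kill, so the NM log should still be read around the time the container failed.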

Key takeaways:

The AM container is the brain of a Yarn job. It controls the whole life cycle of the job.
For any Yarn job failure or performance issue, always start by checking the AM log.
