Thursday, September 3, 2015

ResourceManager fails to transition to Active mode with "InvalidResourceRequestException"

Env:

Hadoop 2.5.1
Apache Hadoop ResourceManager HA enabled.

Symptom:

ResourceManager fails to transition to Active mode with "InvalidResourceRequestException".

The following stack trace appears first in the RM log:
Caused by: org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=9216, maxMemory=8192
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:228)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateResourceRequest(RMAppManager.java:385)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:345)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:309)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1104)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:508)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        ... 13 more
The following stack trace then repeats in the RM log:
WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
        at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:122)
        at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:805)
        at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode
        at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:301)
        at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:120)
        ... 4 more
Caused by: org.apache.hadoop.service.ServiceStateException: RMActiveServices cannot enter state STARTED from state STOPPED
        at org.apache.hadoop.service.ServiceStateModel.checkStateTransition(ServiceStateModel.java:129)
        at org.apache.hadoop.service.ServiceStateModel.enterState(ServiceStateModel.java:111)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:190)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:911)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:951)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:948)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1566)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:948)
        at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:292)
        ... 5 more
2015-09-03 13:59:23,581 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
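
The two numbers that matter in the first stack trace are requestedMemory and maxMemory. As a minimal sketch, the helper below pulls them out of a log line; the function name parse_invalid_request is hypothetical, and the sample line is shortened from the log above.

```shell
# Hypothetical helper: print "<requestedMemory> <maxMemory>" for every
# matching line read from stdin; non-matching lines are ignored.
parse_invalid_request() {
  sed -n 's/.*requestedMemory=\([0-9][0-9]*\), maxMemory=\([0-9][0-9]*\).*/\1 \2/p'
}

# Sample line shortened from the RM log above. On a real cluster, the RM log
# file would be piped in instead.
sample='InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=9216, maxMemory=8192'
printf '%s\n' "$sample" | parse_invalid_request
```

Here the request (9216 MB) is larger than the configured maximum (8192 MB), which is what the exception reports.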

Root Cause:

This is caused by YARN-3493, which is fixed in Hadoop 2.6.1, 2.7.1, and 2.8.0.
The issue can occur when users lower the value of yarn.scheduler.maximum-allocation-mb and then restart ResourceManager.
During recovery, ResourceManager re-validates the resource request of every application left in RMStateStore. If any of them requested more memory than the new yarn.scheduler.maximum-allocation-mb (here 9216 MB against a maximum of 8192 MB), recovery fails, even if that application failed long ago.
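
The condition that trips during recovery can be sketched as below. This only mirrors the wording of the exception message ("requested memory < 0, or requested memory > max configured"); it is not the actual SchedulerUtils code, and the function name memory_request_is_valid is hypothetical.

```shell
# Hypothetical check: return 0 if the request would pass, 1 if it would be
# rejected. Both arguments are integers in MB.
memory_request_is_valid() {
  requested_mb=$1
  max_mb=$2
  [ "$requested_mb" -ge 0 ] && [ "$requested_mb" -le "$max_mb" ]
}

# Values taken from the exception in the RM log above.
if memory_request_is_valid 9216 8192; then
  echo "request accepted"
else
  echo "request rejected"
fi
```

With the values from the log, the request is rejected, and because the check runs while stored applications are being recovered, the RM cannot become Active.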

Solution:

1. Identify the RMStateStore class.

MapR uses FileSystemRMStateStore by default, which means the RMStateStore resides on MapR-FS (MFS).
Users may also choose ZKRMStateStore.
$ hadoop2 conf |grep yarn.resourcemanager.store.class
<property><name>yarn.resourcemanager.store.class</name><value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value><source>yarn-default.xml</source></property>
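
To use the configured value in a script, the one-line property output can be reduced to just its value. This is a rough sketch; conf_value is a hypothetical helper, and it assumes the property is printed on a single line as shown above.

```shell
# Hypothetical helper: print the text between <value> and </value> for each
# property line read from stdin.
conf_value() {
  sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}

# Sample line copied from the output above. On a cluster this would be:
#   hadoop2 conf | grep yarn.resourcemanager.store.class | conf_value
line='<property><name>yarn.resourcemanager.store.class</name><value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value><source>yarn-default.xml</source></property>'
store_class=$(printf '%s\n' "$line" | conf_value)

# Strip the package prefix to get the short class name.
echo "${store_class##*.}"
```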

2. Find the location of RMStateStore.

If the RMStateStore class is FileSystemRMStateStore, the parent location is defined by yarn.resourcemanager.fs.state-store.uri.
$ hadoop2 conf |grep  yarn.resourcemanager.fs.state-store.uri
<property><name>yarn.resourcemanager.fs.state-store.uri</name><value>/var/mapr/cluster/yarn/rm/system</value><source>yarn-default.xml</source></property>
The location of all application directories is then:
/var/mapr/cluster/yarn/rm/system/FSRMStateRoot/RMAppRoot

If the RMStateStore class is ZKRMStateStore, the parent znode is defined by yarn.resourcemanager.zk-state-store.parent-path.
$ hadoop2 conf |grep yarn.resourcemanager.zk-state-store.parent-path
<property><name>yarn.resourcemanager.zk-state-store.parent-path</name><value>/rmstore</value><source>yarn-default.xml</source></property>
The parent znode of all application znodes is then:
/rmstore/ZKRMStateRoot/RMAppRoot/
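
The mapping from store class and configured parent to the application root can be sketched as follows. The function name rm_app_root is hypothetical; the FSRMStateRoot and ZKRMStateRoot path components are the ones shown in this step.

```shell
# Hypothetical helper: print the RMAppRoot location for a given store class
# and configured parent path (or parent znode).
rm_app_root() {
  store_class=$1
  parent=${2%/}   # drop one trailing slash, if present
  case "$store_class" in
    *FileSystemRMStateStore) echo "$parent/FSRMStateRoot/RMAppRoot" ;;
    *ZKRMStateStore)         echo "$parent/ZKRMStateRoot/RMAppRoot" ;;
    *) echo "unsupported store class: $store_class" >&2; return 1 ;;
  esac
}

rm_app_root org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore /var/mapr/cluster/yarn/rm/system
rm_app_root org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore /rmstore
```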

3. Move or remove all the application directories in RMStateStore.

Impact of this step: the RM UI will be clean, but application information remains viewable in the HistoryServer UI. The RM will also not recover any failed or running applications, so users need to resubmit them.
For example:
If FileSystemRMStateStore is used, move the application directories to an existing backup directory:
hadoop fs -mv /var/mapr/cluster/yarn/rm/system/FSRMStateRoot/RMAppRoot/* /backup_statestore/

If ZKRMStateStore is used, remove the application znodes one by one from the ZooKeeper CLI:
rmr /rmstore/ZKRMStateRoot/RMAppRoot/application_#############_####
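
When there are many applications in ZooKeeper, typing each rmr command by hand is tedious. As a rough sketch, the helper below turns a list of application IDs into rmr commands that can be reviewed and then pasted into the ZooKeeper CLI. The function name zk_rmr_commands and the two application IDs are hypothetical, and the sketch assumes the IDs have already been put one per line.

```shell
# Hypothetical helper: read application IDs from stdin (one per line) and
# print one "rmr" command per ID. Lines that do not look like application
# IDs are skipped. Nothing is deleted; the commands are only printed.
zk_rmr_commands() {
  app_root=$1
  while IFS= read -r app_id; do
    case "$app_id" in
      application_*) printf 'rmr %s/%s\n' "$app_root" "$app_id" ;;
    esac
  done
}

# Made-up application IDs, for illustration only.
printf '%s\n' application_1441200000000_0001 application_1441200000000_0002 |
  zk_rmr_commands /rmstore/ZKRMStateRoot/RMAppRoot
```

Printing the commands instead of running them leaves a chance to check the list before anything is removed from the state store.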

4. Restart ResourceManager.

