  1. Hadoop YARN
  2. YARN-3571

AM does not re-blacklist NMs after ignoring-blacklist event happens?

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.5.1
    • Fix Version/s: None
    • Labels:
      None

      Description

      A detailed analysis is in item 3, "Will AM re-blacklist NMs after ignoring-blacklist event happens?", of the link below:
      http://www.openkb.info/2015/05/when-will-application-master-blacklist.html

      The current behavior is: if a Node Manager has ever been blacklisted, it will not be blacklisted again after the ignoring-blacklist event happens; a Node Manager that has never been blacklisted can still be blacklisted.
      However, I think the right behavior should be: the AM can re-blacklist NMs even after the ignoring-blacklist event has happened once.

      The code logic is in function containerFailedOnHost(String hostName) of RMContainerRequestor.java:

        protected void containerFailedOnHost(String hostName) {
          if (!nodeBlacklistingEnabled) {
            return;
          }
          if (blacklistedNodes.contains(hostName)) {
            if (LOG.isDebugEnabled()) {
              LOG.debug("Host " + hostName + " is already blacklisted.");
            }
            return; //already blacklisted
          }
          // ... (rest of the method is omitted in this report)
        }

      The reason for this behavior is given in item 2 of the linked analysis: when the ignoring-blacklist event happens, the AM only asks the RM to clear "blacklistAdditions"; it does not clear the "blacklistedNodes" variable.
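The interaction described above can be reproduced in a small standalone model. This is a minimal sketch, not the Hadoop class: `BlacklistSketch`, `onIgnoreBlacklisting`, and the `clearOnIgnore` switch are hypothetical names introduced for illustration, and the failure counting is an assumption based on the "3 failures on node" pattern in the logs. It only models the AM's local bookkeeping; whether and how the RM is notified again is not covered.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Simplified, hypothetical model of the AM-side blacklist bookkeeping. */
class BlacklistSketch {
    final Set<String> blacklistedNodes = new HashSet<>();   // AM's local record
    final Set<String> blacklistAdditions = new HashSet<>(); // pending additions for the RM
    final Map<String, Integer> nodeFailures = new HashMap<>();
    final int maxTaskFailuresPerNode;
    // Hypothetical switch: true models the behavior proposed in this report.
    final boolean clearOnIgnore;

    BlacklistSketch(int maxTaskFailuresPerNode, boolean clearOnIgnore) {
        this.maxTaskFailuresPerNode = maxTaskFailuresPerNode;
        this.clearOnIgnore = clearOnIgnore;
    }

    /** Returns true if this failure caused the host to be (re-)blacklisted locally. */
    boolean containerFailedOnHost(String hostName) {
        if (blacklistedNodes.contains(hostName)) {
            return false; // the early return quoted in the report
        }
        int failures = nodeFailures.merge(hostName, 1, Integer::sum);
        if (failures >= maxTaskFailuresPerNode) {
            nodeFailures.remove(hostName); // assumed: the count restarts after blacklisting
            blacklistedNodes.add(hostName);
            blacklistAdditions.add(hostName);
            return true;
        }
        return false;
    }

    /** Models the ignoring-blacklist event. */
    void onIgnoreBlacklisting() {
        blacklistAdditions.clear();       // what the report says happens today
        if (clearOnIgnore) {
            blacklistedNodes.clear();     // the additional step the report argues for
        }
    }
}
```

With `clearOnIgnore` set to false, a host that was blacklisted before the event always hits the early return afterwards, so further failures on it are never counted. With it set to true, the same host is blacklisted again after three more failures.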

      This behavior may cause the whole job/application to fail if a previously blacklisted NM is released when the ignoring-blacklist event happens.
      Imagine a serial murderer is released from prison just because the prison is 33% full and, worse, will never be put in prison again. Only new murderers will be put in prison.

      Example to demonstrate this:
      Test 1:
      One node (h4) has an issue; the other 3 nodes are healthy.
      The job failed with the AM logs below:

      [root@h1 container_1430425729977_0006_01_000001]# egrep -i 'failures on node|blacklist|FATAL' syslog
      2015-05-02 18:38:41,246 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: nodeBlacklistingEnabled:true
      2015-05-02 18:38:41,246 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: blacklistDisablePercent is 1
      2015-05-02 18:39:07,249 FATAL [IPC Server handler 3 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000002_0 - exited : java.io.IOException: Spill failed
      2015-05-02 18:39:07,297 INFO [Thread-49] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: 1 failures on node h4.poc.com
      2015-05-02 18:39:07,950 FATAL [IPC Server handler 16 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000008_0 - exited : java.io.IOException: Spill failed
      2015-05-02 18:39:07,954 INFO [Thread-49] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: 2 failures on node h4.poc.com
      2015-05-02 18:39:08,148 FATAL [IPC Server handler 17 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000007_0 - exited : java.io.IOException: Spill failed
      2015-05-02 18:39:08,152 INFO [Thread-49] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: 3 failures on node h4.poc.com
      2015-05-02 18:39:08,152 INFO [Thread-49] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: Blacklisted host h4.poc.com
      2015-05-02 18:39:08,561 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: Update the blacklist for application_1430425729977_0006: blacklistAdditions=1 blacklistRemovals=0
      2015-05-02 18:39:08,561 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: Ignore blacklisting set to true. Known: 4, Blacklisted: 1, 25%
      2015-05-02 18:39:09,563 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: Update the blacklist for application_1430425729977_0006: blacklistAdditions=0 blacklistRemovals=1
      2015-05-02 18:39:32,912 FATAL [IPC Server handler 19 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000002_1 - exited : java.io.IOException: Spill failed
      2015-05-02 18:39:35,076 FATAL [IPC Server handler 1 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000009_0 - exited : java.io.IOException: Spill failed
      2015-05-02 18:39:35,133 FATAL [IPC Server handler 5 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000008_1 - exited : java.io.IOException: Spill failed
      2015-05-02 18:39:57,308 FATAL [IPC Server handler 17 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000002_2 - exited : java.io.IOException: Spill failed
      2015-05-02 18:40:00,174 FATAL [IPC Server handler 10 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000009_1 - exited : java.io.IOException: Spill failed
      2015-05-02 18:40:00,227 FATAL [IPC Server handler 12 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000007_1 - exited : java.io.IOException: Spill failed
      2015-05-02 18:40:22,905 FATAL [IPC Server handler 3 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000018_0 - exited : java.io.IOException: Spill failed
      2015-05-02 18:40:24,413 FATAL [IPC Server handler 19 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000009_2 - exited : java.io.IOException: Spill failed
      2015-05-02 18:40:26,086 FATAL [IPC Server handler 16 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000002_3 - exited : java.io.IOException: Spill failed
      

      From the logs above, we can see that node h4 was blacklisted after 3 task failures.
      Immediately after that, the ignoring-blacklist event happened.
      After that, node h4 was never blacklisted again.
      When task 1430425729977_0006_m_000002 had failed 4 times, the whole job failed.
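The "Known: 4, Blacklisted: 1, 25%" log line shows why the ignoring-blacklist event fired as soon as h4 was blacklisted: the test ran with blacklistDisablePercent set to 1. The arithmetic can be sketched as follows. This is a hedged illustration only: `IgnoreThresholdSketch` and its methods are hypothetical names, and the ">=" comparison and integer truncation are assumptions, not quoted from the Hadoop source.

```java
/** Hypothetical helper that reproduces the percentage shown in the AM log. */
class IgnoreThresholdSketch {
    /** Percentage of known nodes that are blacklisted, truncated to an int. */
    static int blacklistedPercent(int blacklisted, int known) {
        if (known <= 0) {
            throw new IllegalArgumentException("known node count must be positive");
        }
        return (int) ((float) blacklisted / known * 100);
    }

    /** Assumed rule: ignore blacklisting once the percentage reaches the threshold. */
    static boolean shouldIgnoreBlacklisting(int blacklisted, int known, int disablePercent) {
        return blacklistedPercent(blacklisted, known) >= disablePercent;
    }
}
```

Under these assumptions, one blacklisted node out of four gives 25%, which exceeds a threshold of 1, so blacklisting is ignored immediately. With a threshold of 33 (the figure used in the analogy above), the same cluster would not ignore blacklisting until a second node was blacklisted.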

            People

            • Assignee:
              Unassigned
              Reporter:
              haozhu Hao Zhu