  1. Hadoop YARN
  2. YARN-7962

Race Condition When Stopping DelegationTokenRenewer causes RM crash during failover

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.2.0, 3.1.1
    • Component/s: resourcemanager
    • Labels: None
    • Target Version/s:
    • Hadoop Flags: Reviewed

      Description

      https://github.com/apache/hadoop/blob/69fa81679f59378fd19a2c65db8019393d7c05a2/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java

        private ThreadPoolExecutor renewerService;
      
        private void processDelegationTokenRenewerEvent(
            DelegationTokenRenewerEvent evt) {
          serviceStateLock.readLock().lock();
          try {
            if (isServiceStarted) {
              renewerService.execute(new DelegationTokenRenewerRunnable(evt));
            } else {
              pendingEventQueue.add(evt);
            }
          } finally {
            serviceStateLock.readLock().unlock();
          }
        }
      
        @Override
        protected void serviceStop() {
          if (renewalTimer != null) {
            renewalTimer.cancel();
          }
          appTokens.clear();
          allTokens.clear();
          this.renewerService.shutdown();
          // ... remainder of serviceStop() not shown

      The ResourceManager log then shows:

      2018-02-21 11:18:16,253  FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
      java.util.concurrent.RejectedExecutionException: Task org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable@39bddaf2 rejected from java.util.concurrent.ThreadPoolExecutor@5f71637b[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 15487]
      	at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
      	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
      	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
      	at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.processDelegationTokenRenewerEvent(DelegationTokenRenewer.java:196)
      	at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.applicationFinished(DelegationTokenRenewer.java:734)
      	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.finishApplication(RMAppManager.java:199)
      	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:424)
      	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:65)
      	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:177)
      	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
      	at java.lang.Thread.run(Thread.java:745)
      

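      What I think is going on here is that the serviceStop method never sets the isServiceStarted flag to 'false'. An event that arrives after the thread pool has been shut down therefore still takes the isServiceStarted branch in processDelegationTokenRenewerEvent and calls renewerService.execute(), which the terminated pool rejects with the RejectedExecutionException shown above.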
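
The rejection itself can be reproduced outside of YARN. The following is a minimal sketch, not code from the Hadoop tree: the class name RejectedAfterShutdownDemo and its helper method are hypothetical, and it only assumes the default rejection policy of java.util.concurrent.ThreadPoolExecutor (AbortPolicy, as seen in the stack trace).

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical demo class; it is not part of Hadoop.
class RejectedAfterShutdownDemo {

  /** Returns true if a task submitted after shutdown() is rejected. */
  static boolean rejectedAfterShutdown() {
    ExecutorService pool = Executors.newFixedThreadPool(1);
    pool.shutdown();
    try {
      // Same call pattern as renewerService.execute(...) in the report.
      pool.execute(() -> { });
      return false;
    } catch (RejectedExecutionException e) {
      // The default AbortPolicy throws once the pool has been shut down.
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println("rejected: " + rejectedAfterShutdown());
  }
}
```

In the ResourceManager the exception is not caught, so it propagates up to the AsyncDispatcher thread, which matches the FATAL log entry.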

      Please update the serviceStop method so that it acquires the serviceStateLock and sets isServiceStarted to false before shutting down the renewerService thread pool. Events arriving during or after the stop would then be queued instead of being submitted to a terminated pool.
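
A rough sketch of the requested change, under stated assumptions: RenewerStopSketch is a hypothetical, heavily simplified stand-in for DelegationTokenRenewer, events are modelled as plain Runnables, and taking the write lock in stop() is one reading of "grabs the serviceStateLock". It is not the attached patch.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for DelegationTokenRenewer; field names follow the report.
class RenewerStopSketch {
  private final ReentrantReadWriteLock serviceStateLock = new ReentrantReadWriteLock();
  private final ExecutorService renewerService = Executors.newFixedThreadPool(2);
  private final Queue<Runnable> pendingEventQueue = new ConcurrentLinkedQueue<>();
  private boolean isServiceStarted = true; // guarded by serviceStateLock

  /** Mirrors processDelegationTokenRenewerEvent: run if started, else queue. */
  void process(Runnable evt) {
    serviceStateLock.readLock().lock();
    try {
      if (isServiceStarted) {
        renewerService.execute(evt);
      } else {
        pendingEventQueue.add(evt);
      }
    } finally {
      serviceStateLock.readLock().unlock();
    }
  }

  /** Flips the flag under the write lock before shutting the pool down. */
  void stop() {
    serviceStateLock.writeLock().lock();
    try {
      isServiceStarted = false;
      renewerService.shutdown();
    } finally {
      serviceStateLock.writeLock().unlock();
    }
  }

  int pendingCount() {
    return pendingEventQueue.size();
  }

  public static void main(String[] args) {
    RenewerStopSketch sketch = new RenewerStopSketch();
    sketch.stop();
    sketch.process(() -> { }); // queued; no RejectedExecutionException
    System.out.println("pending events: " + sketch.pendingCount());
  }
}
```

Because the write lock cannot be acquired while any thread holds the read lock, a caller that already observed isServiceStarted as true finishes its execute() call before the pool is shut down, and every later caller sees the flag as false and takes the queueing branch.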

        Attachments

        1. YARN-7962.7.patch
          3 kB
          Billie Rinaldi
        2. YARN-7962.6.patch
          4 kB
          Wangda Tan
        3. YARN-7962.4.patch
          4 kB
          BELUGA BEHR
        4. YARN-7962.3.patch
          4 kB
          Billie Rinaldi
        5. YARN-7962.2.patch
          3 kB
          BELUGA BEHR
        6. YARN-7962.1.patch
          2 kB
          BELUGA BEHR

              People

              • Assignee: BELUGA BEHR (belugabehr)
              • Reporter: BELUGA BEHR (belugabehr)
              • Votes: 0
              • Watchers: 5