Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-2874

Dead lock in "DelegationTokenRenewer" which blocks RM to execute any further apps

    Details

    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      When token renewal fails and the application finishes this dead lock can occur
      Jstack dump :

      Found one Java-level deadlock:
      =============================
      "DelegationTokenRenewer #181865":
      waiting to lock monitor 0x0000000000900918 (object 0x00000000c18a9998, a java.util.Collections$SynchronizedSet),
      which is held by "DelayedTokenCanceller"
      "DelayedTokenCanceller":
      waiting to lock monitor 0x0000000004141718 (object 0x00000000c7eae720, a org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$RenewalTimerTask),
      which is held by "Timer-4"
      "Timer-4":
      waiting to lock monitor 0x0000000000900918 (object 0x00000000c18a9998, a java.util.Collections$SynchronizedSet),
      which is held by "DelayedTokenCanceller"

      Java stack information for the threads listed above:
      ===================================================
      "DelegationTokenRenewer #181865":
      at java.util.Collections$SynchronizedCollection.add(Collections.java:1636)

      • waiting to lock <0x00000000c18a9998> (a java.util.Collections$SynchronizedSet)
        at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.addTokenToList(DelegationTokenRenewer.java:322)
        at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:398)
        at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$500(DelegationTokenRenewer.java:70)
        at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.handleDTRenewerAppSubmitEvent(DelegationTokenRenewer.java:657)
        at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:638)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
        "DelayedTokenCanceller":
        at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$RenewalTimerTask.cancel(DelegationTokenRenewer.java:443)
      • waiting to lock <0x00000000c7eae720> (a org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$RenewalTimerTask)
        at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.removeApplicationFromRenewal(DelegationTokenRenewer.java:558)
      • locked <0x00000000c18a9998> (a java.util.Collections$SynchronizedSet)
        at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$300(DelegationTokenRenewer.java:70)
        at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelayedTokenRemovalRunnable.run(DelegationTokenRenewer.java:599)
        at java.lang.Thread.run(Thread.java:745)
        "Timer-4":
        at java.util.Collections$SynchronizedCollection.remove(Collections.java:1639)
      • waiting to lock <0x00000000c18a9998> (a java.util.Collections$SynchronizedSet)
        at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.removeFailedDelegationToken(DelegationTokenRenewer.java:503)
        at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$100(DelegationTokenRenewer.java:70)
        at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$RenewalTimerTask.run(DelegationTokenRenewer.java:437)
      • locked <0x00000000c7eae720> (a org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$RenewalTimerTask)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)

      Found 1 deadlock.

      1. YARN-2874.20141118-2.patch
        2 kB
        Naganarasimha G R
      2. YARN-2874.20141118-1.patch
        2 kB
        Naganarasimha G R

        Issue Links

          Activity

          Hide
          vinodkv Vinod Kumar Vavilapalli added a comment -

          Pulled this into 2.6.1. Ran compilation before the push. Patch applied cleanly.

          Show
          vinodkv Vinod Kumar Vavilapalli added a comment - Pulled this into 2.6.1. Ran compilation before the push. Patch applied cleanly.
          Hide
          Naganarasimha Naganarasimha G R added a comment -

          Hi Karthik Kambatla & Tsuyoshi Ozawa
          Thanks for reviewing and commiting the patch .

          Show
          Naganarasimha Naganarasimha G R added a comment - Hi Karthik Kambatla & Tsuyoshi Ozawa Thanks for reviewing and commiting the patch .
          Hide
          kasha Karthik Kambatla added a comment -

          Thanks Naganarasimha for your contribution. Just committed this to trunk and branch-2.

          Show
          kasha Karthik Kambatla added a comment - Thanks Naganarasimha for your contribution. Just committed this to trunk and branch-2.
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #6645 (See https://builds.apache.org/job/Hadoop-trunk-Commit/6645/)
          YARN-2874. Dead lock in DelegationTokenRenewer which blocks RM to execute any further apps. (Naganarasimha G R via kasha) (kasha: rev 799353e2c7db5af6e40e3521439b5c8a3c5c6a51)

          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java
          Show
          hudson Hudson added a comment - FAILURE: Integrated in Hadoop-trunk-Commit #6645 (See https://builds.apache.org/job/Hadoop-trunk-Commit/6645/ ) YARN-2874 . Dead lock in DelegationTokenRenewer which blocks RM to execute any further apps. (Naganarasimha G R via kasha) (kasha: rev 799353e2c7db5af6e40e3521439b5c8a3c5c6a51) hadoop-yarn-project/CHANGES.txt hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java
          Hide
          kasha Karthik Kambatla added a comment -

          Checking this in.

          Show
          kasha Karthik Kambatla added a comment - Checking this in.
          Hide
          ozawa Tsuyoshi Ozawa added a comment -

          Naganarasimha G R, never mind, your patch looks good to me. +1

          Show
          ozawa Tsuyoshi Ozawa added a comment - Naganarasimha G R , never mind, your patch looks good to me. +1
          Hide
          kasha Karthik Kambatla added a comment -

          Sid's suggestion was to add a variable to Token to capture that it is to be cancelled, so it works across any duplicate TimerTasks as well. In any case, let us follow up on addressing that race in YARN-2919 and get this patch in first.

          The patch itself looks good to me. +1

          Show
          kasha Karthik Kambatla added a comment - Sid's suggestion was to add a variable to Token to capture that it is to be cancelled, so it works across any duplicate TimerTasks as well. In any case, let us follow up on addressing that race in YARN-2919 and get this patch in first. The patch itself looks good to me. +1
          Hide
          Naganarasimha Naganarasimha G R added a comment -

          Hi Tsuyoshi Ozawa & Karthik Kambatla,
          Thanks for the review and feed back. I put some effort to write the test code to reproduce this issue but as more and more sleeps and wait notify was required and was not consistently going into deadlock, i thought its not worth the effort as the dead lock scenario was easily detectable.

          RenewalTimerTask is a method which has a side effect, so the state can be invalid after the patch. We need to update the long error handling before merging it.

          Was not so clear about this statement as i was not able to get which state gets invalidated because of the fix and further you ( Tsuyoshi Ozawa) had mentioned Rethinking of this, this is not related to this JIRA. , so please if any thing more needs to be updated for this issue please inform.

          Regarding Sid's comment in MAPREDUCE-5384, If required to be be handled IIUC i need to revert my patch and redo as below (correct me if wrong and also inform if its req to be fixed in this way)

          @Override
              public void run() {
                if (cancelled) {
                  return;
                }
                Token<?> token = dttr.token;
                try {
          	synchronized (this) {
                      if (cancelled) {
                        return;
                      }
          	  requestNewHdfsDelegationTokenIfNeeded(dttr);
          	  // if the token is not replaced by a new token, renew the token
          	  if (appTokens.get(dttr.applicationId).contains(dttr)) {
          	    renewToken(dttr);
          	    setTimerForTokenRenewal(dttr);// set the next one
          	  } else {
          	  LOG.info("The token was removed already. Token = [" +dttr +"]");
          	  }
          	}
                } catch (Exception e) {
                  LOG.error("Exception renewing token" + token + ". Not rescheduled", e);
                  removeFailedDelegationToken(dttr);
                }
              }
          
          Show
          Naganarasimha Naganarasimha G R added a comment - Hi Tsuyoshi Ozawa & Karthik Kambatla , Thanks for the review and feed back. I put some effort to write the test code to reproduce this issue but as more and more sleeps and wait notify was required and was not consistently going into deadlock, i thought its not worth the effort as the dead lock scenario was easily detectable. RenewalTimerTask is a method which has a side effect, so the state can be invalid after the patch. We need to update the long error handling before merging it. Was not so clear about this statement as i was not able to get which state gets invalidated because of the fix and further you ( Tsuyoshi Ozawa ) had mentioned Rethinking of this, this is not related to this JIRA. , so please if any thing more needs to be updated for this issue please inform. Regarding Sid's comment in MAPREDUCE-5384 , If required to be be handled IIUC i need to revert my patch and redo as below (correct me if wrong and also inform if its req to be fixed in this way) @Override public void run() { if (cancelled) { return; } Token<?> token = dttr.token; try { synchronized (this) { if (cancelled) { return; } requestNewHdfsDelegationTokenIfNeeded(dttr); // if the token is not replaced by a new token, renew the token if (appTokens.get(dttr.applicationId).contains(dttr)) { renewToken(dttr); setTimerForTokenRenewal(dttr);// set the next one } else { LOG.info("The token was removed already. Token = [" +dttr +"]"); } } } catch (Exception e) { LOG.error("Exception renewing token" + token + ". Not rescheduled", e); removeFailedDelegationToken(dttr); } }
          Hide
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12682188/YARN-2874.20141118-2.patch
          against trunk revision 185e0c7.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5985//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5985//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12682188/YARN-2874.20141118-2.patch against trunk revision 185e0c7. +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . There were no new javadoc warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5985//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5985//console This message is automatically generated.
          Hide
          ozawa Tsuyoshi Ozawa added a comment -

          Mading a patch available.

          Show
          ozawa Tsuyoshi Ozawa added a comment - Mading a patch available.
          Hide
          ozawa Tsuyoshi Ozawa added a comment -

          Thanks for your comment, Karthik. I'm also okay with not adding the test case on this JIRA.

          That said, there is still a race along the lines of the one discusses in MAPREDUCE-5384.

          Sid's suggestion looks good to me.

          We need to update the long error handling before merging it.

          Rethinking of this, this is not related to this JIRA.

          Show
          ozawa Tsuyoshi Ozawa added a comment - Thanks for your comment, Karthik. I'm also okay with not adding the test case on this JIRA. That said, there is still a race along the lines of the one discusses in MAPREDUCE-5384 . Sid's suggestion looks good to me. We need to update the long error handling before merging it. Rethinking of this, this is not related to this JIRA.
          Hide
          kasha Karthik Kambatla added a comment -

          For purposes of fixing the deadlock, the patch looks good to me. I am not sure how easy it is to add a test for this, so I am okay with not including one.

          That said, there is still a race along the lines of the one discusses in MAPREDUCE-5384. We should at least create a follow-up JIRA to fix this; Sid's suggestion - https://issues.apache.org/jira/browse/MAPREDUCE-5384?focusedCommentId=13745421&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13745421 - should work.

          Show
          kasha Karthik Kambatla added a comment - For purposes of fixing the deadlock, the patch looks good to me. I am not sure how easy it is to add a test for this, so I am okay with not including one. That said, there is still a race along the lines of the one discusses in MAPREDUCE-5384 . We should at least create a follow-up JIRA to fix this; Sid's suggestion - https://issues.apache.org/jira/browse/MAPREDUCE-5384?focusedCommentId=13745421&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13745421 - should work.
          Hide
          kasha Karthik Kambatla added a comment -

          I was looking into something similar for MR1 a while ago. Let me take a look at this.

          Show
          kasha Karthik Kambatla added a comment - I was looking into something similar for MR1 a while ago. Let me take a look at this.
          Hide
          ozawa Tsuyoshi Ozawa added a comment -

          One more thing: dead lock itself can be fixed by the patch, but I have one concern - error handling while interrupted in RenewalTimerTask#run. RenewalTimerTask is a method which has a side effect, so the state can be invalid after the patch. We need to update the long error handling before merging it.

          Show
          ozawa Tsuyoshi Ozawa added a comment - One more thing: dead lock itself can be fixed by the patch, but I have one concern - error handling while interrupted in RenewalTimerTask#run. RenewalTimerTask is a method which has a side effect, so the state can be invalid after the patch. We need to update the long error handling before merging it.
          Hide
          ozawa Tsuyoshi Ozawa added a comment -

          Naganarasimha G R Thanks for you reporting. I dived into the code. I think this dead lock can be caused following code path:

          1. delayedRemovalThread.start > removeApplicationFromRenewal() > synchronize (delegationTokens) {} > dttr.timerTask.cancel() > DelegationTokenRenewerRunnable#handleDTRenewerAppSubmitEvent() > handleAppSubmitEvent > addTokenToList() > delegationTokens(Collections$SynchronizedSet)

          2. renewalTimer.schedule() > RenewalTimerTask#run > removeFailedDelegationToken > tr.timerTask.cancel() > DelegationTokenRenewerRunnable#handleDTRenewerAppSubmitEvent > handleAppSubmitEvent > addTokenToList > delegationTokens(Collections$SynchronizedSet)

          The current code path is as follows:

          1. delayedRemovalThread.start > removeApplicationFromRenewal() > synchronize(tokenSet){} > dttr.timerTask.cancel() > DelegationTokenRenewerRunnable#handleDTRenewerAppSubmitEvent() > handleAppSubmitEvent > appTokens.get(applicationId).add(dtr) # appTokens.get looks same to tokenSet

          2. renewalTimer.schedule() > RenewalTimerTask#run > removeFailedDelegationToken > tr.timerTask.cancel() > DelegationTokenRenewerRunnable#handleDTRenewerAppSubmitEvent > handleAppSubmitEvent > appTokens.get(applicationId).add(dtr)

          The cause of this issue is that the lock order between tokenSet and timerTask. I think the fix by Naganarasimha works well in this case. Jason Lowe, Karthik Kambatla, please let me know if I'm wrong.

          Show
          ozawa Tsuyoshi Ozawa added a comment - Naganarasimha G R Thanks for you reporting. I dived into the code. I think this dead lock can be caused following code path: 1. delayedRemovalThread.start > removeApplicationFromRenewal() > synchronize (delegationTokens) {} > dttr.timerTask.cancel() > DelegationTokenRenewerRunnable#handleDTRenewerAppSubmitEvent() > handleAppSubmitEvent > addTokenToList() > delegationTokens(Collections$SynchronizedSet) 2. renewalTimer.schedule() > RenewalTimerTask#run > removeFailedDelegationToken > tr.timerTask.cancel() > DelegationTokenRenewerRunnable#handleDTRenewerAppSubmitEvent > handleAppSubmitEvent > addTokenToList > delegationTokens(Collections$SynchronizedSet) The current code path is as follows: 1. delayedRemovalThread.start > removeApplicationFromRenewal() > synchronize(tokenSet){} > dttr.timerTask.cancel() > DelegationTokenRenewerRunnable#handleDTRenewerAppSubmitEvent() > handleAppSubmitEvent > appTokens.get(applicationId).add(dtr) # appTokens.get looks same to tokenSet 2. renewalTimer.schedule() > RenewalTimerTask#run > removeFailedDelegationToken > tr.timerTask.cancel() > DelegationTokenRenewerRunnable#handleDTRenewerAppSubmitEvent > handleAppSubmitEvent > appTokens.get(applicationId).add(dtr) The cause of this issue is that the lock order between tokenSet and timerTask. I think the fix by Naganarasimha works well in this case. Jason Lowe , Karthik Kambatla , please let me know if I'm wrong.
          Hide
          Naganarasimha Naganarasimha G R added a comment -

          Hi Jason Lowe & Karthik Kambatla,
          Hope the patch addresses the issue reported so shall i change the status to "patch available" or test case is also required for this ... ?

          Show
          Naganarasimha Naganarasimha G R added a comment - Hi Jason Lowe & Karthik Kambatla , Hope the patch addresses the issue reported so shall i change the status to "patch available" or test case is also required for this ... ?
          Hide
          Naganarasimha Naganarasimha G R added a comment -

          Updated patch with fixes for review comment

          Show
          Naganarasimha Naganarasimha G R added a comment - Updated patch with fixes for review comment
          Hide
          Naganarasimha Naganarasimha G R added a comment -

          Sorry was suppose to remove synchronized from both methods signature (RenewalTimerTask.run and RenewalTimerTask.cancel ). Will reupload the patch again ...

          Show
          Naganarasimha Naganarasimha G R added a comment - Sorry was suppose to remove synchronized from both methods signature (RenewalTimerTask.run and RenewalTimerTask.cancel ). Will reupload the patch again ...
          Hide
          jlowe Jason Lowe added a comment -

          Thanks for the report and the patch, Naganarasimha G R. However I don't see how the provided patch will resolve the deadlock issue. It only changed a boolean to an AtomicBoolean which in itself won't resolve an existing deadlock scenario. With the introduction of a concurrent data structure like AtomicBoolean I would expect some existing locks to be removed as a result.

          Show
          jlowe Jason Lowe added a comment - Thanks for the report and the patch, Naganarasimha G R . However I don't see how the provided patch will resolve the deadlock issue. It only changed a boolean to an AtomicBoolean which in itself won't resolve an existing deadlock scenario. With the introduction of a concurrent data structure like AtomicBoolean I would expect some existing locks to be removed as a result.

            People

            • Assignee:
              Naganarasimha Naganarasimha G R
              Reporter:
              Naganarasimha Naganarasimha G R
            • Votes:
              0 Vote for this issue
              Watchers:
              15 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development