Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-5364

Deadlock between RenewalTimerTask methods cancel() and run()

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.0
    • Fix Version/s: 1.2.1
    • Component/s: None
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      MAPREDUCE-4860 introduced a local variable cancelled in RenewalTimerTask to fix the race where DelegationTokenRenewal attempts to renew a token even after the job is removed. However, the patch also makes run() and cancel() synchronized methods leading to a potential deadlock against run()'s catch-block (error-path).

      The deadlock stacks below:

       - org.apache.hadoop.mapreduce.security.token.DelegationTokenRenewal$RenewalTimerTask.cancel() @bci=0, line=240 (Interpreted frame)
       - org.apache.hadoop.mapreduce.security.token.DelegationTokenRenewal.removeDelegationTokenRenewalForJob(org.apache.hadoop.mapreduce.JobID) @bci=109, line=319 (Interpreted frame)
      
       - org.apache.hadoop.mapreduce.security.token.DelegationTokenRenewal.removeFailedDelegationToken(org.apache.hadoop.mapreduce.security.token.DelegationTokenRenewal$DelegationTokenToRenew) @bci=62, line=297 (Interpreted frame)
       - org.apache.hadoop.mapreduce.security.token.DelegationTokenRenewal.access$300(org.apache.hadoop.mapreduce.security.token.DelegationTokenRenewal$DelegationTokenToRenew) @bci=1, line=47 (Interpreted frame)
       - org.apache.hadoop.mapreduce.security.token.DelegationTokenRenewal$RenewalTimerTask.run() @bci=148, line=234 (Interpreted frame)
      
      1. mr-5364-1.patch
        1 kB
        Karthik Kambatla
      2. mr-5364-addendum-1.patch
        3 kB
        Karthik Kambatla
      3. mr-5364-addendum-2.patch
        4 kB
        Karthik Kambatla

        Issue Links

          Activity

          Hide
          Karthik Kambatla added a comment -

          Thanks for your comments Sid. Opened MAPREDUCE-5384 to fix the race - the race doesn't cause any severe issues, even the possibility of a test failing are low. Will post a new patch addressing your comments on the new JIRA. Will mark this as resolved.

          Show
          Karthik Kambatla added a comment - Thanks for your comments Sid. Opened MAPREDUCE-5384 to fix the race - the race doesn't cause any severe issues, even the possibility of a test failing are low. Will post a new patch addressing your comments on the new JIRA. Will mark this as resolved.
          Hide
          Siddharth Seth added a comment -

          Looking at the code, I don't see a deadlock possibility. While a call to setTimerForTokenRenewal requires a lock on DelegationTokenRenewer.class, I don't see any method holding a lock on DelegationTokenRenewer.class requiring a lock on delegationTokens or cancelled flag. Am I missing something here?

          You're right. I was somehow considering removeDelegationTokenRenewalForJob to be a synchronized method. Sorry about that.

          This could be fixed via the original jira (MAPREDUCE-4860 or a new jira). The deadlock being resolved was the main issues in this jira which is already fixed. An extra renewal just leads to an additional exception message in the logs, correct ? or is it more severe than that (other than the failed unit test).

          Comments on the patch itself.
          The previous patch is likely better. One concern with the current patch - 'cancelled' is associated with the current RenewalTimerTask. If removeDelegationTokenRenewalForJob tries to cancel() while a token renewal is in progress - it effectively has no affect, since a new RenewalTimerTask would be scheduled. This may not be an issue since the reference to the DelegationTokenToRenew object will be removed from the list of delegationTokens. Since renew has been moved into DelegationTokenToRenew - I'd prefer having the cancel / intent to cancel associated with that as well.

          Show
          Siddharth Seth added a comment - Looking at the code, I don't see a deadlock possibility. While a call to setTimerForTokenRenewal requires a lock on DelegationTokenRenewer.class, I don't see any method holding a lock on DelegationTokenRenewer.class requiring a lock on delegationTokens or cancelled flag. Am I missing something here? You're right. I was somehow considering removeDelegationTokenRenewalForJob to be a synchronized method. Sorry about that. This could be fixed via the original jira ( MAPREDUCE-4860 or a new jira). The deadlock being resolved was the main issues in this jira which is already fixed. An extra renewal just leads to an additional exception message in the logs, correct ? or is it more severe than that (other than the failed unit test). Comments on the patch itself. The previous patch is likely better. One concern with the current patch - 'cancelled' is associated with the current RenewalTimerTask. If removeDelegationTokenRenewalForJob tries to cancel() while a token renewal is in progress - it effectively has no affect, since a new RenewalTimerTask would be scheduled. This may not be an issue since the reference to the DelegationTokenToRenew object will be removed from the list of delegationTokens. Since renew has been moved into DelegationTokenToRenew - I'd prefer having the cancel / intent to cancel associated with that as well.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12591288/mr-5364-addendum-2.patch
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3841//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12591288/mr-5364-addendum-2.patch against trunk revision . -1 patch . The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3841//console This message is automatically generated.
          Hide
          Karthik Kambatla added a comment -

          Reopening to get the addendum reviewed and committed.

          Show
          Karthik Kambatla added a comment - Reopening to get the addendum reviewed and committed.
          Hide
          Karthik Kambatla added a comment -

          A cancelled flag could be used on the DelegationTokenToRenew structure itself. Set intent to cancel before attempting to cancel the timer task, and check this during renewal and before queuing another renewal.

          On second thought, I don't think we should a cancelled flag to DelegationTokenToRenew to address a synchronization in RenewalTimerTask.

          Thinking more about this, the cleanest approach seemed to be:

          1. Serialize timer cancellation and token renewal - via synchronization
          2. On successful token renewal, call setTimerForTokenRenewal

          Uploaded patch (addendum-2) to implement this. Also, moved the token renewal code from RenewalTimerTask#run to DelegationTokenToRenew#renew for clarity.

          Siddharth Seth, can you take a look at the latest patch.

          Show
          Karthik Kambatla added a comment - A cancelled flag could be used on the DelegationTokenToRenew structure itself. Set intent to cancel before attempting to cancel the timer task, and check this during renewal and before queuing another renewal. On second thought, I don't think we should a cancelled flag to DelegationTokenToRenew to address a synchronization in RenewalTimerTask. Thinking more about this, the cleanest approach seemed to be: Serialize timer cancellation and token renewal - via synchronization On successful token renewal, call setTimerForTokenRenewal Uploaded patch (addendum-2) to implement this. Also, moved the token renewal code from RenewalTimerTask#run to DelegationTokenToRenew#renew for clarity. Siddharth Seth , can you take a look at the latest patch.
          Hide
          Karthik Kambatla added a comment -

          Thanks Sid.

          The addendum patch can cause deadlocks on the call to setTimerForTokenRenewal

          Looking at the code, I don't see a deadlock possibility. While a call to setTimerForTokenRenewal requires a lock on DelegationTokenRenewer.class, I don't see any method holding a lock on DelegationTokenRenewer.class requiring a lock on delegationTokens or cancelled flag. Am I missing something here?

          A cancelled flag could be used on the DelegationTokenToRenew structure itself. Set intent to cancel before attempting to cancel the timer task, and check this during renewal and before queuing another renewal.

          I think I like this approach better - setTimerForTokenRenewal can be called conditionally based on the success of DelegationTokenToRenew#renew(). Let me take a stab.

          Show
          Karthik Kambatla added a comment - Thanks Sid. The addendum patch can cause deadlocks on the call to setTimerForTokenRenewal Looking at the code, I don't see a deadlock possibility. While a call to setTimerForTokenRenewal requires a lock on DelegationTokenRenewer.class, I don't see any method holding a lock on DelegationTokenRenewer.class requiring a lock on delegationTokens or cancelled flag. Am I missing something here? A cancelled flag could be used on the DelegationTokenToRenew structure itself. Set intent to cancel before attempting to cancel the timer task, and check this during renewal and before queuing another renewal. I think I like this approach better - setTimerForTokenRenewal can be called conditionally based on the success of DelegationTokenToRenew#renew() . Let me take a stab.
          Hide
          Siddharth Seth added a comment -

          The current patch (mr-5364-1.patch, which has been committed) looks ok at least in terms of getting rid of the deadlock. As Karthik pointed out, this doesn't completely fix what MAPREDUCE-4860 was trying to fix.
          The addendum patch can cause deadlocks on the call to

          setTimerForTokenRenewal

          . Moving that out of the synchronized block will just cause an additional renewal to be scheduled after the token is cancelled - so that doesn't help much either.

          A cancelled flag could be used on the DelegationTokenToRenew structure itself. Set intent to cancel before attempting to cancel the timer task, and check this during renewal and before queuing another renewal. There's multiple ways this could be fixed.

          Show
          Siddharth Seth added a comment - The current patch (mr-5364-1.patch, which has been committed) looks ok at least in terms of getting rid of the deadlock. As Karthik pointed out, this doesn't completely fix what MAPREDUCE-4860 was trying to fix. The addendum patch can cause deadlocks on the call to setTimerForTokenRenewal . Moving that out of the synchronized block will just cause an additional renewal to be scheduled after the token is cancelled - so that doesn't help much either. A cancelled flag could be used on the DelegationTokenToRenew structure itself. Set intent to cancel before attempting to cancel the timer task, and check this during renewal and before queuing another renewal. There's multiple ways this could be fixed.
          Hide
          Karthik Kambatla added a comment -

          Posting a patch that addresses the serialization of run()'s token renewal and cancel() via synchronizing on the Boolean cancelled. The catch-block in run() is left out of this synchronization to avoid the deadlock reported here.

          Locking:

          1. cancel() is called with a lock on delegationTokens.
          2. The synchronized block in run() doesn't call any method that requires a lock on delegationTokens. The only lock required is on DelgationTokenRenewer.class; however, the static synchronized methods there don't require a lock on RenewalTimerTask#cancelled.

          Arun C Murthy, can you please take a look at this patch? Do you have any cleaner approaches to this in mind?

          Show
          Karthik Kambatla added a comment - Posting a patch that addresses the serialization of run()'s token renewal and cancel() via synchronizing on the Boolean cancelled . The catch-block in run() is left out of this synchronization to avoid the deadlock reported here. Locking: cancel() is called with a lock on delegationTokens. The synchronized block in run() doesn't call any method that requires a lock on delegationTokens. The only lock required is on DelgationTokenRenewer.class ; however, the static synchronized methods there don't require a lock on RenewalTimerTask#cancelled. Arun C Murthy , can you please take a look at this patch? Do you have any cleaner approaches to this in mind?
          Hide
          Karthik Kambatla added a comment -

          Arun C Murthy, you are right. With the current fix, this fix re-exposes the race reported in MAPREDUCE-4860 albeit in a smaller window. Let me post an addendum patch here that fixes the original issue without introducing a deadlock.

          Show
          Karthik Kambatla added a comment - Arun C Murthy , you are right. With the current fix, this fix re-exposes the race reported in MAPREDUCE-4860 albeit in a smaller window. Let me post an addendum patch here that fixes the original issue without introducing a deadlock.
          Hide
          Alejandro Abdelnur added a comment -

          Arun C Murthy, I was about to do it but I've seen you just did, thx.

          Show
          Alejandro Abdelnur added a comment - Arun C Murthy , I was about to do it but I've seen you just did, thx.
          Hide
          Arun C Murthy added a comment -

          For now I've committed this to 1.2.1, lets' revisit MAPREDUCE-4860 for branch-1.3.

          Show
          Arun C Murthy added a comment - For now I've committed this to 1.2.1, lets' revisit MAPREDUCE-4860 for branch-1.3.
          Hide
          Arun C Murthy added a comment -

          IAC, should we commit this to 1.2.1 or revert MAPREDUCE-4860?

          Show
          Arun C Murthy added a comment - IAC, should we commit this to 1.2.1 or revert MAPREDUCE-4860 ?
          Hide
          Arun C Murthy added a comment -

          Looks like we could still have a race condn. where the check passes in run(), but is cancel is called b/w the check and the the doAs block?

          I'm very concerned about the original fix itself i.e. MAPREDUCE-4860, should we revisit it?

          Show
          Arun C Murthy added a comment - Looks like we could still have a race condn. where the check passes in run(), but is cancel is called b/w the check and the the doAs block? I'm very concerned about the original fix itself i.e. MAPREDUCE-4860 , should we revisit it?
          Hide
          Arun C Murthy added a comment -

          Alejandro Abdelnur Shouldn't this get into 1.2.1 too?

          Show
          Arun C Murthy added a comment - Alejandro Abdelnur Shouldn't this get into 1.2.1 too?
          Hide
          Alejandro Abdelnur added a comment -

          Thanks Karthik. Committed to branch-1.

          Show
          Alejandro Abdelnur added a comment - Thanks Karthik. Committed to branch-1.
          Hide
          Alejandro Abdelnur added a comment -

          +1, LGTM.

          Show
          Alejandro Abdelnur added a comment - +1, LGTM.
          Hide
          Karthik Kambatla added a comment -

          The patch doesn't apply because it is for branch-1.

          Show
          Karthik Kambatla added a comment - The patch doesn't apply because it is for branch-1.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12590296/mr-5364-1.patch
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3820//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12590296/mr-5364-1.patch against trunk revision . -1 patch . The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3820//console This message is automatically generated.
          Hide
          Karthik Kambatla added a comment -

          We observed that this happens only for very short jobs.

          To validate the patch, we ran a workload where we were seeing this in 1 out of 15 jobs on an average. With the patch, we ran 280 jobs without running into this.

          Added no tests, as the issue could not be reproduced deterministically.

          Show
          Karthik Kambatla added a comment - We observed that this happens only for very short jobs. To validate the patch, we ran a workload where we were seeing this in 1 out of 15 jobs on an average. With the patch, we ran 280 jobs without running into this. Added no tests, as the issue could not be reproduced deterministically.
          Hide
          Karthik Kambatla added a comment -

          Uploading a patch that uses an AtomicBoolean for cancelled and does away with the synchronized methods.

          Show
          Karthik Kambatla added a comment - Uploading a patch that uses an AtomicBoolean for cancelled and does away with the synchronized methods.

            People

            • Assignee:
              Karthik Kambatla
              Reporter:
              Karthik Kambatla
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development