Hadoop HDFS / HDFS-3374

hdfs' TestDelegationToken fails intermittently with a race condition

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.3
    • Component/s: namenode
    • Labels:
      None
    • Target Version/s:

      Description

      The test case fails because the MiniDFSCluster is shut down before the secret manager can change the key. The key update then finds no edit streams available and calls system.exit.

      
          [junit] 2012-05-04 15:03:51,521 WARN  common.Storage (FSImage.java:updateRemovedDirs(224)) - Removing storage dir /home/horton/src/hadoop/build/test/data/dfs/name1
          [junit] 2012-05-04 15:03:51,522 FATAL namenode.FSNamesystem (FSEditLog.java:fatalExit(388)) - No edit streams are accessible
          [junit] java.lang.Exception: No edit streams are accessible
          [junit]     at org.apache.hadoop.hdfs.server.namenode.FSEditLog.fatalExit(FSEditLog.java:388)
          [junit]     at org.apache.hadoop.hdfs.server.namenode.FSEditLog.exitIfNoStreams(FSEditLog.java:407)
          [junit]     at org.apache.hadoop.hdfs.server.namenode.FSEditLog.removeEditsAndStorageDir(FSEditLog.java:432)
          [junit]     at org.apache.hadoop.hdfs.server.namenode.FSEditLog.removeEditsStreamsAndStorageDirs(FSEditLog.java:468)
          [junit]     at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:1028)
          [junit]     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.logUpdateMasterKey(FSNamesystem.java:5641)
          [junit]     at org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenSecretManager.logUpdateMasterKey(DelegationTokenSecretManager.java:286)
          [junit]     at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.updateCurrentKey(AbstractDelegationTokenSecretManager.java:150)
          [junit]     at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.rollMasterKey(AbstractDelegationTokenSecretManager.java:174)
          [junit]     at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager$ExpiredTokenRemover.run(AbstractDelegationTokenSecretManager.java:385)
          [junit]     at java.lang.Thread.run(Thread.java:662)
          [junit] Running org.apache.hadoop.hdfs.security.TestDelegationToken
          [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
          [junit] Test org.apache.hadoop.hdfs.security.TestDelegationToken FAILED (crashed)
      
      1. HDFS-3374.trunk.patch
        1.0 kB
        Brandon Li
      2. HDFS-3374.patch
        3 kB
        Matt Foley
      3. HDFS-3374-branch-1.0.patch
        3 kB
        Matt Foley
      4. hdfs-3374.patch
        3 kB
        Owen O'Malley

        Issue Links

          Activity

          Owen O'Malley added a comment -

          The patch fixes the synchronization for the renewer thread and has the test case shut down the threads before the namenode.
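
As an illustration of the shutdown ordering described above, here is a minimal Java sketch. It is not the attached patch: FakeEditLog, KeyRoller, and shutDownInOrder are hypothetical stand-ins for the edit log, the key-rolling thread, and the test teardown. The only point is that the background thread is stopped and joined before the resource it writes to is closed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class OrderedShutdownSketch {
    // Hypothetical stand-in for the edit log: writes fail once it is closed.
    static class FakeEditLog {
        private boolean open = true;

        synchronized void write(String entry) {
            if (!open) {
                throw new IllegalStateException("No edit streams are accessible");
            }
        }

        synchronized void close() {
            open = false;
        }
    }

    // Hypothetical stand-in for the thread that periodically rolls the key.
    static class KeyRoller implements Runnable {
        private final FakeEditLog log;
        private final List<String> errors;
        private volatile boolean running = true;
        private Thread thread;

        KeyRoller(FakeEditLog log, List<String> errors) {
            this.log = log;
            this.errors = errors;
        }

        void start() {
            thread = new Thread(this, "key-roller");
            thread.start();
        }

        public void run() {
            while (running) {
                try {
                    log.write("update master key");
                    Thread.sleep(1);
                } catch (InterruptedException e) {
                    return; // asked to stop
                } catch (IllegalStateException e) {
                    errors.add(e.getMessage()); // wrote after the log was closed
                    return;
                }
            }
        }

        // Signals the thread to stop and waits for it to exit.
        void stopAndJoin() {
            running = false;
            thread.interrupt();
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    // Teardown in the safe order; returns any write errors the thread recorded.
    static List<String> shutDownInOrder() {
        List<String> errors = Collections.synchronizedList(new ArrayList<String>());
        FakeEditLog log = new FakeEditLog();
        KeyRoller roller = new KeyRoller(log, errors);
        roller.start();
        roller.stopAndJoin(); // 1. stop the background thread first
        log.close();          // 2. only then close what it writes to
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("errors after ordered shutdown: " + shutDownInOrder());
    }
}
```

Reversing the two teardown steps reopens the window in which the thread can write to an already closed log.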

          Todd Lipcon added a comment -

          Hi Owen. Is this issue not present in trunk? Seems like this is a branch-0.20-only patch.

          Matt Foley added a comment -

          Renamed Owen's patch as the branch-1.0 patch.

          Matt Foley added a comment -

          Candidate patch for trunk. Haven't had adequate time to test it yet, but we'll let test-patch run on it.

          Matt Foley added a comment -

          +1 on Owen's patch for branch-1 and branch-1.0. Committing to same.

          Leaving Jira open for completion of corresponding patch to trunk.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12525804/HDFS-3374.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          -1 javadoc. The javadoc tool appears to have generated 2 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2384//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2384//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          Why does this merit an exception to the policy that we commit changes to trunk first, and to the maintenance branch only after they are in trunk?

          Todd Lipcon added a comment -

          This is still only in branch-1 and not in trunk. Any plans to forward port?

          Also, jcarder noticed that this added a lock order inversion:

          • FSNamesystem.saveNamespace (holding FSN lock) calls DTSM.saveSecretManagerState (which takes DTSM lock)
          • ExpiredTokenRemover.run (holding DTSM lock) calls rollMasterKey, which calls updateCurrentKey, which calls logUpdateMasterKey, which takes the FSN lock

          So if a concurrent saveNamespace runs at the same time as the expired token remover, the NN might deadlock.
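
The inversion can be reproduced in isolation with a minimal Java sketch. FSN and DTSM below are hypothetical stand-ins for the two locks, and none of this is Hadoop code. The sketch uses tryLock with a timeout so that the inverted case reports a failure instead of hanging; with plain blocking locks the same interleaving would deadlock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderSketch {
    // Hypothetical stand-ins for the FSNamesystem and secret manager locks.
    static final ReentrantLock FSN = new ReentrantLock();
    static final ReentrantLock DTSM = new ReentrantLock();

    // Starts a thread that takes `first`, waits until its peer also holds a
    // lock, then tries `second` with a timeout. Counts the failed attempts.
    static Thread contender(ReentrantLock first, ReentrantLock second,
                            CountDownLatch bothHoldFirst, CountDownLatch bothTried,
                            AtomicInteger blocked) {
        Thread t = new Thread(() -> {
            first.lock();
            try {
                bothHoldFirst.countDown();
                bothHoldFirst.await();
                if (second.tryLock(100, TimeUnit.MILLISECONDS)) {
                    second.unlock();
                } else {
                    blocked.incrementAndGet();
                }
                bothTried.countDown();
                bothTried.await(); // keep holding `first` until the peer has tried too
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                first.unlock();
            }
        });
        t.start();
        return t;
    }

    static void join(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Opposite acquisition orders: returns how many threads could not proceed.
    static int invertedOrder() {
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        AtomicInteger blocked = new AtomicInteger();
        Thread a = contender(FSN, DTSM, bothHoldFirst, bothTried, blocked); // FSN then DTSM
        Thread b = contender(DTSM, FSN, bothHoldFirst, bothTried, blocked); // DTSM then FSN
        join(a);
        join(b);
        return blocked.get();
    }

    // Same acquisition order in both threads: returns how many threads finished.
    static int consistentOrder() {
        AtomicInteger completed = new AtomicInteger();
        Runnable work = () -> {
            FSN.lock();
            try {
                DTSM.lock();
                try {
                    completed.incrementAndGet();
                } finally {
                    DTSM.unlock();
                }
            } finally {
                FSN.unlock();
            }
        };
        Thread a = new Thread(work);
        Thread b = new Thread(work);
        a.start();
        b.start();
        join(a);
        join(b);
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("blocked with inverted order: " + invertedOrder());
        System.out.println("completed with consistent order: " + consistentOrder());
    }
}
```

Taking the locks in one agreed order is one general remedy. Another is to avoid holding one lock while calling into code that takes the other.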

          Brandon Li added a comment -

          The synchronization inside ExpiredTokenRemover.run() is unnecessary and could cause the deadlock.

          Uploaded a trunk patch to fix the test case.
          I will upload a branch-1 patch to remove the synchronization in ExpiredTokenRemover.run().

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12567460/HDFS-3374.trunk.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.datanode.TestBlockRecovery

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3925//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3925//console

          This message is automatically generated.

          Suresh Srinivas added a comment -

          "I will upload a branch-1 patch to remove the synchronization in ExpiredTokenRemover.run()."

          Can you please do this in a separate jira?

          Brandon Li added a comment -

          Sure. Created HDFS-4466 to remove the possible deadlock in branch-1.

          Brandon Li added a comment -

          The race condition in TestDelegationToken doesn't exist in trunk or 2.0.
          This is because of the change introduced in HDFS-2579, although HDFS-2579 was intended to fix a different issue.

          In trunk and 2.0, the edit log write (logUpdateMasterKey) is protected by a noInterruptsLock object.
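
A minimal Java sketch of how such a lock can keep an interrupt from landing in the middle of a write follows. It is an assumption about the general pattern, not a copy of the trunk code: only the name noInterruptsLock comes from this issue, and the class, writeEntry, and the counters are hypothetical. The writer does its critical write while holding the lock and checks for a pending interrupt first; the stopper takes the same lock before interrupting.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class NoInterruptsSketch {
    private final Object noInterruptsLock = new Object();
    private volatile boolean running = true;
    private final Thread worker;

    // Writes that observed an interrupt while in progress (should stay at 0).
    final AtomicInteger interruptedWrites = new AtomicInteger();
    final AtomicInteger completedWrites = new AtomicInteger();

    NoInterruptsSketch() {
        worker = new Thread(this::loop, "key-roller");
    }

    private void loop() {
        while (running) {
            synchronized (noInterruptsLock) {
                // An interrupt may have arrived before we took the lock:
                // bail out rather than start a write with the flag set.
                if (Thread.interrupted()) {
                    return;
                }
                writeEntry();
            }
            try {
                Thread.sleep(1); // outside the lock, safe to interrupt
            } catch (InterruptedException e) {
                return;
            }
        }
    }

    // Hypothetical multi-step write that must not be interrupted part-way.
    private void writeEntry() {
        for (int step = 0; step < 1000; step++) {
            if (Thread.currentThread().isInterrupted()) {
                interruptedWrites.incrementAndGet();
                return;
            }
        }
        completedWrites.incrementAndGet();
    }

    void start() {
        worker.start();
    }

    void stop() {
        running = false;
        synchronized (noInterruptsLock) {
            worker.interrupt(); // cannot happen while a write holds the lock
        }
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Runs the worker briefly, stops it, and returns the interrupted-write count.
    static int demo() {
        NoInterruptsSketch sketch = new NoInterruptsSketch();
        sketch.start();
        try {
            Thread.sleep(20);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        sketch.stop();
        return sketch.interruptedWrites.get();
    }

    public static void main(String[] args) {
        System.out.println("writes interrupted part-way: " + demo());
    }
}
```

Because the interrupt is only ever delivered while the stopper holds the lock, the worker is either outside the critical section or has not yet entered it when the flag is set.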

          Suresh Srinivas added a comment -

          +1 for the trunk patch.

          Brandon Li added a comment -

          Please ignore the trunk patch. The test issue doesn't exist in either trunk or 2.0.


            People

            • Assignee:
              Owen O'Malley
              Reporter:
              Owen O'Malley
            • Votes:
              0
              Watchers:
              9
