Hadoop HDFS / HDFS-3374

hdfs' TestDelegationToken fails intermittently with a race condition

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.3
    • Component/s: namenode
    • Labels:
      None

      Description

      The test case fails because the MiniDFSCluster is shut down before the secret manager finishes changing the key. The key update then tries to write to the edit log, finds no edit streams available, and calls System.exit.

          [junit] 2012-05-04 15:03:51,521 WARN  common.Storage (FSImage.java:updateRemovedDirs(224)) - Removing storage dir /home/horton/src/hadoop/build/test/data/dfs/name1
          [junit] 2012-05-04 15:03:51,522 FATAL namenode.FSNamesystem (FSEditLog.java:fatalExit(388)) - No edit streams are accessible
          [junit] java.lang.Exception: No edit streams are accessible
          [junit]     at org.apache.hadoop.hdfs.server.namenode.FSEditLog.fatalExit(FSEditLog.java:388)
          [junit]     at org.apache.hadoop.hdfs.server.namenode.FSEditLog.exitIfNoStreams(FSEditLog.java:407)
          [junit]     at org.apache.hadoop.hdfs.server.namenode.FSEditLog.removeEditsAndStorageDir(FSEditLog.java:432)
          [junit]     at org.apache.hadoop.hdfs.server.namenode.FSEditLog.removeEditsStreamsAndStorageDirs(FSEditLog.java:468)
          [junit]     at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:1028)
          [junit]     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.logUpdateMasterKey(FSNamesystem.java:5641)
          [junit]     at org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenSecretManager.logUpdateMasterKey(DelegationTokenSecretManager.java:286)
          [junit]     at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.updateCurrentKey(AbstractDelegationTokenSecretManager.java:150)
          [junit]     at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.rollMasterKey(AbstractDelegationTokenSecretManager.java:174)
          [junit]     at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager$ExpiredTokenRemover.run(AbstractDelegationTokenSecretManager.java:385)
          [junit]     at java.lang.Thread.run(Thread.java:662)
          [junit] Running org.apache.hadoop.hdfs.security.TestDelegationToken
          [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
          [junit] Test org.apache.hadoop.hdfs.security.TestDelegationToken FAILED (crashed)
      
      1. hdfs-3374.patch
        3 kB
        Owen O'Malley
      2. HDFS-3374-branch-1.0.patch
        3 kB
        Matt Foley
      3. HDFS-3374.patch
        3 kB
        Matt Foley
      4. HDFS-3374.trunk.patch
        1.0 kB
        Brandon Li

          Activity

          brandonli Brandon Li added a comment -

          Please ignore the trunk patch. The test issue doesn't exist in either trunk or 2.0.

          sureshms Suresh Srinivas added a comment -

          +1 for the trunk patch.

          brandonli Brandon Li added a comment -

          The race condition in TestDelegationToken doesn't exist in trunk or 2.0.
          This is because of the change introduced in HDFS-2579, although HDFS-2579 was intended to fix a different issue.

          In trunk and 2.0, the edit log write (logUpdateMasterKey) is protected by a noInterruptsLock object.

          brandonli Brandon Li added a comment -

          Sure. Created HDFS-4466 to remove the possible deadlock in branch-1.

          sureshms Suresh Srinivas added a comment -

          "I will upload a branch-1 patch to remove the synchronization in ExpiredTokenRemover.run()."

          Can you please do this in a separate jira?

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12567460/HDFS-3374.trunk.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.datanode.TestBlockRecovery

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3925//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3925//console

          This message is automatically generated.

          brandonli Brandon Li added a comment -

          The synchronization inside ExpiredTokenRemover.run() is unnecessary and could cause a deadlock.

          Uploaded a trunk patch to fix the test case.
          I will upload a branch-1 patch to remove the synchronization in ExpiredTokenRemover.run().

          tlipcon Todd Lipcon added a comment -

          This is still only in branch-1 and not in trunk. Any plans to forward port?

          Also, jcarder noticed that this added a lock order inversion:

          • FSNamesystem.saveNamespace (holding the FSN lock) calls DTSM.saveSecretManagerState (which takes the DTSM lock)
          • ExpiredTokenRemover.run (holding the DTSM lock) calls rollMasterKey, which calls updateCurrentKey, which calls logUpdateMasterKey, which takes the FSN lock

          So if there is a concurrent saveNamespace at the same time as the expired token remover runs, it might make the NN deadlock.
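
The inversion described in this comment can be illustrated with a small sketch. The class below is a hypothetical, simplified lock-order recorder written only for illustration: it is not jcarder's implementation and uses no Hadoop code. The names `LockOrderGraph`, `recordPath`, and `hasInversion` are assumptions, and "FSN" and "DTSM" are plain labels for the two locks named above. It only detects pairwise inversions (two locks taken in opposite orders), which is the case reported here.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical lock-order recorder; a sketch, not jcarder or Hadoop code. */
class LockOrderGraph {
    // heldBefore.get("A") contains "B" if some call path acquired B while holding A.
    private final Map<String, Set<String>> heldBefore = new HashMap<>();

    /** Records one call path as the locks it acquires, outermost first. */
    void recordPath(String... locks) {
        for (int i = 0; i < locks.length; i++) {
            for (int j = i + 1; j < locks.length; j++) {
                heldBefore.computeIfAbsent(locks[i], k -> new HashSet<>()).add(locks[j]);
            }
        }
    }

    /** Returns true if any two locks were acquired in opposite orders by different paths. */
    boolean hasInversion() {
        for (Map.Entry<String, Set<String>> entry : heldBefore.entrySet()) {
            for (String later : entry.getValue()) {
                Set<String> reverse = heldBefore.getOrDefault(later, Collections.emptySet());
                if (reverse.contains(entry.getKey())) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Recording only the saveNamespace path with `recordPath("FSN", "DTSM")` leaves `hasInversion()` false. Adding the token remover path with `recordPath("DTSM", "FSN")` makes it true, which is the condition under which two threads can each hold one lock while waiting for the other.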

          tlipcon Todd Lipcon added a comment -

          Why does this merit an exception to the policy that we commit things to trunk first, and to the maintenance branch only after they are in trunk?

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12525804/HDFS-3374.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          -1 javadoc. The javadoc tool appears to have generated 2 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2384//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2384//console

          This message is automatically generated.

          mattf Matt Foley added a comment -

          +1 on Owen's patch for branch-1 and branch-1.0. Committing to same.

          Leaving Jira open for completion of corresponding patch to trunk.

          mattf Matt Foley added a comment -

          Candidate patch for trunk. Haven't had adequate time to test it yet, but we'll let test-patch run on it.

          mattf Matt Foley added a comment -

          Renamed Owen's patch as the branch-1.0 patch.

          tlipcon Todd Lipcon added a comment -

          Hi Owen. Is this issue not present in trunk? Seems like this is a branch-0.20-only patch.

          owen.omalley Owen O'Malley added a comment -

          The patch fixes the synchronization for the renewer thread and has the test case shut down the threads before the namenode.
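
The shutdown ordering described in this comment can be sketched roughly as follows. This is not the patch itself: `FakeEditLog` and `KeyRoller` are hypothetical stand-ins for the edit log and the secret manager's background thread, and the method names `startThreads` and `stopThreads` are chosen here only for readability. The point is that the background thread is stopped and joined before the resource it writes to is closed.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the edit log; counts writes that arrive after close(). */
class FakeEditLog {
    private final AtomicBoolean closed = new AtomicBoolean(false);
    private final AtomicInteger lateWrites = new AtomicInteger();

    void write() {
        if (closed.get()) {
            // In the failing test, this is where the fatal "no edit streams" exit happened.
            lateWrites.incrementAndGet();
        }
    }

    void close() {
        closed.set(true);
    }

    int writesAfterClose() {
        return lateWrites.get();
    }
}

/** Hypothetical stand-in for the secret manager's background key-rolling thread. */
class KeyRoller {
    private final FakeEditLog log;
    private volatile boolean running;
    private Thread worker;

    KeyRoller(FakeEditLog log) {
        this.log = log;
    }

    void startThreads() {
        running = true;
        worker = new Thread(() -> {
            while (running) {
                log.write(); // simulates logging a master key update
                try {
                    Thread.sleep(1);
                } catch (InterruptedException e) {
                    return; // interrupted by stopThreads(): exit promptly
                }
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    /** Stops the worker and waits until it has fully exited. */
    void stopThreads() {
        running = false;
        worker.interrupt();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
        }
    }
}
```

A test teardown following this pattern would call `stopThreads()` first and `close()` second. Because `stopThreads()` returns only after the worker has exited, no write can reach the log after it is closed.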


            People

            • Assignee:
              owen.omalley Owen O'Malley
              Reporter:
              owen.omalley Owen O'Malley