Hadoop HDFS
HDFS-988

saveNamespace race can corrupt the edits log

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.20-append, 0.21.0, 0.22.0
    • Fix Version/s: 0.20-append, 0.20.205.0, 0.22.0
    • Component/s: namenode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Tags:
      hbase

      Description

      The administrator puts the namenode in safemode and then issues the saveNamespace command. This can corrupt the edits log. The problem is that when the NN enters safemode, there could still be pending logSyncs occurring from other threads. Now, the saveNamespace command, when executed, would save an edits log with partial writes. I have seen this happen on 0.20.

      https://issues.apache.org/jira/browse/HDFS-909?focusedCommentId=12828853&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12828853
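
      As an illustration only (not the committed fix), the hazard is that edit-log writers and saveNamespace are not serialized. The sketch below is a hypothetical simplification in plain Java; the class, field, and method names are illustrative and are not the actual NameNode code.

      import java.util.concurrent.locks.ReentrantReadWriteLock;

      // Hypothetical sketch: serialize saveNamespace() against in-flight edit writers.
      class SaveNamespaceSketch {
        private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock(true);

        void logEditAndSync() {
          fsLock.readLock().lock();      // handler threads may write edits concurrently
          try {
            // logEdit(...); logSync();  // write and flush the transaction
          } finally {
            fsLock.readLock().unlock();
          }
        }

        void saveNamespace() {
          fsLock.writeLock().lock();     // excludes concurrent edit writers
          try {
            // roll the edits file and write the image only when no write is in flight
          } finally {
            fsLock.writeLock().unlock();
          }
        }
      }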

      Attachments

      1. 988-fixups.txt
        6 kB
        Todd Lipcon
      2. HDFS-988_fix_synchs.patch
        3 kB
        Matt Foley
      3. HDFS-988.20-security.patch
        10 kB
        Jitendra Nath Pandey
      4. hdfs-988.txt
        25 kB
        Todd Lipcon
      5. hdfs-988-2.patch
        34 kB
        Eli Collins
      6. hdfs-988-3.patch
        36 kB
        Eli Collins
      7. hdfs-988-4.patch
        50 kB
        Eli Collins
      8. hdfs-988-5.patch
        141 kB
        Eli Collins
      9. hdfs-988-6.patch
        177 kB
        Eli Collins
      10. hdfs-988-7.patch
        183 kB
        Eli Collins
      11. hdfs-988-b22-1.patch
        20 kB
        Eli Collins
      12. saveNamespace_20-append.patch
        9 kB
        Nicolas Spiegelberg
      13. saveNamespace.txt
        9 kB
        dhruba borthakur

        Issue Links

          Activity

          Matt Foley added a comment -

          Closed upon release of 0.20.205.0

          Suresh Srinivas added a comment -

          +1 for the patch.

          Suresh Srinivas added a comment -

          I committed the patch to 0.20-security branch.

          Suresh Srinivas added a comment -

          +1 for the 20-security patch.

          Jitendra Nath Pandey added a comment -

          Uploaded patch for 20-security branch.

          Tsz Wo Nicholas Sze added a comment -

          Hi Eli and Todd, it seems that this issue has introduced a deadlock; see HDFS-2229.

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #746 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/746/)

          Eli Collins added a comment -

          Marking issue as fixed; we now have patches checked into branch-20-append, branch-22 and trunk.

          Eli Collins added a comment -

          The jcarder run of the whole test suite went cleanly. I've committed this to trunk. Thanks for the reviews Todd!

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-22-branch #66 (See https://builds.apache.org/job/Hadoop-Hdfs-22-branch/66/)
          HDFS-988. saveNamespace race can corrupt the edits log. Contributed by Eli Collins

          eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1134860
          Files :

          • /hadoop/hdfs/branches/branch-0.22/src/java/org/apache/hadoop/hdfs/server/namenode/FSDirectory.java
          • /hadoop/hdfs/branches/branch-0.22/src/test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestSafeMode.java
          • /hadoop/hdfs/branches/branch-0.22/src/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
          • /hadoop/hdfs/branches/branch-0.22/src/test/hdfs/org/apache/hadoop/hdfs/TestSafeMode.java
          • /hadoop/hdfs/branches/branch-0.22/CHANGES.txt
          • /hadoop/hdfs/branches/branch-0.22/src/test/hdfs/org/apache/hadoop/hdfs/DFSTestUtil.java
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12482272/hdfs-988-7.patch
          against trunk revision 1134869.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 36 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestHDFSCLI

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/775//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/775//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/775//console

          This message is automatically generated.

          Eli Collins added a comment -

          Thanks Todd. I've committed the patch to branch-22.

          The attached patch for trunk is like the last one; it just needed some minor rebasing. I'll do a full run of the test suite with the lockless branch of jcarder before committing.

          Todd Lipcon added a comment -

          +1 on the trunk patch, once you've run the full test suite through jcarder (with the "lockclasses" branch that detects rwlock issues). Also looks like it needs a rebase

          Todd Lipcon added a comment -

          +1 on the 0.22 patch

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12481892/hdfs-988-6.patch
          against trunk revision 1133476.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 33 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestHDFSCLI
          org.apache.hadoop.hdfs.TestHDFSTrash

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/747//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/747//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/747//console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12481892/hdfs-988-6.patch
          against trunk revision 1133476.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 33 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestHDFSCLI

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/746//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/746//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/746//console

          This message is automatically generated.

          Eli Collins added a comment -

          Btw test-patch is +1 for hdfs-988-b22-1.patch on branch 22 and all the tests pass. I think it's good to go, mind reviewing it? Its scope is limited to just the issue in the description (vs the latest patch for trunk).

          Eli Collins added a comment -

          Thanks for the thorough review Todd!

          computeDatanodeWork calls blockManager.computeReplicationWork and blockManager.computeInvalidateWork...

          Agree. Added a comment in computeDatanodeWork with the rationale.

          In heartbeatCheck, I think we can simply put another "if (isInSafeMode()) return" in right after it takes the writeLock if it finds a dead node...

          Agree, added the additional check, and a comment above the first lock acquisition.

          isLockedReadOrWrite should be checking this.fsLock.getReadHoldCount() rather than getReadLockCount()

          Fixed.
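
          For reference, both methods exist on java.util.concurrent.locks.ReentrantReadWriteLock: getReadLockCount() counts read holds across all threads, while getReadHoldCount() counts only the calling thread's holds, which is what a "does this thread hold the lock" check needs. A minimal sketch (the real FSNamesystem method may differ):

          boolean isLockedReadOrWrite(ReentrantReadWriteLock fsLock) {
            return fsLock.getReadHoldCount() > 0            // current thread holds the read lock
                || fsLock.isWriteLockedByCurrentThread();   // or the write lock
          }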

          FSDirectory#bLock says it protects the block map, but it also protects the directory, right? we should update the comment and perhaps the name.

          Correct. I renamed it dirLock to be consistent with FSNameSystem and to reflect that it protects the entire directory, not just the block map. Updated the comment too.

          various functions don't take the read lock because they call functions in FSDirectory that take FSDirectory.bLock. This seems incorrect..

          I think the FSN methods do need to get the read lock before calling into FSD methods that get the read lock. In the latest patch I added this to the places where it was missing (getFileInfo, getContentSummary, getPreferredBlockSize, etc); hopefully the new asserts make it clear that we have the FSN lock when calling into FSD (where necessary). For HADOOP-7362 we should assert the lock hierarchy is obeyed.

          handleHeartbeat calls getDatanode() while only holding locks on heartbeats and datanodeMap, but registerDatanode mutates datanodeMap without locking either.

          I modified handleHeartbeat to take the read lock so it's synchronized with registerDatanode. I also added a missing synchronization on datanodeMap to wipeDatanode. I think the locking in FSN (eg the way heartbeats and datanodeMap are locked) needs to be reconsidered in light of this change. I'll file a jira for that; it's probably out of scope for this jira.

          getDataNodeInfo seems like an unused function with no locking - can we remove it?

          Yes. Done.

          several other places access datanodeMap with synchronization on that object itself. unprotectedAddDatanode should assert it holds that monitor lock

          Made unprotectedAddDatanode synchronize on datanodeMap; per the above, I think we need to revisit how this data structure is accessed.

          when loading the edit log, why doesn't loadFSEdits take a write lock on the namesystem before it starts? then we could add all of the asserts and not worry about it.

          Done. It now takes the namesystem and dir locks, and I modified all the unprotected methods to assert the lock is held.

          it looks like saving the image no longer works, since saveFilesUnderConstruction now takes the readLock, but it's being called by a different thread than took the write lock in saveNamespace.

          Yup, TestEditLogRace quickly found this deadlock. I removed the new locking from saveFilesUnderConstruction.

          When create() is called with the overwrite flag true, that calls delete() which will logSync() while holding the lock. We can hold off on fixing it since it's a performance problem, not correctness, and the operation is fairly rare.

          Filed HDFS-2052.

          getAdditionalBlock doesn't logSync() - I think there's another issue pending about that since it will affect HA. Let's address later.

          Since getAdditionalBlock doesn't do anything that writes to the log the sync is not currently needed. I updated HDFS-978 to indicate it will be needed when block allocations are logged.

          checkFileProgress doesn't really need the write lock

          Yup, modified it to use a read-lock.

          seems like saveNamespace could safely just take the read lock to allow other readers to keep working

          Yup, modified to take the read lock.

          Fixed the nits.

          The tests should be passing; I'm going to do some testing with jcarder enabled. I've also run TestNNThroughputBenchmark; the numbers are noisy, mostly better than w/o the patch (not sure I believe that).

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12481197/988-fixups.txt
          against trunk revision 1130381.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/684//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          Attaching a few fixups on top of hdfs-988-5.patch, related to the below comments:

          Regarding the question about computeDatanodeWork/heartbeatCheck:

          computeDatanodeWork calls blockManager.computeReplicationWork and blockManager.computeInvalidateWork. In the case of computeReplicationWork, it might schedule some replications. This seems OK - worst case we get some extra replicas which will get fixed up later. In the case of computeInvalidateWork, it calls invalidateWorkForOneNode which takes the write lock and then checks safe mode before scheduling any deletions.

          In heartbeatCheck, I think we can simply put another "if (isInSafeMode()) return" in right after it takes the writeLock if it finds a dead node. That way if it races, it still doesn't take any actions based on it. Either way, I don't think this could corrupt anything since it won't write to the edit log.
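
          (For illustration, the re-check described above would look roughly like the following; names are simplified and this is not the actual patch.)

          void heartbeatCheck() {
            if (isInSafeMode()) return;      // cheap pre-check without the write lock
            // ... scan heartbeats and find a dead datanode ...
            writeLock();
            try {
              if (isInSafeMode()) return;    // re-check after taking the lock, in case we raced
              // remove the dead node and schedule any follow-up work
            } finally {
              writeUnlock();
            }
          }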

          Some other notes:

          • isLockedReadOrWrite should be checking this.fsLock.getReadHoldCount() rather than getReadLockCount()
          • FSDirectory#bLock says it protects the block map, but it also protects the directory, right? we should update the comment and perhaps the name.
          • various functions don't take the read lock because they call functions in FSDirectory that take FSDirectory.bLock. This seems incorrect, since, for example, getListing() racing against open() with overwrite=true could return the directory with the file deleted but the new one not there yet. I guess what's confusing me is that it's not clear why some functions don't need readLock when they perform read operations. When is just the FSDirectory lock sufficient? It looks like a lot of the test failures above are due to this.
          • handleHeartbeat calls getDatanode() while only holding locks on heartbeats and datanodeMap, but registerDatanode mutates datanodeMap without locking either.
          • getDataNodeInfo seems like an unused function with no locking - can we remove it?
          • several other places access datanodeMap with synchronization on that object itself. unprotectedAddDatanode should assert it holds that monitor lock
          • when loading the edit log, why doesn't loadFSEdits take a write lock on the namesystem before it starts? then we could add all of the asserts and not worry about it.
          • it looks like saving the image no longer works, since saveFilesUnderConstruction now takes the readLock, but it's being called by a different thread than took the write lock in saveNamespace. So, it deadlocks. At first I thought this could be solved by just making saveNamespace take a read lock instead of write lock, but that actually doesn't work due to fairness – what can happen is that saveNamespace takes readLock, then some other thread comes along and queues up for the write lock. At that point, no further readers are allowed to take the read lock, because it's a fair lock. So, the image-writer thread locks up.
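
          (A minimal illustration of the fairness hazard in the last note above, using a fair java.util.concurrent ReentrantReadWriteLock; this is generic JDK behavior, not NameNode code.)

          ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true); // fair mode
          // Thread A (saveNamespace):   lock.readLock().lock();   // acquired
          // Thread B (any writer):      lock.writeLock().lock();  // queues behind A
          // Thread C (image writer):    lock.readLock().lock();   // blocks: in fair mode new
          //   readers wait behind the queued writer, but A is waiting for C to finish writing
          //   the image, so no thread can make progress.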

          Optimizations to address later:

          • When create() is called with the overwrite flag true, that calls delete() which will logSync() while holding the lock. We can hold off on fixing it since it's a performance problem, not correctness, and the operation is fairly rare.
          • getAdditionalBlock doesn't logSync() - I think there's another issue pending about that since it will affect HA. Let's address later.
          • checkFileProgress doesn't really need the write lock
          • seems like saveNamespace could safely just take the read lock to allow other readers to keep working

          Nits:

          • Typo: "Cnnot concat"
          • rollEditLog has comment saying "Checkpoint not created"
          • rollFSImage has the same issue, but at least has to do with checkpoints, so could be correct
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12481173/hdfs-988-5.patch
          against trunk revision 1130339.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 18 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestHDFSCLI
          org.apache.hadoop.hdfs.server.namenode.TestCheckPointForSecurityTokens
          org.apache.hadoop.hdfs.server.namenode.TestCheckpoint
          org.apache.hadoop.hdfs.server.namenode.TestEditLogRace
          org.apache.hadoop.hdfs.server.namenode.TestParallelImageWrite
          org.apache.hadoop.hdfs.server.namenode.TestSaveNamespace
          org.apache.hadoop.hdfs.server.namenode.TestStartup
          org.apache.hadoop.hdfs.TestDFSFinalize
          org.apache.hadoop.hdfs.TestDFSRollback
          org.apache.hadoop.hdfs.TestDFSStartupVersions
          org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
          org.apache.hadoop.hdfs.TestDFSUpgradeFromImage
          org.apache.hadoop.hdfs.TestDFSUpgrade
          org.apache.hadoop.hdfs.TestListFilesInDFS
          org.apache.hadoop.hdfs.TestListFilesInFileContext
          org.apache.hadoop.hdfs.tools.offlineImageViewer.TestOfflineImageViewer

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/679//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/679//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/679//console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12481191/hdfs-988-b22-1.patch
          against trunk revision 1130381.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 15 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/683//console

          This message is automatically generated.

          Eli Collins added a comment -

          Minimal patch for branch 22 with tests attached.

          Eli Collins added a comment -

          Thanks for taking a look Todd. Updated patch attached.

          checks for if (auditLog.isInfoEnabled()) should probably now be (auditLog.isInfoEnabled() && isExternalInvocation()) – otherwise we're doing a needless directory traversal for fsck

          Fixed.

          The following methods currently do logSync() while holding the writeLock, which is expensive:

          Fixed. (Only one needed to conditionally call logSync)
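
          (The pattern being applied here is roughly the following; the names are illustrative, not the exact patch.)

          void setPermissionLike(String src, String permission) {
            boolean logged = false;
            writeLock();
            try {
              // check safemode, mutate the namespace, editLog.logEdit(...)
              logged = true;
            } finally {
              writeUnlock();
            }
            if (logged) {
              editLog.logSync();   // the expensive disk force happens outside the write lock
            }
          }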

          seems strange that some of the xInternal() methods take the write lock themselves (eg setReplicationInternal) whereas others assume the caller takes the write lock (eg createSymlinkInternal). We should be consistent.

          Latest patch makes them more consistent; I also refactored out a couple new xInternal methods. In a couple places (eg deleteInternal and getListing) I didn't hoist up the locking because it would make the locking too coarse-grained (eg would result in syncing the log w/ the lock held).

          for those methods that don't explicitly take the write lock, we should either add an assert hasWriteLock() or a comment explaining why the lock is not necessary (eg internalReleaseLease, reassignLease, finalizeINodeFileUnderConstruction)

          Done. For FSDirectory I made the unprotectedX methods actually unprotected and moved the locking to the caller (except for FSEditLogLoader which calls the unprotected methods directly on purpose - I doubt this really saves us that much). These methods (per their name) are now intentionally unprotected.
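
          (The assert-instead-of-lock convention looks roughly like this; illustrative names only.)

          // Caller is responsible for holding the lock; the method only asserts it.
          void unprotectedSetTimesLike(String src, long mtime, long atime) {
            assert hasWriteLock() : "caller must hold the write lock";
            // ... mutate the inode without taking the lock again ...
          }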

          comment for endCheckpoint says "not started" but should say "not ended". same with updatePipeline.

          Both fixed.

          why doesn't getListing need the read lock?

          Because its callees (check*, getListing) take the lock.

          I noticed that nextGenerationStamp() doesn't logSync() – that seems dangerous, since after a restart we might hand out a duplicate genstamp.

          Good catch. I made sure all callers sync the log (this was only missing from the updateBlockForPipeline path). nextGenerationStamp is always called with the lock held, so I asserted that and removed the lock acquisition from this method.

          Eli Collins added a comment -

          ELOS#flush (EditLogOutputStream) calls ELFOS#flushAndSync (EditLogFileOutputStream), which does a force on the underlying file channel.

          Bharath Mundlapudi added a comment -

          I am just wondering if we are calling OS sync at all on this code path. All I see is a flush call, which flushes from EditLogOutputStream (Java buffers) to kernel buffers.

          Shouldn't we be doing the following?

          eStream.flush();
          eStream.getFileOutputStream().getFD().sync();

          This will make sure the edits are actually written to disk. Is there any reason for not doing this?
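
          (For context, a generic sketch of the distinction being discussed, in plain java.io/java.nio rather than the EditLogFileOutputStream code: flush() only moves data from user-space buffers to the OS, while FileDescriptor.sync() or FileChannel.force() asks the OS to push it to the device. pendingEdits() is a hypothetical helper.)

          byte[] edits = pendingEdits();                       // hypothetical helper
          FileOutputStream fos = new FileOutputStream("edits");
          fos.write(edits);
          fos.flush();                   // user-space buffers -> kernel buffers only
          fos.getFD().sync();            // kernel buffers -> disk, as suggested above
          // or, equivalently via NIO (what flushAndSync reportedly uses):
          fos.getChannel().force(true);  // force contents and metadata to the device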

          Eli Collins added a comment -

          It looks like most of the unprotected* methods take the rwlock, but don't need to, either because their caller takes the lock or because they are called from loading the edit log (which is why we originally had unprotected versions). Do people mind if I fix that up (remove the locking from these methods, make sure the unprotected versions are only called when loading the log) in this change, or do people want that done in a separate change?

          dhruba borthakur added a comment -

          we should not logSync after a nextGenerationStamp() because it is invoked for every file creation, isn't it? And before the file-create call returns from the NN, we do invoke logSync(), so we should be safe.

          The other time we invoke nextGenerationStamp() is for pipeline error recovery, and that code path should also be ok, let me think.

          Todd Lipcon added a comment -

          Didn't go through the new tests yet, but here are some comments from a first pass through FSN:

          • checks for if (auditLog.isInfoEnabled()) should probably now be (auditLog.isInfoEnabled() && isExternalInvocation()) – otherwise we're doing a needless directory traversal for fsck
          • The following methods currently do logSync() while holding the writeLock, which is expensive:
            • setPermission
            • setOwner
            • commitBlockSynchronization (in some exit paths)
            • updatePipeline
          • These methods should probably just set a local boolean within the synchronized section, then logSync() in the finally clause if it's flagged
          • seems strange that some of the xInternal() methods take the write lock themselves (eg setReplicationInternal) whereas others assume the caller takes the write lock (eg createSymlinkInternal). We should be consistent
          • for those methods that don't explicitly take the write lock, we should either add an assert hasWriteLock() or a comment explaining why the lock is not necessary (eg internalReleaseLease, reassignLease, finalizeINodeFileUnderConstruction)
          • why doesn't getListing need the read lock?
          • comment for endCheckpoint says "not started" but should say "not ended"
          • same with updatePipeline
          • I noticed that nextGenerationStamp() doesn't logSync() – that seems dangerous, since after a restart we might hand out a duplicate genstamp.
          Eli Collins added a comment -

          Updated patch attached.

          • Added TestSafeMode#testOperationsWhileInSafeMode
          • Removed duplicate copy of TestSafeMode#testManualSafeMode
          • Modified FSNamesystem#getNamespaceInfo to take the lock for reading instead of writing (don't see why the original patch took it for writing as it doesn't mutate any state).
          • TestEditLogRace#testSaveNamespace actually covers the scenario addressed by this patch. I modified it to call setQuota instead of mkdirs/delete and w/o this patch it asserts/NPEs, with this patch it loops w/o error. We could modify testSaveNamespace to run with all the fs operations though I'm not sure the additional coverage would be worth the additional execution time (perhaps it could randomly select from a pre-canned set of test methods or fuzz the API).

          Do people concur with my previous comment about computeDatanodeWork and heartbeatCheck?

          Eli Collins added a comment -

          Updated patch attached. Merges Matt's HDFS-988_fix_synchs.patch with my earlier hdfs-988-2.patch (based on Todd's patch).

          • No longer throw a SafeModeException in setTimes
          • Uses rw lock in getNamespaceInfo, setQuota, renewLease, nextGenerationStamp.
          • Use the same technique (get a strong ref to this.safeMode) from isInSafeMode, isInStartupSafeMode, and isPopulatingReplQueues in checkSafeMode.
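
          (The 'strong ref' technique amounts to reading the field once into a local, so a concurrent transition out of safemode cannot null it between the check and the use; a minimal sketch with simplified names:)

          private volatile SafeModeInfo safeMode;

          boolean isInSafeMode() {
            SafeModeInfo sm = this.safeMode;   // single read of the (possibly volatile) field
            return sm != null && sm.isOn();    // no second read that could observe null
          }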

          Also, I noticed computeDatanodeWork and heartbeatCheck check isInSafeMode w/o holding the lock; that's not kosher as well, right? Not great to hold the lock here since both of these methods can be time-intensive, but it seems racy not to.

          I'm working on a test.

          Todd Lipcon added a comment -

          removing patch available status since this still needs to be finished up.

          Matt Foley added a comment -

          Out of the seven test failures, the only one that might have to do with this patch is

          • org.apache.hadoop.hdfs.TestInjectionForSimulatedStorage.testInjection
            But I think it's unlikely.

          In case it wasn't clear, I'm offering this patch file as a possibly useful portion of the solution for this bug, not as a solution in its own right. Feel free to incorporate all or parts of it. Or not.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12479629/HDFS-988_fix_synchs.patch
          against trunk revision 1124364.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestHDFSCLI
          org.apache.hadoop.hdfs.server.datanode.TestBlockRecovery
          org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
          org.apache.hadoop.hdfs.TestFileConcurrentReader
          org.apache.hadoop.hdfs.TestInjectionForSimulatedStorage
          org.apache.hadoop.tools.TestJMXGet

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/559//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/559//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/559//console

          This message is automatically generated.

          Matt Foley added a comment -

Here is an extract from some other work I have been doing that addresses the sync issue for calls into SafeMode methods. It avoids taking the r/w lock for fast read-only operations, which it seems to me can be made safe with a lighter-weight mechanism (a sketch of such a check follows at the end of this comment).

This patch does not address the need Todd observed to add the r/w lock to:

          • getNamespaceInfo
          • setQuota
          • renewLease
          • nextGenerationStamp
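For illustration only, a minimal sketch of the kind of lighter-weight check described above, assuming a simple atomic flag rather than the actual SafeModeInfo plumbing (all names below are placeholders, not the patch's code):

  import java.util.concurrent.atomic.AtomicBoolean;

  /** Illustrative only: a lock-free safemode flag that fast read-only
   *  operations could consult without taking the namesystem r/w lock. */
  class SafeModeFlag {
    private final AtomicBoolean on = new AtomicBoolean(false);

    /** Cheap check usable from read-only queries such as isInSafeMode(). */
    boolean isInSafeMode() {
      return on.get();
    }

    /** Transitions would still run under the heavier namesystem lock,
     *  so they stay atomic with respect to in-flight edits. */
    void enter() { on.set(true); }
    void leave() { on.set(false); }
  }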
          dhruba borthakur added a comment -

          yes, they should all be using RWLock, absolutely.

          Todd Lipcon added a comment -

Looking at trunk I see a lot of other strange synchronization issues. The following methods are all synchronized on the FSNamesystem instance:

          • getNamespaceInfo
          • setQuota
          • renewLease
          • isInSafeMode
          • isInStartupSafeMode
          • isPopulatingReplQueues
          • nextGenerationStamp

          I think all of these should probably be using the new rwlock... Dhruba, what do you think?

          Maybe we need something more like a stress/fuzz test against FSNamesystem rather than trying to target the specific cases mentioned above?
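For reference, a generic sketch of what such a conversion looks like using java.util.concurrent's ReentrantReadWriteLock; the class, fields, and method bodies below are placeholders rather than the actual FSNamesystem code:

  import java.util.concurrent.locks.ReentrantReadWriteLock;

  // Placeholder class: shows the synchronized -> rwlock conversion pattern.
  class NamesystemSketch {
    private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock();
    private boolean safeModeOn;
    private long quota;

    // Read-only query: takes the shared read lock instead of the monitor.
    boolean isInSafeMode() {
      fsLock.readLock().lock();
      try {
        return safeModeOn;
      } finally {
        fsLock.readLock().unlock();
      }
    }

    // Mutating operation: takes the exclusive write lock.
    void setQuota(long newQuota) {
      fsLock.writeLock().lock();
      try {
        quota = newQuota;   // the real method would also log an edit here
      } finally {
        fsLock.writeLock().unlock();
      }
    }
  }

Queries like isInSafeMode then scale with concurrent readers, while mutating operations still serialize behind the write lock.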

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12478366/hdfs-988-2.patch
          against trunk revision 1100054.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestHDFSCLI
          org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
          org.apache.hadoop.hdfs.TestFileConcurrentReader

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/462//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/462//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/462//console

          This message is automatically generated.

          dhruba borthakur added a comment -

1. We should not throw an exception if we are unable to update the access time because the NN is in safemode. That would prevent an administrator from running read-only operations (dfs -cat) on a cluster that is just starting up and is still in safemode. It should just log a warning message and continue.

2. Unit test: Create a cluster, open a file for write, and put the cluster in safemode. Verify that getAdditionalBlock, close, commitBlockSync, etc. fail when the NN is in safemode.
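A rough sketch of such a test, assuming the 0.20-era MiniDFSCluster constructor and the DistributedFileSystem safemode API (class names, the exact exception type, and which operations are rejected should be verified against the branch the patch targets):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hdfs.DistributedFileSystem;
  import org.apache.hadoop.hdfs.MiniDFSCluster;
  import org.apache.hadoop.hdfs.protocol.FSConstants.SafeModeAction;

  public class TestSafeModeRejectsEdits extends junit.framework.TestCase {
    public void testCloseRejectedInSafeMode() throws IOException {
      Configuration conf = new Configuration();
      MiniDFSCluster cluster = new MiniDFSCluster(conf, 1, true, null);
      try {
        DistributedFileSystem fs =
            (DistributedFileSystem) cluster.getFileSystem();
        // Open a file for write, then put the NN into safemode.
        FSDataOutputStream out = fs.create(new Path("/safemode-test"));
        fs.setSafeMode(SafeModeAction.SAFEMODE_ENTER);
        try {
          out.write(1);
          out.close();   // completing the file needs an edit, so it should fail
          fail("close() should be rejected while the NN is in safemode");
        } catch (IOException expected) {
          // expected: SafeModeException, possibly wrapped in a RemoteException
        }
      } finally {
        cluster.shutdown();
      }
    }
  }

getAdditionalBlock could be exercised the same way by writing more than one block's worth of data before entering safemode.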

          Eli Collins added a comment -

Updated patch against trunk attached, as trunk has moved on a bit, e.g. it has introduced read/write locking.

          @Todd, would you mind reviewing this latest version?

          @Dhruba and co, any specific suggestions for a unit test?

          Nigel Daley added a comment -

          Todd, any update on this for 0.22?

          Nigel Daley added a comment -

This is committed to 0.20-append but needs a unit test for trunk.

          dhruba borthakur added a comment -

It is not an easily repeatable scenario: it is a race condition.

          Brahma Reddy Battula added a comment -

Hi dhruba borthakur,

I am using the 20.1 version. I put the NameNode in safemode and then executed saveNamespace, but the edit logs are not corrupted. Can you please give the exact scenario to reproduce this?

          dhruba borthakur added a comment -

Hi Todd, I would appreciate it if you could write some sort of unit test for this one; that will help get this into trunk.

          Todd Lipcon added a comment -

          Marking this as fixed for the append branch (it's committed there, but not resolved for trunk yet)

          Todd Lipcon added a comment -

Nicolas: by seqnum you're referring to the generation stamp on the block being written, as the client continues to initiate recovery during shutdown? Do we also need to pull in HDFS-1145 then? (I have this patch applied on my append branch, but not 1145, and have not seen errors; it just seems to make sense.)

          Nicolas Spiegelberg added a comment -

Version of the saveNamespace patch for the 0.20-append branch. This patch is needed for many of the unit tests to pass. Without it, as DNs die via shutdown(), the remaining DNs get a +1 seqnum, so only the last DN has the highest seqnum on restart, which messes up some tests. This actually causes a real-world problem if the last DN has corrupt data.

          Nicolas Spiegelberg added a comment -

This should be pulled into the branch-0.20-append branch. Besides being a stability fix, its absence causes append tests to fail.

          dhruba borthakur added a comment -

          hi todd, can we get a unit test for this one (as described in the earlier comment)? Thanks.

          dhruba borthakur added a comment -

Code looks excellent. Can we add a unit test that does the following:

• Create a cluster, open a file for write, and put the NN in safemode. Verify that getAdditionalBlock, close, etc. fail when the NN is in safemode.
          Todd Lipcon added a comment -

          dhruba, can you take a look at this blocker please?

          Todd Lipcon added a comment -

The failed test is unrelated (it has been failing all patch builds lately).

Can anyone suggest a unit test for this issue? I can't think of any to add since no functionality was changed; we just cleaned up some potential races.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12443047/hdfs-988.txt
          against trunk revision 938791.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/331/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/331/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/331/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/331/console

          This message is automatically generated.

          Todd Lipcon added a comment -

Marking this as a blocker for the upcoming release. If anyone disagrees, please shout.

          Todd Lipcon added a comment -

          Attaching an updated patch for trunk. Additions:

          • commitBlockSynchronization no longer allowed in safemode. Konstantin, do you still prefer that we open a new issue for this? Dhruba seems to agree that it should be disallowed.
          • startCheckpoint, endCheckpoint, and updatePipeline also check safemode now
          • the new delegation token logging methods check safemode as well.

          Some questions for review:

          • Will logUpdateMasterKey be OK with the SafeModeException?
• Are there some asserts we could add to make it easier to catch these bugs in the future? For example, we could assert !namesystem.isInSafeMode() in FSEditLog.logSync(). Then, if we ran with assertions enabled in unit tests, we would probably notice if we were accidentally making edits while in safe mode (a sketch follows below).
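A sketch of what that assertion might look like; the surrounding class and field are stand-ins, not the actual FSEditLog code:

  // Stand-in for FSEditLog; the real class holds much more state.
  class EditLogSketch {
    interface Namesystem { boolean isInSafeMode(); }

    private final Namesystem namesystem;
    EditLogSketch(Namesystem ns) { this.namesystem = ns; }

    void logSync() {
      // With -ea enabled (as in unit tests), any thread that reaches
      // logSync while the NN is in safe mode trips this immediately.
      assert !namesystem.isInSafeMode()
          : "logSync called while the namenode is in safe mode";
      // ... flush buffered edits to the output streams here ...
    }
  }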

          Some responses to review above:

          FSNamesystem.getAdditionalBlock() checking isInSafeMode() should be before calling chooseTargets(). I would not change getAdditionalBlock() at all.

          Right now, getAdditionalBlock is split into two synchronized blocks. The safe mode status could switch between the two. Are you suggesting that we check safemode in both, or we combine the blocks into one? I assumed the intent was to avoid doing the potentially CPU-heavy chooseTarget work while synchronized.

renewLease() shouldn't be under the FSNamesystem lock? The LeaseManager has its own lock

          This is to prevent safemode from switching while calling renewLease. If we decided that renewing a lease under safemode is not allowed, then we need to synchronize here. Otherwise the check is prone to races.

          Regarding deadlock potential, I think we're safe since LeaseManager.Monitor synchronizes on FSNamesystem before synchronizing on the lease manager.
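A minimal illustration of that lock-ordering argument, with placeholder classes: as long as both paths acquire the FSNamesystem monitor before the lease-manager lock, neither can hold one lock while waiting for the other.

  // Placeholder class: both paths take the namesystem monitor first,
  // then the lease-manager lock, so the ordering is consistent.
  class NamesystemWithLeases {
    private final Object leaseManagerLock = new Object();

    // renewLease(): namesystem monitor, then lease-manager lock.
    synchronized void renewLease(String holder) {
      synchronized (leaseManagerLock) {
        // update the holder's lease expiry
      }
    }

    // LeaseManager.Monitor equivalent: same order, so no deadlock.
    void leaseMonitorTick() {
      synchronized (this) {
        synchronized (leaseManagerLock) {
          // expire stale leases
        }
      }
    }
  }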

          Your changes to permission methods incorporate HDFS-133.

          Resolved that one as dup, thanks.

          Todd Lipcon added a comment -

          Hey Dhruba, yep, I am planning on doing this this week so it gets in the release. Also, I agree re commitBlockSynchronization, though I think we should probably add a couple more tests in this area.

          dhruba borthakur added a comment -

Hi Todd, is it possible to get this one into the re-release of 0.21?

I do not think we should allow FSNamesystem.commitBlockSynchronization() in safe mode. Do you visualize a case where this could cause a problem? I feel safer when I can say that no new transactions can make it into the edits log while the namenode is in safemode.

          Todd Lipcon added a comment -

          Once HDFS-909 is committed I'll rebase this on trunk and take care of Konstantin's review comments

          Konstantin Shvachko added a comment -
          1. enterSafeMode() looks good.
          2. FSNamesystem.getAdditionalBlock() checking isInSafeMode() should be before calling chooseTargets(). I would not change getAdditionalBlock() at all.
3. I think it is fine to allow FSNamesystem.commitBlockSynchronization() in safe mode. In any case it probably deserves a separate issue and discussion.
4. renewLease() shouldn't be under the FSNamesystem lock? The LeaseManager has its own lock.
          5. Your changes to permission methods incorporate HDFS-133.
          6. Dhruba, are you going to promote changes to trunk and 0.21?
          Todd Lipcon added a comment -

          Hi Dhruba,

          I still think we should fix this in the other issues and then backport to 20. But I'll do a review of this patch here since you've already uploaded it:

          • in setPermission, the audit logging has moved outside the synchronized block. Thus dir.getFileInfo may actually return incorrect info (or even fail if it races with someone deleting the file)
          • same goes for setOwner
• I think it's OK, but can you verify that the top synchronized block in getAdditionalBlock can never have side effects? I don't know the lease management code well enough: is checkLease guaranteed to be side-effect free?
          dhruba borthakur added a comment -

This patch is for Hadoop 0.20. It fixes the race between saveNamespace and setting safemode.

          Todd Lipcon added a comment -

HDFS-909's fixes cover two issues. One of the two is in 20, and the other doesn't appear to be. But we may as well just have one JIRA number. The patch for 20 will just be a subset of the changes from trunk; no sense in diverging fix implementations.

          dhruba borthakur added a comment -

I agree that HDFS-956 could be fixed. But HDFS-909 does not apply to 0.20, whereas this one does, doesn't it?

          Todd Lipcon added a comment -

          Does this need to be a separate JIRA from HDFS-909? This code is covered by that patch.

          Also I think HDFS-956 needs to be fixed for this to truly catch all cases.

          dhruba borthakur added a comment -

I think the wait time is theoretically unbounded, but waiting for 5 minutes seems like a pretty foolproof idea to me.

          dhruba borthakur added a comment -

          My proposal is to make the enterSafeMode method wait for all pending transactions to get flushed.

  // In FSNamesystem:
  synchronized void enterSafeMode() throws IOException {
    if (!isInSafeMode()) {
      safeMode = new SafeModeInfo();
      return;
    }
    safeMode.setManual();
    getEditLog().logSyncAll();   // <======= new code here
    NameNode.stateChangeLog.info("STATE* Safe mode is ON. "
                                 + safeMode.getTurnOffTip());
  }

  // In FSEditLog: sync every transaction logged so far, so no pending
  // edit can be flushed after the namespace image is saved.
  synchronized void logSyncAll() throws IOException {
    TransactionId id = myTransactionId.get();
    id.txid = txid;    // advance this thread's sync target to the latest txid
    logSync();
  }
          Brian Bockelman added a comment -

          How long can logSyncs be pending for? Is this corruption still possible if the sysadmin waits, say, 5 minutes?


            People

            • Assignee:
              Eli Collins
              Reporter:
              dhruba borthakur
• Votes:
  0
  Watchers:
  20
