Hadoop HDFS
HDFS-909

Race condition between rollEditLog or rollFSImage and FSEditLog.write operations corrupts edits log

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.20.1, 0.20.2, 0.21.0, 0.22.0
    • Fix Version/s: 0.20.3, 0.21.0
    • Component/s: namenode
    • Labels:
      None
    • Environment:

      CentOS

    • Hadoop Flags:
      Reviewed

      Description

      Closing the edits log file can race with a write to the edits log, resulting in the OP_INVALID end-of-file marker first being overwritten by the concurrent threads (in setReadyToFlush) and then removed twice from the buffer, losing a good byte from the edits log.

      Example:

      FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
      FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
      OR
      FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
      FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
      
      VERSUS
      
      FSNameSystem.completeFile -> FSEditLog.logSync() -> EditLogOutputStream.setReadyToFlush()
      FSNameSystem.completeFile -> FSEditLog.logSync() -> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
      OR 
      Any FSEditLog.write
      

      Access to the edits flush operations is synchronized only at the FSEditLog.logSync() method level. However, at a lower level, access to EditLogOutputStream.setReadyToFlush(), flush() and flushAndSync() is NOT synchronized. These can be called from concurrent threads as in the example above.

      So if a rollEditLog or rollFSImage happens at the same time as a write operation, it can race for EditLogFileOutputStream.setReadyToFlush(), which will overwrite the last byte (normally FSEditLog.OP_INVALID, the "end-of-file marker") and then remove it twice (once from each thread) in flushAndSync()! Hence a valid byte will be missing from the edits log, which leads to a silent SecondaryNameNode failure and a full HDFS failure upon cluster restart.
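
      To make the failure mode concrete, here is a minimal, hypothetical sketch of a double-buffered edit log stream in the spirit of EditLogFileOutputStream (class, field and constant names are simplified and my own, not the actual Hadoop code):

      import java.io.IOException;
      import java.nio.ByteBuffer;
      import java.nio.channels.FileChannel;

      // Simplified sketch: edits accumulate in bufCurrent; a sync swaps the
      // buffers, writes bufReady to disk, then backs the file position up one
      // byte so the end-of-file marker is overwritten by the next flush.
      class DoubleBufferedEditStream {
        private static final byte OP_INVALID = -1;                // end-of-file marker opcode (value illustrative)
        private ByteBuffer bufCurrent = ByteBuffer.allocate(512); // receives new edits
        private ByteBuffer bufReady   = ByteBuffer.allocate(512); // buffer being flushed

        void setReadyToFlush() {
          bufCurrent.put(OP_INVALID);       // seal the buffer with the marker
          ByteBuffer tmp = bufReady;        // swap buffers
          bufReady = bufCurrent;
          bufCurrent = tmp;
        }

        void flushAndSync(FileChannel fc) throws IOException {
          bufReady.flip();
          fc.write(bufReady);               // write the sealed buffer
          fc.position(fc.position() - 1);   // "remove" the end-of-file marker
          bufReady.clear();
        }
      }

      If two threads interleave setReadyToFlush() and flushAndSync() on the same stream without a common lock, the marker can be appended or backed out twice, so a real edit byte is clobbered, which is the corruption described above.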

      We got to this point after investigating a corrupted edits file that made HDFS unable to start with

      namenode.log
      java.io.IOException: Incorrect data format. logVersion is -20 but writables.length is 768. 
              at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450
      

      EDIT: moved the logs to a comment to make this readable

      1. hdfs-909-branch-0.21.txt
        25 kB
        Konstantin Shvachko
      2. hdfs-909-branch-0.20.txt
        25 kB
        Todd Lipcon
      3. hdfs-909-unified.txt
        26 kB
        Todd Lipcon
      4. hdfs-909-branch-0.20.txt
        25 kB
        Todd Lipcon
      5. hdfs-909-ammendation.txt
        7 kB
        Todd Lipcon
      6. hdfs-909-branch-0.21.txt
        24 kB
        Konstantin Shvachko
      7. ASF.LICENSE.NOT.GRANTED--hdfs-909-branch-0.20.txt
        25 kB
        Todd Lipcon
      8. ASF.LICENSE.NOT.GRANTED--hdfs-909.txt
        25 kB
        Todd Lipcon
      9. hdfs-909.txt
        25 kB
        Todd Lipcon
      10. hdfs-909.txt
        25 kB
        Todd Lipcon
      11. hdfs-909.txt
        12 kB
        Todd Lipcon
      12. hdfs-909.txt
        12 kB
        Todd Lipcon
      13. hdfs-909.txt
        11 kB
        Todd Lipcon
      14. hdfs-909-unittest.txt
        6 kB
        Todd Lipcon

        Issue Links

          Activity

          Todd Lipcon added a comment -

          Hi Cosmin,

          Does this exist in branch-20 as well? or is it a regression in 20 when the edit log code was changed to support backupnode etc?

          Cosmin Lehene added a comment -

          Hi Todd,

          I haven't checked yet, so it's possible it's on 0.20 as well. I forgot to add that the issue is particularly nasty because it first fails silently. In our case, the log was corrupted on the 17th of December but we only discovered it yesterday when we restarted HDFS. It can be detected early by monitoring the secondary-namenode.out log file.

          Cosmin Lehene added a comment -

          I looked briefly over the 0.20 branch and it seems it is susceptible to the race condition (though not exactly the same one as here). To simplify things a bit from the initial description:

          in 0.21 the FSEditLog methods logSync() and closeStream() call setReadyToFlush() freely
          in 0.20 setReadyToFlush() is called by logSync() and close(). Apparently close() is synchronized and also checks whether isSyncRunning is set. However, that does not prevent it from racing against logSync() for setReadyToFlush(); it merely waits a second if a sync is already in progress.

          I can see how these could be avoided with a shared lock in both 0.20 and 0.21+; however, it might be better to get the code reviewed by someone who also knows the adjacent parts, as it currently seems really fragile and prone to deadlocks.

          Todd Lipcon added a comment -

          I don't think I see this in 0.20. logSync is not itself a synchronized method, but there is a synchronized (this) block around the invocations of setReadyToFlush.
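
          Roughly, the structure being described looks like the following heavily simplified sketch (not the actual FSEditLog source; isSyncRunning and the commented stream calls are placeholders for illustration):

          class LogSyncShapeSketch {
            private boolean isSyncRunning = false;

            void logSync() {
              synchronized (this) {
                while (isSyncRunning) {                  // another thread is flushing
                  try { wait(1000); } catch (InterruptedException ie) { return; }
                }
                // editStream.setReadyToFlush();         // buffer swap, under the lock
                isSyncRunning = true;
              }

              // editStream.flush();                     // slow disk write, outside the lock

              synchronized (this) {
                isSyncRunning = false;
                notifyAll();                             // wake waiters such as close()
              }
            }
          }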

          Todd Lipcon added a comment -

          Here's a bit of an ugly unit test that I think shows this issue on trunk. I haven't reproduced Cosmin's behavior exactly, but I do manage to get it to throw an exception pretty quickly. The test starts up a bunch of threads which log actions, then alternates (a) roll edit log, (b) verify edits, (c) purge edits.

          On my system it fails quickly with:

              [junit] java.io.IOException: Bad file descriptor
              [junit]     at sun.nio.ch.FileChannelImpl.position0(Native Method)
              [junit]     at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:263)
              [junit]     at org.apache.hadoop.hdfs.server.namenode.EditLogFileOutputStream.flushAndSync(EditLogFileOutputStream.java:145)
              [junit]     at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:83)
              [junit]     at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:867)
              [junit]     at org.apache.hadoop.hdfs.server.namenode.TestEditLogRace$Transactions.run(TestEditLogRace.java:76)
              [junit]     at java.lang.Thread.run(Thread.java:619)
          

          This isn't the same behavior Cosmin reported, so I may actually have an error in my test logic (e.g. not synchronizing on something that a real FSNamesystem would).

          Todd Lipcon added a comment -

          Here is a patch that fixes the issue.

          The included unit test isn't optimal, though it does tend to fail on the unpatched code, and always passes on the patched code. Since this is a race condition, it's difficult to force.

          After this is reviewed, I will post a similar patch against 0.21. I'll also port back the unit test to 0.20 to make absolutely sure the issue doesn't occur there.

          Todd Lipcon added a comment -

          Backported the test to 0.20; it passes just fine since roll and purge both call close(), which waits for any running syncs to complete first.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12431737/hdfs-909.txt
          against trunk revision 903906.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/211/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/211/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/211/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/211/console

          This message is automatically generated.

          Todd Lipcon added a comment -

          Failing test is unrelated (been failing other patch builds, seems security related)

          Konstantin Shvachko added a comment -

          Cosmin, thanks for reporting and investigating this. I want to clarify the problem.
          It looks like rollFSImage() occurred exactly between the two synchronized sections in FSEditLog.logSync(), while logSync() was flushing one of the streams.
          setReadyToFlush() is called by logSync() in the first synchronized section, so it cannot race with calls to setReadyToFlush() originating in FSEditLog.divertFileStreams(), because the latter is also synchronized.
          So the race condition in your case is between

          FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
          VERSUS
          FSNameSystem.completeFile -> FSEditLog.logSync() -> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
          

          setReadyToFlush() swaps the buffers bufCurrent and bufReady, and flushAndSync() writes bufReady into the edits file, and it could write inconsistent data if the buffer has been replaced while writing.

          Todd, it seems the exception you posted, java.io.IOException: Bad file descriptor, means that the file was closed before flush() could complete. Is it the same race or a different one?
          I haven't looked at the patch yet.

          Also, usually the description of the issue should be short. Detailed explanations can be posted in the first comment after that. Is it possible to edit the description now and remove those unwrapped long lines?

          Todd Lipcon added a comment -

          Konstantin: I think it's pretty much the same race. The included unit test should shake out either one, and it passes now.

          The patch is pretty simple - revertFileStreams and divertFileStreams simply need to wait for any thread that's in the unsynchronized middle part of logSync.
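
          For illustration, a minimal sketch of that kind of guard (field and method names here are assumptions for the example, not the actual patch; see the attached hdfs-909.txt for the real change):

          class SyncGuardSketch {
            private boolean isSyncRunning = false;   // true while a logSync() flush is in flight

            // Called (while holding this lock) at the start of divertFileStreams()
            // and revertFileStreams(), so no flush is running when streams are
            // closed or swapped.
            synchronized void waitForSyncToFinish() {
              while (isSyncRunning) {
                try {
                  wait(1000);                        // logSync() calls notifyAll() when done
                } catch (InterruptedException ie) {
                  Thread.currentThread().interrupt();
                  return;
                }
              }
            }
          }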

          Konstantin Shvachko added a comment -

          Todd: I agree the patch fixes this particular race condition, when rollFSImage() is involved.
          I fear there is a similar race here, which it does not solve

          FSNameSystem.saveNamespace() -> FSImage.saveFSImage() -> FSEditLog.createEditLogFile() -> EditLogFileOutputStream.create() -> EditLogFileOutputStream.setReadyToFlush()
          VERSUS
          FSNameSystem.completeFile -> FSEditLog.logSync() -> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
          

          You can probably verify this with your test.
          I was thinking whether there is a universal way to avoid this type of error.
          I mean somebody will write a new contravertFileStreams() method and forget to call waitForSyncToFinish(), and then we will have to fix it again.
          Maybe we should synchronize at the EditLogFileOutputStream level, so that flushAndSync() and setReadyToFlush() will not interfere with each other?
          In TestEditLogRace you call FSEditLog methods directly, avoiding the FSNamesystem layer.
          Would it be better to involve the FSNamesystem lock in the picture?

          Todd Lipcon added a comment -

          > Maybe we should synchronize at the EditLogFileOutputStream level, so that flushAndSync() and setReadyToFlush() will not interfere with each other?

          If we do that, isn't there a worry that each stream could flush at a slightly different point?
          If we do HDFS-903 and add some kind of checksumming/verification to the edit logs (and we should)
          then we rely on them rolling together. (Maybe I'm misunderstanding the locking you're suggesting?)

          > In TestEditLogRace you call FSEditLog methods directly, avoiding the FSNamesystem layer. Would it be better to involve the FSNamesystem lock in the picture?

          I don't think the namesystem lock really matters, since all the write operations are separated into an internalFoo()
          which does the work while synchronized, followed by an unsynchronized logSync(). I was trying to keep the test
          as close to a "unit test" as possible, though I did have to use a minicluster anyway.

          > I fear there is a similar race here, which it does not solve...

          Let me add a unit test for this one as well and see what I come up with. I had forgotten about saveNamespace. Thanks

          Todd Lipcon added a comment -

          This patch adds a unit test for saveNamespace concurrent with edit ops.

          There is no race here because createEditLogFile is creating a new output stream, and thus won't race with any of the existing streams. The edit log isn't mutated until the image has been dumped and then the logs are rolled, so it's the same rollEditLogs case that was fixed originally.

          As for the danger of future changes missing the subtlety, I added a giant comment at the top of FSEditLog in the hopes that it will warn people. A different design with multiple locks (one protecting the buffers, one protecting sync and stream ops) might make it harder to make this mistake, but I think what we have isn't so bad as long as it's documented.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12434498/hdfs-909.txt
          against trunk revision 903906.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/110/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/110/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/110/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/110/console

          This message is automatically generated.

          Todd Lipcon added a comment -

          Failing tests in test-patch are hudson BS per usual

          Cosmin Lehene added a comment -
          completeFile
          FFFFFFEC0900000005003F2F68626173652F64656D6F5F5F75736572732F636F6D70616374696F6E2E646972
          2F3336343035313634362F38333238313438373139303730333137323739000133000D3132363130383236333
          1383335000D313236313038323632383039340008363731303838363400000003F6CBB87EF376E3E60000000
          0040000000000000000039665F9549DE069A5735E00000000040000000000000000039665ADCC71A050B16A
          BF00000000015A179A0000000000039665066861646F6F700A737570657267726F757001??
          

          followed by a rename operation

          0100000003003F2F68626173652F64656D6F5F5F75736572732F636F6D70616374696F6E2E6469722F3336343
          035313634362F3833323831343837313930373033313732373900352F68626173652F64656D6F5F5F757365727
          32F3336343035313634362F746573742F36393137333831323838333034343734333836000D313236313038323
          63331383639
          

          The first byte of the rename was instead read as part of the completeFile() operation. This resulted in the next operation being read as 0x00 (OP_ADD), followed by an int (length) of 0x00000300, which is 768; that value failed the check in the following code:

          case OP_ADD:
          case OP_CLOSE: {
                    // versions > 0 support per file replication
                    // get name and replication
                    int length = in.readInt();
                    if (-7 == logVersion && length != 3||
                        -17 < logVersion && logVersion < -7 && length != 4 ||
                        logVersion <= -17 && length != 5) {
                        throw new IOException("Incorrect data format."  +
                                              " logVersion is " + logVersion +
                                              " but writables.length is " +
                                              length + ". ");
          

          Filling the missing byte with a value resulted in correct interpretation of the 0x01 (OP_RENAME) and correct parsing of the rest of the edits file (>1 MB).
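
          (To spell out the arithmetic behind the 768, based on my reading of the hex dumps above, so treat it as an illustration: the rename record begins 01 00 00 00 03 00 3F ..., and with its first byte stolen the loader sees)

              01             absorbed as the last byte of the truncated completeFile record
              00             read as the next opcode: 0x00 = OP_ADD
              00 00 03 00    read as the int length: 0x00000300 = 768 (the value in the error above)
              3F ...         rest of the rename record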

          This theory is also supported by the NameNode log file from the date the corruption took place:

          namenode.log
          2009-12-17 12:43:51,276 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 10.72.17.162
          2009-12-17 12:43:51,338 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.72.17.165:50010 is added to blk_-480585673051114658_235109{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.165:50010|RBW], ReplicaUnderConstruction[10.72.17.167:50010|RBW]]} size 67108864
          2009-12-17 12:43:51,339 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.72.17.166:50010 is added to blk_-480585673051114658_235109{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.165:50010|RBW], ReplicaUnderConstruction[10.72.17.167:50010|RBW]]} size 67108864
          2009-12-17 12:43:51,342 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hbase/demo__users/compaction.dir/364051646/8328148719070317279. blk_-5923234476536534337_235109{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.164:50010|RBW], ReplicaUnderConstruction[10.72.17.163:50010|RBW]]}
          2009-12-17 12:43:51,352 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.72.17.167:50010 is added to blk_-480585673051114658_235109{blockUCState=COMMITTED, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.165:50010|RBW], ReplicaUnderConstruction[10.72.17.167:50010|RBW]]} size 67108864
          2009-12-17 12:43:51,833 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.72.17.163:50010 is added to blk_-5923234476536534337_235109{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.164:50010|RBW], ReplicaUnderConstruction[10.72.17.163:50010|RBW]]} size 22681498
          2009-12-17 12:43:51,834 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.72.17.164:50010 is added to blk_-5923234476536534337_235109{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.164:50010|RBW], ReplicaUnderConstruction[10.72.17.163:50010|RBW]]} size 22681498
          2009-12-17 12:43:51,834 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.72.17.166:50010 is added to blk_-5923234476536534337_235109{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.164:50010|RBW], ReplicaUnderConstruction[10.72.17.163:50010|RBW]]} size 22681498
          
          
          
          
          
          2009-12-17 12:43:51,835 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: file /hbase/demo__users/compaction.dir/364051646/8328148719070317279 is closed by DFSClient_-1989779667
          2009-12-17 12:43:51,835 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll FSImage from 10.72.17.162
          2009-12-17 12:43:51,870 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=hadoop,hadoop       ip=/10.72.17.166        cmd=rename      src=/hbase/demo__users/compaction.dir/364051646/8328148719070317279     dst=/hbase/demo__users/364051646/test/6917381288304474386       perm=hadoop:supergroup:rw-r--r--
          

          The last three entries show a completeFile operation, followed by a rollFSImage operation, followed by a rename operation. This is most probably where the race condition took place.
          Synchronizing access to EditLogOutputStream could fix the problem; however, other races or deadlocks may still occur.

          Cosmin Lehene added a comment -

          @Konstantin I moved the details to a comment and broke them onto more lines, but I missed the log entry that really messes up the layout. Unfortunately I can't edit the comment, so if you can, please break the log entry lines in my comment to give this page a decent layout. Sorry and thanks.

          PS I'll look at the code again to see the race issue you described.

          Konstantin Boudnik added a comment -

          This patch adds a unit test for saveNamespace concurrent with edit ops.

          Would it be possible to write this test according to JUnit 4 conventions?

          Todd Lipcon added a comment -

          New patch in junit 4 style

          Konstantin Shvachko added a comment -

          Todd, I checked: the race between -saveNamespace and logSync is there.

          • I start the name-node in debugger.
          • I do -mkdir and stop the debugger in logSync() just before it does flush.
          • Then I enter safe mode with another client
          • I start saveNamespace and stop the debugger in FSImage.saveFSImage() -> FSEditLog.createEditLogFile() -> EditLogFileOutputStream.create() -> after truncating the file but before writing LAYOUT_VERSION into it.
          • Then I let logSync() run.
          • Then I terminate the name-node.
          • After that the name-node won't start, since the edits file is broken.

          In short, the problem is that we get two file descriptors to the same file and modify it simultaneously from two different threads. The results are expected to be unpredictable.
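
          (For illustration only, not Hadoop code: a tiny standalone example of why two independent descriptors on the same file give unpredictable results. The file name and byte values are arbitrary.)

          import java.io.FileOutputStream;
          import java.io.IOException;

          public class TwoWritersSketch {
            public static void main(String[] args) throws InterruptedException {
              Thread a = writer("edits.tmp", (byte) 'A');
              Thread b = writer("edits.tmp", (byte) 'B');
              a.start(); b.start();
              a.join();  b.join();
              // The final contents of edits.tmp depend entirely on scheduling:
              // each stream has its own descriptor and position, so the writes
              // overwrite each other unpredictably.
            }

            static Thread writer(String path, byte value) {
              return new Thread(() -> {
                try (FileOutputStream out = new FileOutputStream(path)) { // truncates on open
                  for (int i = 0; i < 1000; i++) out.write(value);
                } catch (IOException e) {
                  throw new RuntimeException(e);
                }
              });
            }
          }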

          Cosmin, thanks for editing. Comments are not editable, so we will have to live with what we've got. It's ok now.

          Todd Lipcon added a comment -

          Canceling patch while I look into this some more.

          Konstantin Shvachko added a comment -

          > I added a giant comment at the top of FSEditLog
          Minor thing. Could you please move the comment down to logSync()? I think it belongs with the implementation of this method rather than with the class itself. The comment is good.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12434606/hdfs-909.txt
          against trunk revision 905760.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/217/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/217/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/217/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/217/console

          This message is automatically generated.

          Todd Lipcon added a comment -

          Hi Konstantin. Thanks for the in depth exploration of that race.

          Let me ask a question - sorry if I'm being dense here

          In FSImage.saveFSImage(boolean) (FSImage.java:1243) we create a new "current" edit log file before rolling the FSImage.
          This seems to be related to the race, and I don't quite understand the design here. It seems to me that the correct operation
          of saveFSImage should be:

          1. Dump the image in IMAGE_NEW
          2. Create a new empty edit log in EDITS_NEW (while leaving EDITS alone)
          3. Call purgeEditLogs which rolls the new image/edits pair in.

          Analyzing the failure cases:

          • During step 1, if the NN restarts, it can recover from the preexisting IMAGE/EDITS as if the saveFSImage never started
          • Between steps 1 and 2, it can recover from either the preexisting IMAGE/EDITS or the IMAGE_NEW.
            IMAGE_NEW is consistent and up to date because saveFSImage is synchronized, and thus it represents the exact state
            of the namesystem.
          • The same goes for step 2 - the empty edit log is correct against the new log. If we decide to recover from current instead
            of NEW at this point it would also be correct.
          • We must assume that step 3 performs rolling in a way that allows recovery - this is what allows rolling to work in general
            safely.

          In the implementation as I understand it, step 2 is implemented differently such that it calls createEditLogFile on EDITS, not
          just EDITS_NEW. This truncates the edits, which means we can only recover from the NEW image at this point. This makes
          for tricky recovery logic, since we have an old image with truncated edit log, plus a new image (which we don't have any great way
          of knowing is complete). Additionally, I don't think there's a guarantee that steps 1 and 2 happen in that order - the order of
          saveFSImage is determined by the order of the storage directories in the directory iterator. It seems like edits usually
          come last, but in the case of failed/recovered storage directories, I don't think it's always true. If the order of steps 1 and 2 is
          inverted, it could truncate the current edits log before saving the new image, and thus end up losing all those edits if the
          NN failed in between.

          Am I misunderstanding something? The simple fix to the issue you suggested above is to add a waitForSyncToFinish in
          createEditLogFile, but I'm worried there may be more subtle issues as described above.

          Thanks!

          Konstantin Shvachko added a comment -

          Todd, EDITS_NEW is created when the checkpoint starts.
          After that all new transactions will go into EDITS_NEW, while EDITS remains untouched.
          So if you empty EDITS_NEW in your step (2) you basically wipe out the latest transactions in HDFS.
          Does that make sense?

          Todd Lipcon added a comment -

          I think we're talking about different things. I'm discussing the saveFSImage that is called by
          FSNamesystem.saveNamespace, where you found the race above.

          FSImage.saveFSImage has this code:

          1240       if (dirType.isOfType(NameNodeDirType.IMAGE))
          1241         saveFSImage(getImageFile(sd, NameNodeFile.IMAGE_NEW));
          1242       if (dirType.isOfType(NameNodeDirType.EDITS)) {
          1243         editLog.createEditLogFile(getImageFile(sd, NameNodeFile.EDITS));
          1244         File editsNew = getImageFile(sd, NameNodeFile.EDITS_NEW);
          1245         if (editsNew.exists())
          1246           editLog.createEditLogFile(editsNew);
          1247       }
          

          On line 1243 we truncate EDITS. Then if EDITS_NEW exists, we truncate it on 1246. All of this happens when
          the NN is in safe mode, so there shouldn't be any new edits coming in in the first place.

          I'm contending that lines 1243 and 1245 should both be deleted. We should always create the image as
          IMAGE_NEW (line 1241). Touching EDITS seems incorrect - what if the order of storage dirs is EDITS
          then IMAGE, so we run line 1243, kill our current edit log, and then crash before saving the current image?

          (this is possibly orthogonal to the issue you raised - even though there are no edits, there can be an ongoing sync.
          I think the NN should call editLog.waitForSyncToFinish before entering safemode to avoid this issue)

          Konstantin Shvachko added a comment -

          > NN is in safe mode, so there shouldn't be any new edits coming in in the first place.

          True, no new edits, but those started before entering safe mode can still run.

          > what if the order of storage dirs is EDITS then IMAGE, so we kill our current edit log, and then crash before saving the current image?

          This is a reasonable concern.
          Historically fsimage and edits were always in the same directory, so we first saved the image
          in the directory and then reset the edits. Now the order may change, and this was not taken care of.
          We should file a new jira for fixing this.

          Todd Lipcon added a comment -

          I think the next step is to write a unit test that tries to trigger the race you found above, and fix it with an additional
          call to waitForSyncToFinish in an appropriate place. I should be able to get to that this week.

          I'll open another JIRA for the saveFSImage behavior described above.

          Sound good?

          Konstantin Shvachko added a comment -

          Sounds good. I think we will need to call waitForSyncToFinish() both before entering safe mode and in createEditLogFile().

          Todd Lipcon added a comment -

          Unfortunately, I wasn't able to write a unit test that reproduces your race. This is because the race only
          occurs if the NN exits before rollEditLogs() is called at the end of FSImage.saveFSImage(boolean) -
          the race induces corruption in EDITS, but EDITS_NEW is a correct empty file. rollEditLogs thus fixes
          up the state of the file.

          I think we'll deal with this issue in the other JIRA regarding saveFSImage operation.

          think we will need to call waitForSyncToFinish() both before entering safe mode

          I think it's actually impossible to get this right with waitForSyncToFinish alone. The issue with calling
          waitForSyncToFinish in enterSafeMode is that many of the FSNamesystem calls have a structure that looks like:

          1 void someOperation() {
          2   synchronized (this) {
          3     if (isInSafeMode()) { explode; }
          4     internalSomeOperation();
          5   }
          6   getEditLog().logSync();
          7 }
          

          If we call enterSafeMode between lines 5 and 6, waitForSyncToFinish would return immediately,
          since the sync isn't running yet. Really this is the case for a thread anywhere in lines 3-6, since
          enterSafeMode is synchronized as well.

          I think we need an additional method with stronger guarantees than waitForSyncToFinish - something like
          syncAllOutstandingOperations that waits until synctxid == txid.
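
          Roughly the semantics I have in mind (purely an illustrative sketch, not a patch - it assumes the
          txid/synctxid counters in FSEditLog and that logSync() does a notifyAll() once a sync completes):

          // sketch: block until every transaction id handed out so far has been synced.
          // relies on the writer threads eventually reaching logSync() for their edits.
          synchronized void syncAllOutstandingOperations() throws InterruptedException {
            while (synctxid < txid) {
              wait(1000);   // woken when logSync() advances synctxid
            }
          }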

          What do you think?

          Konstantin Shvachko added a comment -

          > Unfortunately, I wasn't able to write a unit test that reproduces your race.

          Yes, a unit test is hard here.
          But you can reproduce the problem manually, then fix it and verify the same sequence of steps does not corrupt the edits.
          This is a perfectly acceptable way of doing things imo, and in line with practices in this community.

          > I think we need an additional method with stronger guarantees than waitForSyncToFinish

          What if we just call getEditLog().logSync() before entering safe mode?
          If it is a synchronized call, it should guarantee all syncs are complete when we enter safe mode.

          Todd Lipcon added a comment -

          But you can reproduce the problem manually, then fix it and verify the same sequence of steps does not corrupt the edits.
          This is a perfectly acceptable way of doing things imo, and in line with practices in this community.

          For less critical code paths, I agree, but for potential data loss I really would like to figure out an automatic test here.
          I'll take another swing at a unit test - I often find after a few days away a second try is a lot easier than the first.

          What if we just call getEditLog().logSync(); before entering safe mode.

          That won't work unless we also write something in the log so that the ThreadLocal transaction ID is updated, right?
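
          (For illustration of what I mean by the ThreadLocal transaction ID - a purely hypothetical sketch,
          not the eventual patch: logSync() only syncs up to the caller's own recorded txid, so something
          would have to bump that id to the newest txid first, e.g.:)

          // hypothetical sketch - assumes the per-thread TransactionId holder in FSEditLog
          void logSyncAll() throws IOException {
            synchronized (this) {
              myTransactionId.get().txid = txid;   // adopt the latest transaction id as this thread's own
            }
            logSync();                             // now blocks until synctxid catches up to it
          }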

          Konstantin Boudnik added a comment -

          But you can reproduce the problem manually, then fix it and verify the same sequence of steps does not corrupt the edits.

          This is a perfectly acceptable way of doing things imo, and in line with practices in this community.

          Except that there won't be a regression test, which would be very handy if a subsequent change affects the fix in question. Today's fix might then deteriorate silently.

          Todd Lipcon added a comment -

          Here's a patch which fixes the problem and also includes some more unit tests. It depends on HADOOP-6554 to pass.

          One thing I'd like review comments on:

          Most of the FSNamesystem calls have the general form:

          void xxx() {
            synchronized (this) {
              doXXX(); // mutate state
              getEditLog().logXXX(); // log edits
            }
            getEditLog().logSync();
          }
          

          FSEditLog.waitForAllOngoingEdits() exists to wait for any threads that have written edits but not yet synced them. This ensures that once we enter safe mode, all edits are safely in the log and we can be sure that the log is no longer changing.
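
          The intended call site is roughly the following (simplified sketch, not the exact patch):

          // drain any written-but-unsynced edits before flipping the safe mode flag
          synchronized void enterSafeMode() throws IOException {
            getEditLog().waitForAllOngoingEdits();
            // ... set the safe mode flag and do the usual bookkeeping ...
          }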

          However, if there were an FSNamesystem call that looked like this:

          void xxx() {
            synchronized (this) {
              getEditLog().logXXX();
            }
            ...
            synchronized (this) {
              getEditLog().logXXX();
            }
            logSync();
          }
          

          then we could deadlock if enterSafeMode() were called in between the two synchronized sections, since it would hold the FSN lock while waiting for the logSync() while xxx() is waiting to get the lock.

          Is this likely to be a problem? Any ideas how to avoid it?

          dhruba borthakur added a comment -

          I agree with Todd's latest observation. I think we should break up the FSNamesystem lock into a read lock and a write lock. FSNamesystem.getWriteLock() would presumably behave exactly like the synchronized sections of the current code.
          The above two pieces of code can then be modified as follows:

          void xxx() {
            getWriteLock();
            try {
              doXXX(); // mutate state
              getEditLog().logXXX(); // log edits
            } finally {
              writeUnlock();
            }
            getEditLog().logSync();
          }
          

          and the second piece via:

          void xxx() {
            getReadLock();
            try {
              ...
              getWriteLock();
              try {
                getEditLog().logXXX();
              } finally {
                writeUnlock();
              }
              ...
              getWriteLock();
              try {
                getEditLog().logXXX();
              } finally {
                writeUnlock();
              }
              ...
              logSync();
            } finally {
              readUnlock();
            }
          }
          

          In short, if a thread is inside any method in FSNamesystem, it should hold the read lock. The read lock essentially counts the number of threads executing inside FSNamesystem.

          The enterSafeMode() can be written as

          enterSafeMode() {
            getWriteLock();
            try {
              // ... set safemode flags ...
            } finally {
              writeUnlock();
            }
          }


          Breaking up the FSNamesystem lock also has the advantage that it makes the NN code more multi-threaded. For example, if there are multiple requests to getFileStatus() or getReplicationFactor() they can all execute in parallel.
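
          A sketch of the lock helpers, assuming java.util.concurrent.locks.ReentrantReadWriteLock (illustrative
          only - a real conversion would need more care):

          // FSNamesystem lock helpers backed by a JDK read-write lock instead of synchronized(this)
          private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock();

          void getReadLock()  { fsLock.readLock().lock(); }
          void readUnlock()   { fsLock.readLock().unlock(); }
          void getWriteLock() { fsLock.writeLock().lock(); }
          void writeUnlock()  { fsLock.writeLock().unlock(); }

          One caveat: ReentrantReadWriteLock does not allow upgrading from the read lock to the write lock, so the
          nested getWriteLock() inside getReadLock() above would need either a lock that supports upgrades or a
          restructuring that releases the read lock first.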

          Todd Lipcon added a comment -

          Dhruba: nice idea about switching to an RWLock here. However, I think that's likely to be a larger project
          (or at least out of scope for this issue, which I think we ought to fix for branch-0.20). Do you think there
          are any potential deadlocks in the current code, given the observation above? I.e., is it unsafe to commit
          this patch, and then later do the RWLock conversion to make it less error-prone and get the other
          benefits you noted?

          dhruba borthakur added a comment -

          I agree that the RWlock idea is only for trunk.

          In 0.20, are there any code paths that do the following:

          void xxx() {
            synchronized (this) {
              getEditLog().logXXX();
            }
            ...
            synchronized (this) {
              getEditLog().logXXX();
            }
            logSync();
          }
          

          If there are no such code paths, then there is no chance of deadlock. Otherwise, we might want a fix for 0.20.

          Todd Lipcon added a comment -

          I thought about this a bit more deeply over dinner. I think the potential issue is not in fact a deadlock,
          since waitForAllOngoingEdits is a bit misnamed. In fact, it just calls through to logSync, so in the
          "bad code" example above, what will happen is that we sync the first synchronized block's edit,
          and then enter safe mode before entering the second synchronized block.

          So, to be correct, code just needs to make sure that it always checks isInSafeMode() each time it enters a
          synchronized (this) {...} block, before making any edit. Dhruba, do you concur?
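
          Schematically (same pseudocode style as above):

          void xxx() {
            synchronized (this) {
              if (isInSafeMode()) { explode; }
              getEditLog().logXXX();
            }
            // lock released here - enterSafeMode() may run at this point
            synchronized (this) {
              if (isInSafeMode()) { explode; }   // must re-check after re-acquiring the lock
              getEditLog().logXXX();
            }
            getEditLog().logSync();
          }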

          Switching to the ReadWriteLock would be nice since we could enforce the safemode check as part of
          the lock acquisition. But that's definitely out of scope.

          So, I think this is a clear improvement with no deadlock potential. The error case described above is
          no worse than what currently happens. I think the patch is good, and we'll use HDFS-956 to fix the
          "always check safe mode after entering synchronized blocks" problems.

          Cosmin Lehene added a comment -

          @Todd what's the state of this patch? This happens more often than I initially thought. Just hit it again.

          Todd Lipcon added a comment -

          Hi Cosmin,

          I think the most recent patch is good, though it will need some munging to apply to branch-20.
          If you can apply it and test it out for us, that would be great!

          Cosmin Lehene added a comment -

          @Todd,

          Thanks! We're using 0.21

          dhruba borthakur added a comment -

          I think this problem does not exist in 0.20. Both FSEditLog.close and FSEditLog.logSync invoke setReadyToFlush() within a synchronized section.

          dhruba borthakur added a comment -

          One problem I can see in 0.20 is that isSyncRunning is not a volatile variable. Can this cause a problem?

          Todd Lipcon added a comment -

          I think this problem does not exist in 0.20. Both FSEditLog.close and FSEditLog.logSync invoke setReadyToFlush() within a synchronized section.

          I also think we're safe from the issue Cosmin is seeing, since close() waits for the sync to complete before moving on. But we may be susceptible to Konstantin's race, since
          enterSafeMode doesn't have the right wait behavior.

          I hope to get a chance to backport my current set of tests to 0.20 and, if they expose issues, see what the minimal set of changes is to fix them.

          One problem I can see in 0.20 is that isSyncRunning is not a volatile variable. Can this cause a problem?

          I don't think so, because in this case isSyncRunning is only accessed inside synchronized blocks. It may not need to be volatile even in trunk -
          I'll check that out too.

          dhruba borthakur added a comment -

          I recently experienced the race condition between saveNamespace and logSync methods. The problem is that the last transaction is still in progress even though the namenode has entered safemode.
          The last transaction in the edits log gets corrupted.

          Todd Lipcon added a comment -

          OK. I will target a patch against branch-20 as well, then. Would you mind reviewing the top patch against trunk?

          dhruba borthakur added a comment -

          I also posted a patch just for the saveNamespace-safemode race via HDFS-988. I'd appreciate your feedback on that one.

          Todd Lipcon added a comment -

          Patch updated against trunk. Verified that without the patch, the unit test fails, and passes after patch.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12440954/hdfs-909.txt
          against trunk revision 931338.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 7 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/304/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/304/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/304/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/304/console

          This message is automatically generated.

          Todd Lipcon added a comment -

          Looks like Hudson is just being flaky - the failed tests are unrelated. Do we need to resubmit?

          Todd Lipcon added a comment -

          Can someone please take a look at this blocker? I've verified it still applies against trunk.
          The new tests fail when run against unpatched trunk and pass when I apply the rest
          of the patch.

          Konstantin Shvachko added a comment -

          This patch overlaps with HDFS-988. Could you please clarify the dependency?
          As it looks now, we need both of them to get this right.
          Does Dhruba plan to continue with HDFS-988, or does it make sense to incorporate the two?

          Konstantin Shvachko added a comment -

          The patch looks good. A few nits.

          1. The giant comment should be a javadoc for logSync(). Right now it is attached to a private variable isSyncRunning.
          2. waitForAllOngoingEdits() should not be public.
          3. If I were to choose between names and logSyncAll() for the new method I'd prefer Dhruba's logSyncAll().
          4. Instead of increasing the visibility of FSEditLog.editStreams I'd rather introduce getter or setter methods. You need it only for tests, right?
          Konstantin Shvachko added a comment -

          Sorry, (3) should read:
          3. If I were to choose between the names waitForAllOngoingEdits() and logSyncAll() for the new method, I'd prefer Dhruba's logSyncAll().

          Todd Lipcon added a comment -

          Thanks for the review. New patch addresses your comments.

          Konstantin Shvachko added a comment -

          +1 for the patch.
          The interconnection with HDFS-988 is still not clear, but this should not block committing this, imo.

          Todd Lipcon added a comment -

          Yep, I will update the patch at HDFS-988 once this is committed.

          dhruba borthakur added a comment -

          > Does Dhruba plan to continue with HDFS-988 or does it make sense to incorporate them?

          Todd, if you have the time and energy, please feel free to reassign HDFS-988 to yourself. Thanks.

          Konstantin Shvachko added a comment -

          I see it passed Hudson
          http://hudson.zones.apache.org/hudson/view/Hdfs/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/153/
          with one failed test TestDiskError.testLocalDirs.
          When I run it on my box it does not fail, and it failed on Hudson in some other builds.
          I don't think this is related to the patch.

          Todd, this is marked for committing to 0.21 and 0.20.
          Could you please prepare the respective patches and make sure they pass tests on their branches?
          I will commit them then.

          Todd Lipcon added a comment -

          Here is a patch against branch-0.20. All four tests pass. If I revert the fix but keep the tests, 3 of the 4 fail (one times out after an exception causes the edit log to get stuck in sync, and the others fail outright).

          I am not planning on doing one for branch-0.21, since as I understand it we are abandoning that branch in favor of a release based on what's now trunk.

          One question: the sync() in trunk has a nice fix that adds a finally clause to reset the isSyncRunning flag in case an exception got thrown. Should we backport this fix as well?
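
          For reference, the shape of that trunk fix is roughly (simplified sketch, with a hypothetical helper
          standing in for the real flush loop):

          // inside logSync(): make sure isSyncRunning is cleared even if the flush throws
          try {
            flushReadyStreams();
          } finally {
            synchronized (this) {
              isSyncRunning = false;
              notifyAll();   // wake any thread blocked waiting for this sync
            }
          }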

          Konstantin Shvachko added a comment -

          Not sure how much 0.21 is abandoned. I hear people use it with HBase. Here is the patch.

          Konstantin Shvachko added a comment -

          Todd, I tried to run TestEditLogRace with your 0.20 patch. It runs forever and finally times out.
          It feels like it does a lot of transactions. Could you please verify?
          In trunk the same test runs in 42 secs.
          Also, you have some debug printouts in the patch, like "===== CLOSE DONE", and you use FSNamesystem.LOG for logging in the test.
          The latter is confusing, as you'd then expect a message from FSNamesystem while it actually comes from the test.

          Konstantin Shvachko added a comment -

          Some more:
          What is org.apache.tools.ant.taskdefs.WaitFor used for?
          And there is a blank line change at the end of FSEditLog.

          Todd Lipcon added a comment -

          Hi Konstantin,

          Thanks for the review. It does seem like the test for branch-20 occasionally fails - I had it passing here, but it's flaky and
          doesn't pass every time. Let me dig into this and upload a new fixed patch.

          What org.apache.tools.ant.taskdefs.WaitFor is used for?

          No idea where this came from. I've been trying out Eclipse recently instead of my usual vim, and haven't gotten used to
          cleaning up after its "smarts". I'll double-check the next patch for such cruft as well.

          Todd Lipcon added a comment -

          It turns out the test on trunk was flaky as well. The issue was that we were calling saveNamespace directly on the FSImage while also performing edits from the Transactions threads. This is exactly the behavior we're trying to avoid by forcing the NN into safemode first. Also, we were calling verifyEdits() on an edit log that was being simultaneously written to, which is likely to fail if it reads a partial edit.

          This patch against trunk does the following:

          • Bumps up the number of rolls and saves to 30 instead of 10, since obviously 10 wasn't enough to have it fail reliably.
          • Replaces use of the FSN log with the test's own log
          • Changes the transaction threads to operate via FSN rather than logging directly to the edit log.
          • Any exceptions thrown by the edits will cause the test to properly fail

          To verify this fix, I temporarily bumped the constants for number of rolls up to 200 and checked that it passed.

          This failed sometimes for me without HADOOP-6717, a trivial patch which reduces the amount of log output from new security code.

          I'll separately amend the branch-20 patch with the same changes.

          Konstantin Shvachko added a comment -

          FSEditLog.java imports org.apache.tools.ant.taskdefs.WaitFor in your patch for 0.20.
          As you can see, I've already committed to the other two branches, so it would be good to finish this one sooner rather than later. Thanks.

          Todd Lipcon added a comment -

          Updated branch-20 patch with same changes (plus cleanup of the changes I accidentally left in before)

          Todd Lipcon added a comment -

          Not sure how much 0.21 is abandoned. I hear people use it with HBase. Here is the patch.

          The plan for HBase 0.20.5 is to work against Tom's new 21 release or a 20 with HDFS-200 applied,
          not the current 21 branch. I checked with Cosmin and he is OK moving to what's now trunk.

          Konstantin Shvachko added a comment -

          What is hdfs-909-ammendation.txt for?

          Todd Lipcon added a comment -

          hdfs-909-ammendation.txt goes with this comment above:

          https://issues.apache.org/jira/browse/HDFS-909?focusedCommentId=12859069&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12859069

          (the test as committed in trunk is flaky as well; this is a patch against trunk that fixes it. The bug is just in the test, though, not in the code itself)

          Konstantin Shvachko added a comment -
          • The issue is not closed, so it would be better to have a unified patch, rather than doing two commits. I don't mind recommitting.
          • Test for 0.20 passes fine now. Found 2 (eclipse) warnings in TestEditLogRace:
            • Method getFormattedFSImage() is not used anywhere.
            • Static method setBufferCapacity() should be called in a static manner, like FSEditLog.setBufferCapacity().
          • I understand Tom's plan for 0.21. It does not hurt to commit there, though.
          Todd Lipcon added a comment -

          Here's a unified patch for trunk (the one you committed to trunk plus the test case fixes).
          Also a branch-0.20 patch that addresses the two Eclipse warnings you found.

          Konstantin Shvachko added a comment -

          I just committed this. Thank you Todd.


            People

            • Assignee:
              Todd Lipcon
              Reporter:
              Cosmin Lehene
            • Votes:
              0
              Watchers:
              14
