Hadoop HDFS
HDFS-909

Race condition between rollEditLog or rollFSImage and FSEditLog.write operations corrupts edits log

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.20.1, 0.20.2, 0.21.0, 0.22.0
    • Fix Version/s: 0.20.3, 0.21.0
    • Component/s: namenode
    • Labels: None
    • Environment: CentOS
    • Hadoop Flags: Reviewed

      Description

      Closing the edits log file can race with a write to the edits log file, so the OP_INVALID end-of-file marker is first overwritten by the concurrent threads (in setReadyToFlush) and then removed twice from the buffer, losing a good byte from the edits log.

      Example:

      FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
      FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
      OR
      FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
      FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
      
      VERSUS
      
      FSNameSystem.completeFile -> FSEditLog.logSync() -> EditLogOutputStream.setReadyToFlush()
      FSNameSystem.completeFile -> FSEditLog.logSync() -> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
      OR 
      Any FSEditLog.write
      

      Access to the edits flush operations is synchronized only at the FSEditLog.logSync() method level. However, at a lower level, access to EditLogOutputStream setReadyToFlush(), flush() or flushAndSync() is NOT synchronized. These can be called from concurrent threads, as in the example above.

      So if a rollEditLog or rollFSImage happens at the same time as a write operation, the two can race in EditLogFileOutputStream.setReadyToFlush, which will overwrite the last byte (normally the FSEditLog.OP_INVALID "end-of-file marker") and then remove it twice (once from each thread) in flushAndSync()! Hence a valid byte will be missing from the edits log, which leads to a silent SecondaryNameNode failure and a full HDFS failure upon cluster restart.
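      To make the failure mode concrete, the following is a minimal, self-contained model (simplified stand-ins, not the real EditLogFileOutputStream; it only shows why the marker hand-off breaks when setReadyToFlush()/flushAndSync() run unsynchronized):

      {code:title=EditLogRaceSketch.java (simplified model, not the real classes)}
      import java.util.Arrays;

      // Toy model: the real stream appends an OP_INVALID end-of-file marker in
      // setReadyToFlush() and, after writing, steps back over that marker in
      // flushAndSync(). That step-back is only correct if exactly one marker was
      // written per flush -- which unsynchronized callers cannot guarantee.
      public class EditLogRaceSketch {
        static final byte OP_INVALID = (byte) 0xFF;

        static byte[] file = {};        // bytes that have reached "disk"
        static byte[] pending = {};     // buffer handed over by setReadyToFlush()

        // Hand the buffered records to the flusher, terminated by the marker.
        static void setReadyToFlush(byte[] records) {
          pending = Arrays.copyOf(records, records.length + 1);
          pending[records.length] = OP_INVALID;
        }

        // Write the pending bytes, then drop the trailing marker so the next flush
        // overwrites it. Run twice for a single hand-off, the second trim removes
        // a byte that belongs to a real record.
        static void flushAndSync() {
          byte[] out = Arrays.copyOf(file, file.length + pending.length);
          System.arraycopy(pending, 0, out, file.length, pending.length);
          file = out;
          pending = new byte[0];
          if (file.length > 0) {
            file = Arrays.copyOf(file, file.length - 1);   // step back over the marker
          }
        }

        public static void main(String[] args) {
          byte[] completeFileRecord = { 0x09, 0x42, 0x42 };  // opcode + toy payload

          // Single caller, as FSEditLog.logSync() intends:
          setReadyToFlush(completeFileRecord);
          flushAndSync();
          System.out.println(Arrays.toString(file));   // [9, 66, 66] -- record intact

          // Roll racing with logSync: two callers reach flushAndSync() for one hand-off.
          file = new byte[0];
          setReadyToFlush(completeFileRecord);         // closeStream() path
          flushAndSync();                              // first caller
          flushAndSync();                              // second caller trims again
          System.out.println(Arrays.toString(file));   // [9, 66] -- one valid byte lost
        }
      }
      {code}

      The array trim above stands in for the real stream stepping back over the marker after each flush; the net effect in both cases is the single missing byte described above.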

      We got to this point after investigating a corrupted edits file that made HDFS unable to start with

      namenode.log
      java.io.IOException: Incorrect data format. logVersion is -20 but writables.length is 768. 
              at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450
      

      EDIT: moved the logs to a comment to make this readable

      1. hdfs-909-unittest.txt
        6 kB
        Todd Lipcon
      2. hdfs-909-unified.txt
        26 kB
        Todd Lipcon
      3. hdfs-909-branch-0.21.txt
        24 kB
        Konstantin Shvachko
      4. hdfs-909-branch-0.21.txt
        25 kB
        Konstantin Shvachko
      5. hdfs-909-branch-0.20.txt
        25 kB
        Todd Lipcon
      6. hdfs-909-branch-0.20.txt
        25 kB
        Todd Lipcon
      7. hdfs-909-ammendation.txt
        7 kB
        Todd Lipcon
      8. hdfs-909.txt
        11 kB
        Todd Lipcon
      9. hdfs-909.txt
        12 kB
        Todd Lipcon
      10. hdfs-909.txt
        12 kB
        Todd Lipcon
      11. hdfs-909.txt
        25 kB
        Todd Lipcon
      12. hdfs-909.txt
        25 kB
        Todd Lipcon
      13. ASF.LICENSE.NOT.GRANTED--hdfs-909-branch-0.20.txt
        25 kB
        Todd Lipcon
      14. ASF.LICENSE.NOT.GRANTED--hdfs-909.txt
        25 kB
        Todd Lipcon

        Issue Links

          Activity

          Cosmin Lehene created issue -
          Cosmin Lehene made changes -
          Field Original Value New Value
          Affects Version/s 0.20.1 [ 12314048 ]
          Affects Version/s 0.20.2 [ 12314204 ]
          Todd Lipcon made changes -
          Attachment hdfs-909-unittest.txt [ 12431731 ]
          Todd Lipcon made changes -
          Assignee Todd Lipcon [ tlipcon ]
          Todd Lipcon made changes -
          Attachment hdfs-909.txt [ 12431737 ]
          Todd Lipcon made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Todd Lipcon made changes -
          Attachment hdfs-909.txt [ 12434498 ]
          Todd Lipcon made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Todd Lipcon made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Cosmin Lehene made changes -
          Description Closing the edits log file can race with a write to the edits log file, so the OP_INVALID end-of-file marker is first overwritten by the concurrent threads (in setReadyToFlush) and then removed twice from the buffer, losing a good byte from the edits log.

          Example:
          {code}
          FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
          FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
          OR
          FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
          FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()

          VERSUS

          FSNameSystem.completeFile -> FSEditLog.logSync() -> EditLogOutputStream.setReadyToFlush()
          FSNameSystem.completeFile -> FSEditLog.logSync() -> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
          OR
          Any FSEditLog.write
          {code}

          Access to the edits flush operations is synchronized only at the FSEditLog.logSync() method level. However, at a lower level, access to EditLogOutputStream setReadyToFlush(), flush() or flushAndSync() is NOT synchronized. These can be called from concurrent threads, as in the example above.

          So if a rollEditLog or rollFSImage happens at the same time as a write operation, the two can race in EditLogFileOutputStream.setReadyToFlush, which will overwrite the last byte (normally the FSEditLog.OP_INVALID "end-of-file marker") and then remove it twice (once from each thread) in flushAndSync()! Hence a valid byte will be missing from the edits log, which leads to a silent SecondaryNameNode failure and a full HDFS failure upon cluster restart.

          We got to this point after investigating a corrupted edits file that made HDFS unable to start with

          {code:title=namenode.log}
          java.io.IOException: Incorrect data format. logVersion is -20 but writables.length is 768.
                  at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450
          {code}

          In the edits file we found the first 2 entries:
          {code:title=edits}
          FFFFFFEC0900000005003F2F68626173652F64656D6F5F5F75736572732F636F6D70616374696F6E2E6469722F3336343035313634362F38333238313438373139303730333137323739000133000D31323631303832363331383335000D313236313038323632383039340008363731303838363400000003F6CBB87EF376E3E600000000040000000000000000039665F9549DE069A5735E00000000040000000000000000039665ADCC71A050B16ABF00000000015A179A0000000000039665066861646F6F700A737570657267726F7570010100000003003F2F68626173652F64656D6F5F5F75736572732F636F6D70616374696F6E2E6469722F3336343035313634362F3833323831343837313930373033313732373900352F68626173652F64656D6F5F5F75736572732F3336343035313634362F746573742F36393137333831323838333034343734333836000D3132363130383236333138363902
          ...
          {code}
          This is the completeFile operation that's missing the last byte

          {code:title=completeFile}
          FFFFFFEC0900000005003F2F68626173652F64656D6F5F5F75736572732F636F6D70616374696F6E2E6469722F3336343035313634362F38333238313438373139303730333137323739000133000D31323631303832363331383335000D313236313038323632383039340008363731303838363400000003F6CBB87EF376E3E600000000040000000000000000039665F9549DE069A5735E00000000040000000000000000039665ADCC71A050B16ABF00000000015A179A0000000000039665066861646F6F700A737570657267726F757001??
          {code}
          followed by a rename operation

          {code:title=rename}
          0100000003003F2F68626173652F64656D6F5F5F75736572732F636F6D70616374696F6E2E6469722F3336343035313634362F3833323831343837313930373033313732373900352F68626173652F64656D6F5F5F75736572732F3336343035313634362F746573742F36393137333831323838333034343734333836000D31323631303832363331383639
          {code}

          The first byte of the rename was instead read as part of the completeFile() operation. As a result, the next operation was read as 0x00 (OP_ADD), followed by an int (length) of 0x00000300, which is 768; this is the value that was read and failed in the following code

          {code:title=FSEditLog.java}
           case OP_ADD:
                  case OP_CLOSE: {
                    // versions > 0 support per file replication
                    // get name and replication
                    int length = in.readInt();
                    if (-7 == logVersion && length != 3||
                        -17 < logVersion && logVersion < -7 && length != 4 ||
                        logVersion <= -17 && length != 5) {
                        throw new IOException("Incorrect data format." +
                                              " logVersion is " + logVersion +
                                              " but writables.length is " +
                                              length + ". ");
          {code}

          Filling in the missing byte with a value resulted in correct interpretation of the 0x01 (OP_RENAME) and correct parsing of the rest of the edits file (>1 MB).
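          Replaying that arithmetic on the bytes from the dump makes the shift easy to see (a standalone illustration; the class below is made up, and the byte values are taken from the rename record above):

          {code:title=ShiftedParseDemo.java}
          import java.io.ByteArrayInputStream;
          import java.io.DataInputStream;

          // Illustration only: parse the first bytes of the rename record, once intact
          // and once with its leading 0x01 opcode swallowed by the previous record.
          public class ShiftedParseDemo {
            static void parse(String label, byte[] bytes) throws Exception {
              DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
              int opcode = in.readByte();                 // first byte = operation code
              int length = in.readInt();                  // next 4 bytes = writables.length
              System.out.printf("%s: opcode=0x%02X length=%d%n", label, opcode, length);
            }

            public static void main(String[] args) throws Exception {
              // 0x01 (OP_RENAME), writables.length = 3, then the first path string
              byte[] intact  = { 0x01, 0x00, 0x00, 0x00, 0x03, 0x00, 0x3F, 0x2F };
              // Same bytes with the leading 0x01 consumed by the completeFile record
              byte[] shifted = {       0x00, 0x00, 0x00, 0x03, 0x00, 0x3F, 0x2F };

              parse("intact ", intact);    // opcode=0x01 length=3
              parse("shifted", shifted);   // opcode=0x00 (OP_ADD) length=768 -> fails the check above
            }
          }
          {code}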

          This theory is also supported by the NameNode log file from the date the corruption took place:

          {code:title=namenode.log}
          2009-12-17 12:43:51,276 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 10.72.17.162
          2009-12-17 12:43:51,338 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.72.17.165:50010 is added to blk_-480585673051114658_235109{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.165:50010|RBW], ReplicaUnderConstruction[10.72.17.167:50010|RBW]]} size 67108864
          2009-12-17 12:43:51,339 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.72.17.166:50010 is added to blk_-480585673051114658_235109{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.165:50010|RBW], ReplicaUnderConstruction[10.72.17.167:50010|RBW]]} size 67108864
          2009-12-17 12:43:51,342 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hbase/demo__users/compaction.dir/364051646/8328148719070317279. blk_-5923234476536534337_235109{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.164:50010|RBW], ReplicaUnderConstruction[10.72.17.163:50010|RBW]]}
          2009-12-17 12:43:51,352 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.72.17.167:50010 is added to blk_-480585673051114658_235109{blockUCState=COMMITTED, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.165:50010|RBW], ReplicaUnderConstruction[10.72.17.167:50010|RBW]]} size 67108864
          2009-12-17 12:43:51,833 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.72.17.163:50010 is added to blk_-5923234476536534337_235109{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.164:50010|RBW], ReplicaUnderConstruction[10.72.17.163:50010|RBW]]} size 22681498
          2009-12-17 12:43:51,834 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.72.17.164:50010 is added to blk_-5923234476536534337_235109{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.164:50010|RBW], ReplicaUnderConstruction[10.72.17.163:50010|RBW]]} size 22681498
          2009-12-17 12:43:51,834 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.72.17.166:50010 is added to blk_-5923234476536534337_235109{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.164:50010|RBW], ReplicaUnderConstruction[10.72.17.163:50010|RBW]]} size 22681498
          2009-12-17 12:43:51,835 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: file /hbase/demo__users/compaction.dir/364051646/8328148719070317279 is closed by DFSClient_-1989779667
          2009-12-17 12:43:51,835 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll FSImage from 10.72.17.162
          2009-12-17 12:43:51,870 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=hadoop,hadoop ip=/10.72.17.166 cmd=rename src=/hbase/demo__users/compaction.dir/364051646/8328148719070317279 dst=/hbase/demo__users/364051646/test/6917381288304474386 perm=hadoop:supergroup:rw-r--r--
          {code}

          The last 3 entries show a completeFile operation, followed by a rollFSImage operation, followed by a rename operation. This is most probably where the race condition took place.
          Synchronizing access to EditLogOutputStream could fix the problem; however, other races or deadlocks may still occur.
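          One possible shape for that synchronization (a sketch only, not necessarily the committed patch; flushStream is a made-up helper) is to hold the stream's monitor across the hand-off and the flush, so a closeStream() and a logSync() cannot interleave between the two steps:

          {code:title=Sketch only}
          // Fragment, assuming it lives in the namenode package where EditLogOutputStream
          // (with setReadyToFlush()/flush()) is visible. flushStream is a made-up name.
          private void flushStream(EditLogOutputStream stream) throws IOException {
            synchronized (stream) {        // one roll or sync at a time per stream
              stream.setReadyToFlush();    // swap buffers and append the end-of-file marker
              stream.flush();              // drives the flushAndSync() described above
            }
          }
          {code}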
          Todd Lipcon made changes -
          Attachment hdfs-909.txt [ 12434606 ]
          Todd Lipcon made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Todd Lipcon made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Todd Lipcon made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Todd Lipcon made changes -
          Link This issue relates to HDFS-955 [ HDFS-955 ]
          Todd Lipcon made changes -
          Attachment hdfs-909.txt [ 12435924 ]
          Todd Lipcon made changes -
          Link This issue is blocked by HADOOP-6554 [ HADOOP-6554 ]
          Todd Lipcon made changes -
          Link This issue is related to HDFS-980 [ HDFS-980 ]
          Todd Lipcon made changes -
          Link This issue relates to HDFS-956 [ HDFS-956 ]
          dhruba borthakur made changes -
          Link This issue relates to HDFS-988 [ HDFS-988 ]
          Todd Lipcon made changes -
          Attachment hdfs-909.txt [ 12440954 ]
          Todd Lipcon made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Todd Lipcon made changes -
          Attachment hdfs-909.txt [ 12441669 ]
          Todd Lipcon made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Todd Lipcon made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Todd Lipcon made changes -
          Attachment hdfs-909-branch-0.20.txt [ 12442028 ]
          Konstantin Shvachko made changes -
          Attachment hdfs-909-branch-0.21.txt [ 12442243 ]
          Todd Lipcon made changes -
          Attachment hdfs-909-ammendation.txt [ 12442350 ]
          Todd Lipcon made changes -
          Attachment hdfs-909-branch-0.20.txt [ 12442357 ]
          Todd Lipcon made changes -
          Attachment hdfs-909-unified.txt [ 12442370 ]
          Attachment hdfs-909-branch-0.20.txt [ 12442371 ]
          Konstantin Shvachko made changes -
          Attachment hdfs-909-branch-0.21.txt [ 12442385 ]
          Konstantin Shvachko made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Hadoop Flags [Reviewed]
          Resolution Fixed [ 1 ]
          Konstantin Shvachko made changes -
          Fix Version/s 0.20.3 [ 12314814 ]
          Tom White made changes -
          Fix Version/s 0.22.0 [ 12314241 ]
          Tom White made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee: Todd Lipcon
            • Reporter: Cosmin Lehene
            • Votes: 0
            • Watchers: 14
