Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.18.2
    • Fix Version/s: 0.18.3
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      hadoop dfs -cat file1 returns
      dfs.DFSClient: Could not obtain block blk_XX_0 from any node: java.io.IOException: No live nodes contain current block

      Tracing the history of the block in the NN log, we found:
      WARN org.apache.hadoop.fs.FSNamesystem: Inconsistent size for block blk_-6160940519231606858_0 reported from A1.A2.A3.A4:50010 current size is 9303872 reported size is 262144
      WARN org.apache.hadoop.fs.FSNamesystem: Deleting block blk_-6160940519231606858_0 from A1.A2.A3.A4:50010
      INFO org.apache.hadoop.dfs.StateChange: DIR* NameSystem.invalidateBlock: blk_-6160940519231606858_0 on A1.A2.A3.A4:50010
      WARN org.apache.hadoop.fs.FSNamesystem: Error in deleting bad block blk_-6160940519231606858_0 org.apache.hadoop.dfs.SafeModeException: Cannot invalidate block blk_-6160940519231606858_0. Name node is in safe mode.
      WARN org.apache.hadoop.fs.FSNamesystem: Inconsistent size for block blk_-6160940519231606858_0 reported from B1.B2.B3.B4:50010 current size is 9303872 reported size is 306688
      WARN org.apache.hadoop.fs.FSNamesystem: Deleting block blk_-6160940519231606858_0 from B1.B2.B3.B4:50010
      INFO org.apache.hadoop.dfs.StateChange: DIR* NameSystem.invalidateBlock: blk_-6160940519231606858_0 on B1.B2.B3.B4:50010
      WARN org.apache.hadoop.fs.FSNamesystem: Error in deleting bad block blk_-6160940519231606858_0 org.apache.hadoop.dfs.SafeModeException: Cannot invalidate block blk_-6160940519231606858_0. Name node is in safe mode.
      INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.chooseExcessReplicates: (C1.C2.C3.C4:50010, blk_-6160940519231606858_0) is added to recentInvalidateSets
      INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.chooseExcessReplicates: (D1.D2.D3.D4:50010, blk_-6160940519231606858_0) is added to recentInvalidateSets
      INFO org.apache.hadoop.dfs.StateChange: BLOCK* ask C1.C2.C3.C4:50010 to delete blk_-6160940519231606858_0
      INFO org.apache.hadoop.dfs.StateChange: BLOCK* ask D1.D2.D3.D4:50010 to delete blk_-6160940519231606858_0

      Attachments

      1. corruptBlocksStartup.patch (6 kB, Hairong Kuang)
      2. corruptBlocksStartup1.patch (6 kB, Hairong Kuang)
      3. corruptBlocksStartup1-br18.patch (6 kB, Hairong Kuang)

        Issue Links

          Activity

          Hairong Kuang created issue -
          Hairong Kuang added a comment -

          When the invalid replicas (the ones from A1.A2.A3.A4 and B1.B2.B3.B4) were reported, the NN detected that they were corrupted and tried to mark them as invalid by adding them to recentInvalidates. Those attempts failed because of SafeMode, yet the replicas were still left in blocksMap. When the NN came out of SafeMode, the corrupted replicas were counted as valid ones and the block was treated as over-replicated. Unfortunately, when removing the excess replicas, the valid replicas (the ones from C1.C2.C3.C4 and D1.D2.D3.D4) were chosen for removal, so valid data was lost.

          It seems to me that when corrupted replicas are first detected in a block report, they should be put in CorruptedBlockMap instead of being invalidated directly. At the very least the NN would not count them as valid ones, so false over-replication would not happen.
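
          A minimal, self-contained sketch of the sequence described above (not Hadoop source; the class, fields, and constants are simplified stand-ins used only for illustration). It shows how a truncated replica that cannot be queued for invalidation during safe mode still inflates the live-replica count afterwards:

          import java.util.*;

          public class SafeModeInvalidationSketch {
              // replicas the NN believes exist for one block: datanode address -> reported length
              static final Map<String, Long> blocksMap = new HashMap<>();
              // replicas queued for deletion
              static final Set<String> recentInvalidates = new HashSet<>();
              static boolean inSafeMode = true;
              static final long EXPECTED_LENGTH = 9303872L;
              static final int REPLICATION = 2;

              // Block-report handling before the fix: try to invalidate a short replica directly.
              static void reportReplica(String node, long length) {
                  blocksMap.put(node, length);                 // replica is registered first
                  if (length < EXPECTED_LENGTH) {
                      if (inSafeMode) {
                          // SafeModeException in the real system: the replica is NOT queued
                          // for deletion, yet it stays in blocksMap and later counts as valid.
                          System.out.println("Cannot invalidate replica on " + node + ": in safe mode");
                      } else {
                          recentInvalidates.add(node);
                      }
                  }
              }

              public static void main(String[] args) {
                  reportReplica("A1.A2.A3.A4", 262144);           // truncated replica
                  reportReplica("B1.B2.B3.B4", 306688);           // truncated replica
                  reportReplica("C1.C2.C3.C4", EXPECTED_LENGTH);  // good replica
                  reportReplica("D1.D2.D3.D4", EXPECTED_LENGTH);  // good replica

                  inSafeMode = false;
                  // After safe mode: 4 "valid" replicas against a replication factor of 2,
                  // so the block looks over-replicated even though 2 copies are bad.
                  int live = blocksMap.size() - recentInvalidates.size();
                  System.out.println("live replicas counted: " + live + ", target: " + REPLICATION);
                  // Nothing distinguishes the short replicas here, so excess-replica
                  // deletion may pick the good copies and lose the data.
              }
          }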

          Hairong Kuang added a comment - - edited

          The data got lost when we upgraded our cluster from 0.17 to 0.18. The root cause of the problem is HADOOP-4663 which introduced a lot of corrupted blocks into dfs and thus polluted the data.

          Hairong Kuang made changes -
          Field Original Value New Value
          Link This issue is related to HADOOP-4663 [ HADOOP-4663 ]
          Hairong Kuang added a comment -

          This patch does the following (a rough sketch of the approach follows after this list):
          1. addStoredBlock marks a mismatched block as corrupted instead of deleting it.
          2. addStoredBlock also checks whether a zero-length block has been truncated.
          3. invalidateBlock can be executed in safe mode because it only changes in-memory data structures.
          4. markBlockAsCorrupt adds the corrupted block to corruptedBlockMap even if the block is not under-replicated.
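
          Below is a minimal stand-alone sketch of the behavior the four points above describe (not Hadoop source; the names and structures are simplified stand-ins): a mismatched replica is recorded as corrupt rather than deleted outright, so it never counts toward the live replicas, safe mode or not.

          import java.util.*;

          public class CorruptReplicaSketch {
              // datanode address -> reported length for one block
              static final Map<String, Long> blocksMap = new HashMap<>();
              // stand-in for the corruptedBlockMap mentioned in the patch description
              static final Set<String> corruptReplicas = new HashSet<>();
              static final long EXPECTED_LENGTH = 9303872L;

              // After the patch: a mismatched replica is marked as corrupt instead of
              // being scheduled for deletion on the spot.
              static void addStoredBlock(String node, long reportedLength) {
                  blocksMap.put(node, reportedLength);
                  // covers the zero-length/truncated case from item 2 as well
                  if (reportedLength < EXPECTED_LENGTH) {
                      markBlockAsCorrupt(node);
                  }
              }

              static void markBlockAsCorrupt(String node) {
                  // recorded even if the block is not under-replicated (item 4);
                  // actual deletion is deferred until enough healthy replicas exist
                  corruptReplicas.add(node);
              }

              static int liveReplicas() {
                  return blocksMap.size() - corruptReplicas.size();
              }

              public static void main(String[] args) {
                  addStoredBlock("A1.A2.A3.A4", 262144);
                  addStoredBlock("C1.C2.C3.C4", EXPECTED_LENGTH);
                  System.out.println("live=" + liveReplicas() + " corrupt=" + corruptReplicas.size());
              }
          }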

          Hairong Kuang made changes -
          Attachment corruptBlocksStartup.patch [ 12395769 ]
          Hairong Kuang made changes -
          Link This issue relates to HADOOP-4848 [ HADOOP-4848 ]
          Konstantin Shvachko added a comment -

          I think that just removing the safe mode condition from invalidateBlock() (point 3) should solve this particular problem.
          During startup (in safe mode) the block with the smaller length will be added to recentInvalidates, and will not be scheduled for deletion until after safe mode is turned off - this is how ReplicationMonitor works now.
          In general I agree with Hairong that shorter blocks should be considered corrupt, but doing it now and synchronizing it across 3 versions seems too complicated. It adds one more step before an incorrect block gets into recentInvalidates: you first place it into corruptReplicas, and only when the block is fully replicated does it move into recentInvalidates. Verifying that everything works correctly on that path is hard - I just tried. I would rather incorporate this into HADOOP-4563.
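
          The two-step path described above, as a minimal stand-alone sketch (hypothetical names; not Hadoop source): a corrupt replica is parked in a corrupt-replica set and only moves to the deletion queue once the block has enough healthy replicas.

          import java.util.*;

          public class CorruptToInvalidateSketch {
              // stand-ins for the two queues mentioned in the comment above
              static final Set<String> corruptReplicas = new HashSet<>();
              static final Set<String> recentInvalidates = new HashSet<>();
              static final int REPLICATION = 3;

              // A corrupt replica is only queued for deletion once the block is fully
              // replicated, so deleting the bad copy never drops the block below target.
              static void maybeScheduleDeletion(String corruptNode, int healthyReplicas) {
                  corruptReplicas.add(corruptNode);
                  if (healthyReplicas >= REPLICATION) {
                      corruptReplicas.remove(corruptNode);
                      recentInvalidates.add(corruptNode);
                  }
              }

              public static void main(String[] args) {
                  maybeScheduleDeletion("A1.A2.A3.A4", 2); // stays in corruptReplicas for now
                  maybeScheduleDeletion("A1.A2.A3.A4", 3); // now safe to queue for deletion
                  System.out.println("corrupt=" + corruptReplicas + " toDelete=" + recentInvalidates);
              }
          }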

          Hairong Kuang added a comment -

          OK, I agree with Konstantin. The smaller the change, the better.

          Hairong Kuang added a comment -

          I tried Konstantin's suggestion and found a problem with it. The NN preserves a block's size persistently, so when the first truncated replica is reported, the NN is able to detect it and tries to invalidate it. But the invalidation fails because this is the only copy of the block, so the corrupted replica slips in and becomes a valid one.

          I will keep the first patch.
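
          A minimal stand-alone sketch of the problem just described (hypothetical names; not Hadoop source): if the namenode refuses to invalidate the last remaining copy of a block, a truncated replica reported first after a restart stays registered, and without a corrupt-replica map nothing records that it is bad.

          import java.util.*;

          public class LastReplicaSketch {
              // datanode address -> reported length for a single block
              static final Map<String, Long> replicas = new HashMap<>();
              static final Set<String> recentInvalidates = new HashSet<>();

              // Simplified stand-in for the invalidation rule: refuse to delete the
              // only remaining copy of a block.
              static boolean invalidateReplica(String node) {
                  int remaining = replicas.size() - recentInvalidates.size();
                  if (remaining <= 1) {
                      System.out.println("refusing to invalidate the only copy on " + node);
                      return false;
                  }
                  recentInvalidates.add(node);
                  return true;
              }

              public static void main(String[] args) {
                  // first block report after restart: only the truncated copy has arrived so far
                  replicas.put("A1.A2.A3.A4", 262144L);
                  boolean removed = invalidateReplica("A1.A2.A3.A4");
                  System.out.println("invalidated=" + removed + ", replica still counted as valid");
              }
          }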

          Raghu Angadi added a comment -

          Looks good. Konstantin's suggestion would have made the patch much simpler, but given that it still has an issue, we can go with the patch.

          One minor nit: there are two messages with "Corrupt block ... from" in two different places. It would be better if there were something in the message that can tell these two apart.

          Sanjay Radia added a comment - - edited

          Looks good. +1 modulo the following four things:

          1) Change the comment
          // Delete new replica.
          to
          // mark new replica as corrupt

          2) For each of the cases, check whether the lease is open. If the lease is not open, log an error that we got a length mismatch even though the file was not open (a rough sketch of such a check follows after this list).
          Also file a jira for the case when the lease is not open, to perhaps write the new length to the edits log (I am not sure whether writing the new length is right or wrong, but we can debate this on that jira).

          3) Your fix will not let us distinguish between true corruption caused by some bug in HDFS and the normal mismatch that can occur during appends when a client dies (I am not sure of this, but that is my recollection from the append discussions with Dhruba last year at Yahoo).
          This is okay for now, but let us file a jira to fix this so that we can distinguish the two.
          The easy code fix is to add a field to the internal data structure to record the original length from the fsimage - but this will increase the system's memory usage, since the extra 4 bytes will be multiplied by the number of logical blocks in the system.

          4) In my opinion the correct behavior for shorter blocks (but longer than the fsimage-recorded length) is to invalidate them as in the original code - however, our invalidation code does not handle this case, because if the "corrupt" block is the last one it keeps it as valid. Thus your patch is a good emergency fix to this very critical problem.
          I suggest that we file a jira to handle invalidating such invalid blocks correctly.
          Note that here I am distinguishing between corrupt blocks (caused by hardware errors or by bugs in our software) and invalid blocks (length mismatches that can occur due to client or other failures). Others may not share the distinction I make - let's debate that in the jira; we need to get this patch out ASAP.
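
          A minimal stand-alone sketch of the lease check proposed in item 2 (hypothetical names and structure; not Hadoop source): a length mismatch on a file whose lease is open may be a normal in-progress write, while a mismatch on a closed file is logged as an error.

          public class LengthMismatchCheckSketch {
              // minimal stand-ins for the state such a check would consult
              static class FileState {
                  final boolean leaseOpen;      // i.e. the file is still being written
                  final long expectedLength;    // length recorded by the namenode
                  FileState(boolean leaseOpen, long expectedLength) {
                      this.leaseOpen = leaseOpen;
                      this.expectedLength = expectedLength;
                  }
              }

              static void onLengthMismatch(String path, FileState file, long reportedLength) {
                  if (reportedLength == file.expectedLength) {
                      return; // nothing to do
                  }
                  if (file.leaseOpen) {
                      // an open lease can legitimately explain a shorter replica
                      System.out.println("INFO: length mismatch on open file " + path
                              + " (expected " + file.expectedLength + ", got " + reportedLength + ")");
                  } else {
                      // no open lease: this should not happen and deserves an error
                      System.out.println("ERROR: length mismatch on closed file " + path
                              + " (expected " + file.expectedLength + ", got " + reportedLength + ")");
                  }
              }

              public static void main(String[] args) {
                  onLengthMismatch("/data/file1", new FileState(false, 9303872L), 262144L);
                  onLengthMismatch("/data/file2", new FileState(true, 1048576L), 524288L);
              }
          }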

          Sanjay Radia added a comment -

          After some more thinking, let's do item 2 as a separate jira to allow us to discuss it further.

          Hairong Kuang added a comment -

          An updated patch for trunk.

          Hairong Kuang made changes -
          Attachment corruptBlocksStartup1.patch [ 12395980 ]
          Hairong Kuang added a comment -

          The patch for 0.18 branch.

          Hairong Kuang made changes -
          Attachment corruptBlocksStartup1-br18.patch [ 12395981 ]
          Hairong Kuang added a comment -

          Ant test-patch result:
          [exec] -1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 3 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] -1 Eclipse classpath. The patch causes the Eclipse classpath to differ from the contents of the lib directories.

          Ant test-core result:
          Only org.apache.hadoop.mapred.TestJobTrackerRestart failed.

          All failures are known and not related to this patch.

          Hairong Kuang added a comment -

          I've just committed this.

          Hairong Kuang made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Hadoop Flags [Reviewed]
          Resolution Fixed [ 1 ]
          Hudson added a comment -

          Integrated in Hadoop-trunk #698 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/698/ )
          dhruba borthakur added a comment -

          +1 on this patch, albeit belatedly.

          Block report processing should ignore blocks that are being written to via HADOOP-5027.

          Nigel Daley made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Owen O'Malley made changes -
          Component/s dfs [ 12310710 ]

            People

             • Assignee: Hairong Kuang
             • Reporter: Hairong Kuang
             • Votes: 0
             • Watchers: 2

              Dates

              • Created:
                Updated:
                Resolved:

                Development