Hadoop Common / HADOOP-1955

Corrupted block replication retries forever

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.14.1
    • Fix Version/s: 0.14.2, 0.15.0
    • Component/s: None
    • Labels: None

      Description

      When replicating a corrupted block, the receiving side rejects the block due to a checksum error. The Namenode keeps on retrying (with the same source datanode).
      Fsck shows those blocks as under-replicated.

      [Namenode log]

      2007-09-27 02:00:05,273 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 99.2.99.111
      ...
      2007-09-27 02:01:02,618 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 99.9.99.11:9999 to replicate blk_-5925066143536023890 to datanode(s) 99.9.99.37:9999
      2007-09-27 02:10:03,843 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor timed out block blk_-5925066143536023890
      2007-09-27 02:10:08,248 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 99.9.99.11:9999 to replicate blk_-5925066143536023890 to datanode(s) 99.9.99.35:9999
      2007-09-27 02:20:03,848 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor timed out block blk_-5925066143536023890
      2007-09-27 02:20:08,646 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 99.9.99.11:9999 to replicate blk_-5925066143536023890 to datanode(s) 99.9.99.19:9999
      (repeats)

      [Datanode (sender) 99.9.99.11 log]

      2007-09-27 02:01:04,493 INFO org.apache.hadoop.dfs.DataNode: Starting thread to transfer block blk_-5925066143536023890 to [Lorg.apache.hadoop.dfs.DatanodeInfo;@e58187
      2007-09-27 02:01:05,153 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_-5925066143536023890 to 74.6.128.37:50010 got java.net.SocketException: Connection reset
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.dfs.DataNode.sendBlock(DataNode.java:1231)
        at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1280)
        at java.lang.Thread.run(Thread.java:619)
      (repeats)

      [Datanode (one of the receivers) 99.9.99.37 log]

      2007-09-27 02:01:05,150 ERROR org.apache.hadoop.dfs.DataNode: DataXceiver: java.io.IOException: Unexpected checksum mismatch while writing blk_-5925066143536023890 from /74.6.128.33:57605
        at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:902)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:727)
        at java.lang.Thread.run(Thread.java:619)
      Attachments

      1. HADOOP-1955.patch
        1 kB
        Raghu Angadi
      2. HADOOP-1955-branch14.patch
        11 kB
        Raghu Angadi
      3. HADOOP-1955.patch
        10 kB
        Raghu Angadi
      4. HADOOP-1955-branch14.patch
        11 kB
        Raghu Angadi
      5. HADOOP-1955.patch
        10 kB
        Raghu Angadi
      6. HADOOP-1955-branch14.patch
        11 kB
        Raghu Angadi
      7. HADOOP-1955.patch
        11 kB
        Raghu Angadi

          Activity

          dhruba borthakur added a comment -

          Hi Raghu, can you please take a look at this one and suggest what could be done? Thanks. Possible options are:

          1. Try different source locations for the replication.
          2. Delete the corrupted source replica.
          3. If all replicas are corrupt, stop replicating.

          Raghu Angadi added a comment -

          Yes, this is an issue.

          Koji, as a crude workaround, could you try reading the file? If reading succeeds, you could just manually remove the corrupt source block.

          Raghu Angadi added a comment -

          Koji, did the workaround help? I would like to know whether asking a different datanode to replicate would have helped in this case.

          Raghu Angadi added a comment -

          For 0.14.x, I will implement using a different source replica for each retry of the replication. I will discuss with Dhruba.

          Deleting corrupted replicas might be more involved for 0.14.x.
          Stopping replication after a few attempts requires a permanent structure in 'neededReplication'. This might be OK to do.

          Raghu Angadi added a comment -

          Even in my dev setup, the Namenode always asks the same datanode to replicate.
          I think the reason is that when there is not much work for the datanodes, computeDatanodeWork() always
          traverses from the first node to the last one. So in the normal case, it always asks the same node to replicate.

          Fixing this might be the simplest thing for 0.14.x. I will try out a patch.

          Raghu Angadi added a comment -

          When the NameNode is not heavily loaded, each call to computeDatanodeWork() goes through the nodes in the same order. A side effect of this is that it asks the same node to replicate a block each time it tries to replicate it.

          When computeDatanodeWork() runs through all the datanodes, this patch sets the start index for the next iteration to one past the datanode that was asked to replicate a block in the current iteration. This fixes the problem seen in this jira (assuming the untried replica was not corrupted). Initially I thought of starting at a random index, but on a large cluster it can take a very long time before the second node is tried, especially if the two source nodes are close to each other.

          If all the remaining replicas are corrupted, the Namenode will keep on trying. That's OK; we would like users to report such cases.
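
          To illustrate, a minimal sketch of the rotation described above, with hypothetical names and datanodes modeled as plain strings (the real change lives in FSNamesystem.java; this is not the actual patch code):

          import java.util.List;

          class ReplicationScanner {
            private final List<String> datanodes;
            private int startIndex = 0; // where the next full scan begins

            ReplicationScanner(List<String> datanodes) {
              this.datanodes = datanodes;
            }

            // One pass over all datanodes. Afterwards, the next scan starts one
            // position past the last node that was handed replication work, so
            // retries rotate through the available source replicas instead of
            // always landing on the same node.
            void computeDatanodeWork() {
              int n = datanodes.size();
              int lastAssigned = -1;
              for (int i = 0; i < n; i++) {
                int idx = (startIndex + i) % n;
                if (assignReplicationWork(datanodes.get(idx))) {
                  lastAssigned = idx;
                }
              }
              if (lastAssigned >= 0) {
                startIndex = (lastAssigned + 1) % n; // rotate the starting point
              }
            }

            // Placeholder: would hand any pending block transfers to this datanode
            // and report whether work was assigned.
            private boolean assignReplicationWork(String datanode) {
              return false;
            }
          }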

          dhruba borthakur added a comment -

          +1. Code looks good. It would be nice to have a unit test for this one.

          There should be a separate JIRA that allows detection & deletion of corrupted replicas. Can you please file that one (if it does not already exist) and link it to this one? Thanks.

          Raghu Angadi added a comment -

          I am adding a test case as part of TestPendingReplication.java.

          Koji Noguchi added a comment -

          > If all the remaining replicas are corrupted, the Namenode will keep on trying. That's OK; we would like users to report such cases.

          I didn't get this part. In my case, this infinite loop started when one datanode went down and the namenode started replicating. Does this mean the namenode will keep on trying until someone accesses the file and notices that it's corrupted?

          Koji Noguchi added a comment -

          > Koji, as a crude workaround, could you try reading the file? If reading succeeds, you could just manually remove the corrupt source block.

          Thanks Raghu. I haven't done this yet, but yes, this should work.

          Raghu Angadi added a comment -

          > In my case, this infinite loop started when one datanode went down and the namenode started replicating. Does this mean the namenode will keep on trying until someone accesses the file and notices that it's corrupted?

          Yes, if there is no valid replica. In your case it is not clear whether all the replicas are corrupted.

          With this patch, the Namenode will try all the remaining replicas when replicating a block.
          If none of them succeed (because all the replicas are corrupted), there is not much the
          Namenode can do about it. It will just keep on trying (every 10 minutes) until eventually someone
          notices the error.

          In your case, if there is a good replica, it will be used in subsequent retries.

          Raghu Angadi added a comment -

          The latest patch includes a unit test for proper replication with corrupted blocks.
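
          To illustrate, the corruption step in such a test might look roughly like the sketch below (hypothetical class and method names, not the actual test code): flip a byte in a replica's on-disk block file so that the transfer hits a checksum mismatch when this replica is later used as a replication source.

          import java.io.File;
          import java.io.RandomAccessFile;

          class BlockCorrupter {
            // Invert the first byte of the block file; its checksum no longer
            // matches, so a transfer sourced from this replica is rejected by
            // the receiving datanode.
            static void corruptBlockFile(File blockFile) throws Exception {
              RandomAccessFile raf = new RandomAccessFile(blockFile, "rw");
              try {
                int first = raf.read(); // read the first byte
                raf.seek(0);
                raf.write(first ^ 0xff); // write back its complement
              } finally {
                raf.close();
              }
            }
          }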

          The actual fix for the jira is confined to FSNamesystem.java. The rest of the changes are for the unit test.

          I reluctantly added two config variables (not public in hadoop-default.xml) to make this test stable. One is the timeout for pendingReplication, and the other is the timeout the datanode uses to delete failed blocks from its tmp directory. For the datanode, an alternative is to delete the tmp files immediately when a write fails. There is no reason to keep those failed blocks around.

          Dhruba, could you scan through the test and other changes? Thanks.

          dhruba borthakur added a comment -

          Code looks good. A few minor comments:

          1. This test corrupts replicas of a block and checks to see that the Namenode detects this situation and replicates the remaining good copy. Does it really belong to TestReplication? Maybe TestCrcCorruption makes more sense.
          2. The test waits up to 60 seconds in waitForBlockReplication. I vote that we remove this timeout and make the test wait indefinitely. It might not be a good idea to put ad hoc wait periods in the test. The test framework already enforces a timeout of 10 minutes per test.
          3. Why do we need dfs.replication in the conf?

          dhruba borthakur added a comment -

          Code looks good.

          However, one minor point is that none of our tests should have fixed timeout values. This might cause problems on different platforms. I would vote for removing the fixed time of 60 seconds and waiting indefinitely. I would like Nigel's input on this one.

          Raghu Angadi added a comment -

          The updated patches wait indefinitely; the only change is '60' to '-1'. One issue with not having a timeout is that when the test framework times out, we don't have access to the log, which makes it hard to debug.
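
          To illustrate the contract, a hypothetical sketch of such a wait helper, where a negative timeout means wait indefinitely (the helper name, polling interval, and failure behavior are assumptions, not the actual test code):

          import java.util.function.BooleanSupplier;

          class ReplicationWait {
            // Polls until the block reaches the expected replication. A negative
            // timeout waits forever; the test framework's own 10-minute limit
            // then becomes the only bound.
            static void waitForBlockReplication(BooleanSupplier replicated,
                                                long timeoutSeconds)
                throws InterruptedException {
              long deadline = System.currentTimeMillis() + timeoutSeconds * 1000;
              while (!replicated.getAsBoolean()) {
                if (timeoutSeconds >= 0 && System.currentTimeMillis() > deadline) {
                  throw new AssertionError(
                      "block was not replicated within " + timeoutSeconds + "s");
                }
                Thread.sleep(500); // poll twice a second
              }
            }
          }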

          dhruba borthakur added a comment -

          I like this. +1.

          Raghu Angadi added a comment -

          > 1. This test corrupts replicas of a block and checks to see that the Namenode detects this situation and replicates the remaining good copy. Does it really belong to TestReplication? Maybe TestCrcCorruption makes more sense.

          Sorry, I missed this one earlier. The fix is for the replication code. Corrupting the block is just one way of making sure that the test fails if the Namenode does not use all available replicas. This is a test of the Namenode's replication policy.

          Raghu Angadi added a comment -

          Also, there is TestReplication.java; maybe we should move the test there. Let me know.
          Currently it is in TestPendingReplication.java.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12367036/HADOOP-1955.patch
          against trunk revision r581745.

          @author +1. The patch does not contain any @author tags.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new compiler warnings.

          findbugs -1. The patch appears to introduce 1 new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests -1. The patch failed contrib unit tests.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/882/testReport/
          Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/882/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/882/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/882/console

          This message is automatically generated.

          Raghu Angadi added a comment -

          Fixed the findbugs warning.

          Also moved the unit test from TestPendingReplication.java to TestReplication.java.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12367096/HADOOP-1955.patch
          against trunk revision r581982.

          @author +1. The patch does not contain any @author tags.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new compiler warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests -1. The patch failed contrib unit tests.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/887/testReport/
          Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/887/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/887/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/887/console

          This message is automatically generated.

          Raghu Angadi added a comment -

          +1, finally! The failed contrib test is not related to this patch.

          dhruba borthakur added a comment -

          I just committed this. Thanks Raghu.

          Hudson added a comment -

          Integrated in Hadoop-Nightly #261 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/261/)
          Koji Noguchi added a comment -

          when all replicas are corrupt


            People

             • Assignee: Raghu Angadi
             • Reporter: Koji Noguchi
             • Votes: 0
             • Watchers: 0
