Hadoop Common / HADOOP-3050

Cluster falls into an infinite loop trying to replicate a block to a target that already has this replica.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.17.0
    • Fix Version/s: 0.17.0
    • Component/s: None
    • Labels:
      None

      Description

      This happened during a test run by Hudson, so fortunately all the logs are available:
      http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1987/console
      Search for TestDecommission and look for block blk_167544198419718831, which is replicated to node 127.0.0.1:65168 over and over again.
      The issue needs to be investigated. I am making it a blocker until it is.

      1. FailedTestDecommission.log
        601 kB
        Konstantin Shvachko
      2. blockReport2.patch
        7 kB
        Hairong Kuang
      3. blockReport1.patch
        8 kB
        Hairong Kuang
      4. blockReport.patch
        0.6 kB
        Hairong Kuang

        Activity

        Konstantin Shvachko created issue -
        Nigel Daley added a comment -

        Konstantin, builds don't stay around forever on Hudson. I suggest copying the relevant pieces into a text file and attaching it to this issue.

        Konstantin Shvachko added a comment -

        Here is the log of the failed test.

        Konstantin Shvachko made changes -
        Field Original Value New Value
        Attachment FailedTestDecommission.log [ 12378330 ]
        Hairong Kuang made changes -
        Assignee Hairong Kuang [ hairong ]
        Robert Chansler made changes -
        Component/s dfs [ 12310710 ]
        Hairong Kuang added a comment -

        After examining the log, it looks like we hit the following scenario:
        1. blk_167544198419718831 was replicated to datanode 1, datanode 2, and datanode 3.
        2. Datanode 1 lost contact with the namenode, and datanode 2 was scheduled to be decommissioned.
        3. Datanode 1 re-registered with the namenode, but its block report came in before its network location was resolved, so the block report was dropped.
        4. Because the namenode did not know that datanode 1 had blk_167544198419718831, it scheduled replication of the block to datanode 1 and datanode 4.
        5. The replication to datanode 1 failed because datanode 1 already had the block.
        6. No additional block report was received before the end of the log, so the block replication kept failing.
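
The scenario above can be reproduced in a toy model. The sketch below is not Hadoop code: the class, its fields, and the target-selection rule are all assumptions made for illustration, and it assumes that the transfer to datanode 4 succeeds. It only shows why the namenode keeps picking datanode 1 until a block report corrects its view.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy model of the scenario; none of these names are Hadoop classes.
class ReplicationLoopSketch {
    static final int REPLICATION_FACTOR = 3;

    // Replicas the namenode believes exist (dn1 is missing: its block report was dropped).
    final Set<String> knownHolders = new LinkedHashSet<>(Arrays.asList("dn2", "dn3"));
    // Replicas that really exist on the datanodes.
    final Set<String> actualHolders = new LinkedHashSet<>(Arrays.asList("dn1", "dn2", "dn3"));
    final Set<String> decommissioning = Collections.singleton("dn2");
    final List<String> allNodes = Arrays.asList("dn1", "dn2", "dn3", "dn4");

    // Known replicas that are not on a decommissioning node.
    int liveKnownReplicas() {
        int n = 0;
        for (String dn : knownHolders) {
            if (!decommissioning.contains(dn)) n++;
        }
        return n;
    }

    // One replication round; returns the targets whose transfer failed.
    List<String> replicateOnce() {
        List<String> failed = new ArrayList<>();
        int needed = REPLICATION_FACTOR - liveKnownReplicas();
        for (String target : allNodes) {
            if (needed <= 0) break;
            if (knownHolders.contains(target) || decommissioning.contains(target)) continue;
            needed--;
            if (actualHolders.contains(target)) {
                failed.add(target);           // target already has the replica, transfer fails
            } else {
                actualHolders.add(target);    // transfer succeeds
                knownHolders.add(target);     // and the namenode learns about the new replica
            }
        }
        return failed;
    }

    // A processed block report brings the namenode's view in line with reality.
    void processBlockReport(String dn) {
        if (actualHolders.contains(dn)) knownHolders.add(dn);
    }
}
```

In this model every call to `replicateOnce()` fails on `dn1` until `processBlockReport("dn1")` is called, after which no further replication is scheduled.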

        Hairong Kuang added a comment -

        It looks like the problem is that the flag indicating whether a block report has been processed is not reset to false when a datanode re-registers. As a result, the namenode does not ask for a block report when the datanode's network location is resolved.
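
A minimal sketch of the flag problem, under stated assumptions: `DatanodeRecord`, `RegistrationSketch`, and all of their members are hypothetical names, not the identifiers used in the namenode. The `resetFlagOnRegister` parameter switches between the buggy behaviour (flag left untouched) and the behaviour described as the fix (flag reset on re-registration).

```java
// Hypothetical sketch; class, field, and method names are illustrative, not Hadoop's.
class DatanodeRecord {
    boolean locationResolved = false;
    boolean blockReportProcessed = false;
}

class RegistrationSketch {
    // resetFlagOnRegister = false models the bug, true models the fix.
    static void register(DatanodeRecord dn, boolean resetFlagOnRegister) {
        dn.locationResolved = false;          // location is resolved again after (re-)registration
        if (resetFlagOnRegister) {
            dn.blockReportProcessed = false;  // forget the report from the previous registration
        }
    }

    // A report that arrives before the location is resolved is dropped.
    // Returns true if the report was processed.
    static boolean receiveBlockReport(DatanodeRecord dn) {
        if (!dn.locationResolved) return false;
        dn.blockReportProcessed = true;
        return true;
    }

    // Called when resolution completes; returns true if the namenode
    // should ask the datanode for a block report.
    static boolean onLocationResolved(DatanodeRecord dn) {
        dn.locationResolved = true;
        return !dn.blockReportProcessed;
    }
}
```

With `resetFlagOnRegister` set to false, a datanode whose earlier report was processed re-registers, has its new report dropped, and is never asked for another one, because the stale flag still says a report was processed.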

        Hairong Kuang made changes -
        Attachment blockReport.patch [ 12379069 ]
        Hairong Kuang added a comment -

        I have run TestDecommission with the patch 50 times in a row on my Linux box without seeing any failures.

        Hairong Kuang added a comment -

        I ended up fixing more problems associated with sending block reports:
        1. A datanode does not send its initial block report until requested by the namenode;
        2. The namenode asks a datanode to send a block report, as a reply to a heartbeat, once the datanode's network location is resolved;
        3. Added a static field R of type Random to DataNode and replaced all uses of new Random() with R.
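
The three changes listed above could look roughly like the following. This is a sketch under assumptions, not the patch itself: `Command`, `NamenodeSide`, `DatanodeSide`, and their methods are hypothetical, the rule that the request is issued only once is an assumption of the model, and the source does not say what R is used for, so `randomDelay` is purely illustrative. Only the field name `R` and its type come from the comment.

```java
import java.util.Random;

// Hypothetical sketch; names other than R are illustrative and not taken from the patch.
enum Command { NONE, SEND_BLOCK_REPORT }

class NamenodeSide {
    private boolean locationResolved = false;
    private boolean reportRequested = false;

    void resolveLocation() {
        locationResolved = true;
    }

    // Heartbeat reply: ask for the block report once the location is known.
    Command heartbeat() {
        if (locationResolved && !reportRequested) {
            reportRequested = true;
            return Command.SEND_BLOCK_REPORT;
        }
        return Command.NONE;
    }
}

class DatanodeSide {
    // One shared generator instead of a new Random() at each call site.
    static final Random R = new Random();

    int initialReportsSent = 0;

    // Illustrative use of the shared generator: a delay in [0, boundMillis).
    static int randomDelay(int boundMillis) {
        return R.nextInt(boundMillis);
    }

    // Send a heartbeat and act on the reply; the initial block report is
    // sent only when the namenode asks for it.
    void heartbeatOnce(NamenodeSide nn) {
        if (nn.heartbeat() == Command.SEND_BLOCK_REPORT) {
            initialReportsSent++;
        }
    }
}
```

Sharing one `Random` avoids creating a generator per call; `java.util.Random` is safe to use from multiple threads, although heavy contention can slow it down.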

        Hairong Kuang made changes -
        Attachment blockReport1.patch [ 12379170 ]
        Hairong Kuang made changes -
        Affects Version/s 0.17.0 [ 12312913 ]
        Fix Version/s 0.17.0 [ 12312913 ]
        Affects Version/s 0.16.2 [ 12313051 ]
        Hairong Kuang added a comment -

        This patch makes sure that the initial block report is sent once and only once.

        Hairong Kuang made changes -
        Attachment blockReport2.patch [ 12379281 ]
        Hairong Kuang made changes -
        Attachment blockReport2.patch [ 12379281 ]
        Hairong Kuang made changes -
        Attachment blockReport2.patch [ 12379291 ]
        Hairong Kuang made changes -
        Attachment blockReport2.patch [ 12379291 ]
        Hairong Kuang made changes -
        Attachment blockReport2.patch [ 12379315 ]
        Hairong Kuang made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Affects Version/s 0.16.2 [ 12313051 ]
        Affects Version/s 0.17.0 [ 12312913 ]
        Sanjay Radia added a comment -

        The code is fine.
        +1

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12379315/blockReport2.patch
        against trunk revision 643282.

        @author +1. The patch does not contain any @author tags.

        tests included -1. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        javadoc +1. The javadoc tool did not generate any warning messages.

        javac +1. The applied patch does not generate any new javac compiler warnings.

        release audit +1. The applied patch does not generate any new release audit warnings.

        findbugs +1. The patch does not introduce any new Findbugs warnings.

        core tests +1. The patch passed core unit tests.

        contrib tests +1. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2151/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2151/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2151/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2151/console

        This message is automatically generated.

        Hairong Kuang added a comment -

        TestDecommission triggers this bug once in a while, so no new unit test is provided.

        Nigel Daley made changes -
        Fix Version/s 0.17.0 [ 12312913 ]
        Fix Version/s 0.16.3 [ 12313092 ]
        Hairong Kuang added a comment -

        I just committed the patch.

        Hairong Kuang made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Fix Version/s 0.16.3 [ 12313092 ]
        Fix Version/s 0.17.0 [ 12312913 ]
        Resolution Fixed [ 1 ]
        Hudson added a comment -

        Integrated in Hadoop-trunk #451 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/451/ )

        Nigel Daley made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Owen O'Malley made changes -
        Component/s dfs [ 12310710 ]

          People

          • Assignee:
            Hairong Kuang
            Reporter:
            Konstantin Shvachko