Hadoop Common / HADOOP-3050

Cluster falls into an infinite loop trying to replicate a block to a target that already has this replica.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.17.0
    • Fix Version/s: 0.17.0
    • Component/s: None
    • Labels: None

      Description

      This happened during a test run by Hudson, so fortunately we have all the logs:
      http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1987/console
      Search for TestDecommission and look for block blk_167544198419718831, which is being replicated to node 127.0.0.1:65168 over and over again.
      The issue needs to be investigated. I am making it a blocker until it is.

      1. blockReport2.patch
        7 kB
        Hairong Kuang
      2. blockReport1.patch
        8 kB
        Hairong Kuang
      3. blockReport.patch
        0.6 kB
        Hairong Kuang
      4. FailedTestDecommission.log
        601 kB
        Konstantin Shvachko

        Activity

        Hudson added a comment -

        Integrated in Hadoop-trunk #451 (see http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/451/).

        Hairong Kuang added a comment -

        I just committed the patch.

        Hairong Kuang added a comment -

        TestDecommission triggers this bug only once in a while, so no unit test is provided.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12379315/blockReport2.patch
        against trunk revision 643282.

        @author +1. The patch does not contain any @author tags.

        tests included -1. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        javadoc +1. The javadoc tool did not generate any warning messages.

        javac +1. The applied patch does not generate any new javac compiler warnings.

        release audit +1. The applied patch does not generate any new release audit warnings.

        findbugs +1. The patch does not introduce any new Findbugs warnings.

        core tests +1. The patch passed core unit tests.

        contrib tests +1. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2151/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2151/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2151/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2151/console

        This message is automatically generated.

        Sanjay Radia added a comment -

        The code is fine.
        +1

        Hairong Kuang added a comment -

        This patch makes sure that the initial block report is sent once and only once.

        Hairong Kuang added a comment -

        I ended up fixing more problems associated with sending block reports:
        1. The datanode does not send the initial block report until requested by the namenode;
        2. The namenode asks a datanode to send a block report, as a reply to a heartbeat, once the datanode's network location is resolved;
        3. Add a static field R of type Random to DataNode and replace all uses of new Random() with R.
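
        The three items above can be illustrated with a minimal Java sketch. It is not the committed patch: SketchDataNode, HeartbeatReply, onHeartbeatReply, and randomDelayMillis are hypothetical names chosen for illustration, and only the static field R of type Random is taken from the comment itself.

```java
import java.util.Random;

/** Hypothetical commands a namenode could return in a heartbeat reply. */
enum HeartbeatReply { NONE, SEND_BLOCK_REPORT }

/** Hypothetical datanode-side state; this is not Hadoop's real DataNode class. */
class SketchDataNode {
    // Item 3: one shared generator instead of a new Random() at every call site.
    static final Random R = new Random();

    private boolean initialReportSent = false;
    private int reportsSent = 0;

    /**
     * Items 1 and 2: the initial block report is held back until a heartbeat
     * reply asks for it, and it is then sent only once.
     */
    void onHeartbeatReply(HeartbeatReply reply) {
        if (reply == HeartbeatReply.SEND_BLOCK_REPORT && !initialReportSent) {
            sendBlockReport();
            initialReportSent = true;
        }
    }

    /** Stand-in for the real RPC; here it only counts the reports. */
    private void sendBlockReport() {
        reportsSent++;
    }

    int reportsSent() {
        return reportsSent;
    }

    /** Draws a delay in [0, maxMillis) from the shared generator; maxMillis must be positive. */
    static long randomDelayMillis(int maxMillis) {
        return R.nextInt(maxMillis);
    }
}
```

        In this sketch a datanode that receives SEND_BLOCK_REPORT twice still sends a single initial report, which is the "once and only once" behavior the patch aims for.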

        Hairong Kuang added a comment -

        I have run TestDecommission with the patch 50 times in a row on my Linux box without seeing any failure.

        Hairong Kuang added a comment -

        It looks like the problem is caused by the flag that indicates whether a block report has been processed not being reset to false when a datanode re-registers. Therefore, the namenode does not ask for a block report when the datanode's network location is resolved.
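
        A minimal Java sketch of this root cause, under stated assumptions: SketchDatanodeRecord and its members are hypothetical names, not Hadoop's real namenode classes, and the resetFlag parameter exists only to contrast the buggy and fixed behavior side by side.

```java
/** Hypothetical namenode-side record for one datanode. */
class SketchDatanodeRecord {
    boolean locationResolved = false;
    boolean blockReportProcessed = false;

    /**
     * Registration or re-registration of the datanode.
     * resetFlag == false reproduces the reported bug: the stale
     * "report processed" state survives the re-registration.
     */
    void register(boolean resetFlag) {
        locationResolved = false;         // the network location must be resolved again
        if (resetFlag) {
            blockReportProcessed = false; // the fix: forget the old report state
        }
    }

    /** True when the heartbeat reply should ask the datanode for a block report. */
    boolean heartbeatRequestsReport() {
        return locationResolved && !blockReportProcessed;
    }
}
```

        With register(false), heartbeatRequestsReport() stays false after the location is resolved, so the namenode never asks for the report it is missing. With register(true), the next heartbeat after resolution triggers the request.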

        Hairong Kuang added a comment -

        After examining the log, it looks like we got the following scenario:
        1. blk_167544198419718831 was replicated to datanode 1, datanode 2, and datanode 3.
        2. Datanode 1 lost contact with the namenode, and datanode 2 was scheduled to be decommissioned.
        3. Datanode 1 re-registered with the namenode, but its block report came in before its network location was resolved, so the block report was dropped.
        4. Because the namenode did not know that datanode 1 had blk_167544198419718831, it scheduled replication of the block to datanode 1 and datanode 4.
        5. The replication of the block to datanode 1 failed because datanode 1 already had the block.
        6. No additional block report was received until the end of the log, so the block replication kept on failing.
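
        Steps 3 to 6 can be reproduced in a toy Java model. This is only a sketch for reasoning about the log: SketchNameNode, ReplicationLoopSketch, and all method names are hypothetical, the real namenode logic is far more involved, and maxRounds is an artificial bound added so the loop terminates.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy namenode view of which datanodes hold which blocks. */
class SketchNameNode {
    private final Map<String, Set<String>> knownReplicas = new HashMap<>(); // block -> nodes
    private final Set<String> resolvedNodes = new HashSet<>();

    void resolveLocation(String node) {
        resolvedNodes.add(node);
    }

    /** Returns false when the report is dropped because the node's location is unresolved. */
    boolean processBlockReport(String node, Set<String> blocks) {
        if (!resolvedNodes.contains(node)) {
            return false; // step 3: the report arrived too early and is discarded
        }
        for (String b : blocks) {
            knownReplicas.computeIfAbsent(b, k -> new HashSet<>()).add(node);
        }
        return true;
    }

    /** A node is chosen as a target only if the namenode believes it lacks the block. */
    boolean wouldChooseAsTarget(String node, String block) {
        return !knownReplicas.getOrDefault(block, Collections.<String>emptySet()).contains(node);
    }
}

class ReplicationLoopSketch {
    /** Counts failed attempts to copy a block to a node that already stores it. */
    static int failedAttempts(SketchNameNode nn, String node, String block,
                              Set<String> blocksOnNode, int maxRounds) {
        int failures = 0;
        for (int round = 0; round < maxRounds; round++) {
            if (!nn.wouldChooseAsTarget(node, block)) {
                break;      // the namenode knows about the replica, nothing to schedule
            }
            if (blocksOnNode.contains(block)) {
                failures++; // steps 4 and 5: the target rejects the copy, nothing is learned
            }
        }
        return failures;
    }
}
```

        As long as the dropped report is never re-sent, every round fails the same way, so the number of failures equals maxRounds. Once processBlockReport succeeds for the node, the node is no longer chosen as a target and the loop stops immediately.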

        Konstantin Shvachko added a comment -

        Here is the log of the failed test.

        Nigel Daley added a comment -

        Konstantin, builds don't stay around forever on Hudson. I suggest copying the relevant pieces into a text file and attaching it to this issue.


          People

          • Assignee:
            Hairong Kuang
            Reporter:
            Konstantin Shvachko
          • Votes:
            0
            Watchers:
            0
