Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.19.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      TestDatanodeDeath fails intermittently. For example, see
      http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

      Attachments

    1. primaryDeath.patch
        3 kB
        dhruba borthakur
      2. primaryDeath.patch
        3 kB
        dhruba borthakur
      3. primaryDeath_0.18.patch
        13 kB
        Tsz Wo Nicholas Sze
      4. deathlog.patch
        1.0 kB
        dhruba borthakur

        Issue Links

          Activity

          Tsz Wo Nicholas Sze added a comment -

          Another example, http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3344/

          Robert Chansler added a comment -

          Gotta be in 0.19!

          Tsz Wo Nicholas Sze added a comment -

          It turns out that TestDatanodeDeath.testSimple1() fails. Below are some recent failures in Hudson.
          http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3456/testReport/
          http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3453/testReport/

          dhruba borthakur added a comment -

          This test failed because the second datanode was killed, but the client wrongly detected that some other datanode was dead. This is documented in HADOOP-3416.

          Tsz Wo Nicholas Sze added a comment -

          So, are you saying that this is a bug in the test, Dhruba? If so, that is good news.

          dhruba borthakur added a comment -

          It is likely to be a problem with the test. In real life, if a datanode in the pipeline dies, it is very likely that all the threads in that Datanode become inactive, so it will not matter whether the block-receiver thread or the packet-responder thread exited first.

          Raghu Angadi added a comment -

          HADOOP-3416 does not explain this. 3416 is a bug in DFSClient that results in marking the first datanode as the bad one. I think the dependency of this jira on HADOOP-3416 should be removed.

          dhruba borthakur added a comment -

          If you look at the log attached to this JIRA, one can clearly see that the second datanode is being killed by the test. However, the killing is done by sending interrupts to the second datanode. The PacketResponder thread of the second datanode sends back -2 to the first datanode because it encounters this InterruptedException. The first datanode thinks that the second datanode is good because it did send back a response, so it concludes that the bad datanode must be the third one in the list.
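
          The race can be illustrated with a minimal, self-contained sketch (plain Java threads, not actual HDFS code; all class and field names here are illustrative, and the -2 value simply mirrors the status seen in the attached log): B's responder catches the interrupt and still manages to ack upstream, so A concludes that B is alive and blames C.

            import java.util.concurrent.ArrayBlockingQueue;
            import java.util.concurrent.BlockingQueue;

            public class WrongNodeBlamedSketch {
                // Mirrors the "-2" status in the attached log; the value itself is illustrative.
                static final int INTERRUPTED_STATUS = -2;

                public static void main(String[] args) throws Exception {
                    BlockingQueue<Integer> ackFromB = new ArrayBlockingQueue<>(1);

                    // "Datanode B": its responder thread catches the interrupt and still acks upstream.
                    Thread responderOnB = new Thread(() -> {
                        try {
                            Thread.sleep(Long.MAX_VALUE);        // waiting on the downstream ack from C
                        } catch (InterruptedException e) {
                            ackFromB.offer(INTERRUPTED_STATUS);  // still manages to send a reply to A
                        }
                    });
                    responderOnB.start();

                    responderOnB.interrupt();                    // the test "kills" B with an interrupt

                    // "Datanode A": it received *some* reply from B, so B looks alive and C gets blamed.
                    int status = ackFromB.take();
                    System.out.println("A got status " + status
                        + " from B; since B replied, A wrongly blames the third datanode (C).");
                }
            }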

          Raghu Angadi added a comment -

          That is probably what happened. That also implies this is unrelated to HADOOP-3416, no?

          Raghu Angadi added a comment -

          Also, do we know why multiple attempts to write after that have failed?

          e.g.

           2008-10-14 01:10:56,862 WARN hdfs.DFSClient (DFSClient.java:processDatanodeError(2469)) -
           Error Recovery for block blk_-6934002956802922056_1001 failed because recovery from
           primary datanode 127.0.0.1:61531 failed 3 times. Will retry...
          Sameer Paranjpye added a comment -

          > It is likely to be a problem with the test. In real life, if a datanode in the pipeline dies, it is very likely that all the threads in that Datanode become
          > inactive, so it will not matter whether the block-receiver thread or the packet-responder thread exited first.

          The client should be able to do a better job of detecting which Datanode died, no? The current implementation means that if a Datanode in a write pipeline dies, the writer will frequently fail. Datanode death is not a very frequent event, so this is likely not a blocker, but I'd say that our test is triggering a known issue rather than that it has a bug.

          dhruba borthakur added a comment -

          @Sameer: I agree that our test is triggering a known issue that is more likely to occur when the BlockReceiver thread dies but the PacketResponder thread is still able to communicate a response to the upstream datanode (or client). This case typically does not occur in real life, because when a datanode dies, all its threads cannot communicate with other datanodes in the pipeline. I am saying that this is not likely to be a blocker for 0.19; but on the other hand, if it is occurring frequently while running unit tests, it is better to fix it quickly. I am thinking of changing the way the unit test kills a datanode. BTW, do you ever see this scenario being triggered in real-life cases, e.g. GridMix and/or performance benchmarks?

          @Raghu: I was thinking that both this one and HADOOP-3416 refer to the fact that the method employed by the unit test to kill datanodes is not very deterministic, and that is the reason why this JIRA is related to 3416. Fixing one issue probably fixes the other one too. Let me do some investigation on how to fix it.
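
          A minimal, self-contained sketch of the more deterministic kill suggested above (plain Java threads rather than MiniDFSCluster or real datanode code; all names are illustrative): if the whole simulated node goes down at once, neither its receiver nor its responder can ack upstream, so the upstream node times out and correctly blames B instead of C.

            import java.util.concurrent.ArrayBlockingQueue;
            import java.util.concurrent.BlockingQueue;
            import java.util.concurrent.TimeUnit;
            import java.util.concurrent.atomic.AtomicBoolean;

            public class WholeNodeKillSketch {
                public static void main(String[] args) throws Exception {
                    AtomicBoolean nodeBAlive = new AtomicBoolean(true);
                    BlockingQueue<Integer> ackFromB = new ArrayBlockingQueue<>(1);

                    // "Datanode B" responder: it only acks upstream while the whole node is alive.
                    Thread responderOnB = new Thread(() -> {
                        while (nodeBAlive.get()) {
                            // would normally forward acks received from C here
                            try { Thread.sleep(10); } catch (InterruptedException e) { return; }
                        }
                        // the node is down: exit without sending anything upstream
                    });
                    responderOnB.start();

                    nodeBAlive.set(false);          // kill the whole node, not just one thread

                    // "Datanode A": no ack arrives before the timeout, so A correctly blames B.
                    Integer status = ackFromB.poll(200, TimeUnit.MILLISECONDS);
                    System.out.println(status == null
                        ? "A timed out waiting for B and correctly marks B as the bad datanode."
                        : "A got an unexpected ack: " + status);
                    responderOnB.join();
                }
            }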

          Tsz Wo Nicholas Sze added a comment -

          It seems that it will be more problematic if replication is > 3, since it is harder for the client to tell which datanode is failing.

          Raghu Angadi added a comment -

          The fact that TestDatanodeDeath is unpredictable is probably a good thing, though it is painful to diagnose. It points out rare race conditions.

          HADOOP-3416 is about an actual bug in DFSClient and not a bug in the test. I don't think this jira is a bug in the test either. I see a couple of issues pointed out by this failure:

          • The datanode (not the client) does not detect the problem datanode in the pipeline in some cases.
            • We could live with this, since the write should succeed (though the block gets replicated to fewer datanodes than was possible). This is also the reason why HADOOP-3416 was not a blocker but HADOOP-3339 was.
          • In such cases, DFSClient does not seem to be able to recover. I think this is a pretty important bug to fix, since it will result in hard failures on the write.

          Does this sound correct?

          Sameer Paranjpye added a comment -

          @Dhruba: I agree that this is not a blocker for 0.19. The out-of-phase thread deaths don't typically occur in real deployments. Also, we haven't yet observed this condition occurring frequently on our grids.

          However, I think there are real deficiencies in error recovery for HDFS writes.

          1. the client does not correctly detect which link in the write pipeline failed
          2. the client tries to initiate block recovery from the dead Datanode, fails to do so and causes the write to fail. This is mostly due to 1. but can also occur if the recovery primary fails following a link failure.

          Ideally, a writer should fail only if

          1. the writer itself dies for some reason
          2. the writer loses all of its replicas

          This should be the subject of a different JIRA but I think we should spend some energy making it happen. For this issue, the best course might be to disable testSimple until we have a complete recovery story.

          dhruba borthakur added a comment -

          Thanks Sameer for your suggestion.

          I followed Raghu's suggestion of investigating why the client never recovered. There were three datanodes A, B and C in the pipeline. The test killed B. The client thought that C was killed.

          The client designated A as the primary datanode. It made a recoverBlock RPC to the primary datanode. The primary datanode, in turn, should have made a recoverBlock RPC to itself and B. However, this did not occur. Looking at the logs, it appears that the primary datanode tried to make an RPC only to B. This failed (again and again) and the primary datanode returned an error to the client. The client then aborted.

          dhruba borthakur added a comment -

          I am switching on additional debugging for the unit test. I will commit it soon so that we can get to the bottom of the real problem.

          dhruba borthakur added a comment -

          If the recoverBlock RPC to the primary datanode fails, then mark it as bad, remove it from the pipeline, and continue recovery.
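
          A hedged sketch of the failover loop this describes (hypothetical class and method names, not the actual DFSClient code): if the recoverBlock call to the chosen primary fails, drop that node from the candidate list and retry with the next one, instead of retrying the same dead primary until the client gives up.

            import java.io.IOException;
            import java.util.ArrayList;
            import java.util.List;

            public class PrimaryFailoverSketch {

                /** Stand-in for a datanode proxy; the real client would call it over RPC. */
                interface DatanodeStub {
                    void recoverBlock(String blockId) throws IOException;
                    String name();
                }

                /** Try each candidate as primary in turn; skip any primary whose recoverBlock call fails. */
                static boolean recoverWithFailover(List<DatanodeStub> pipeline, String blockId) {
                    List<DatanodeStub> candidates = new ArrayList<>(pipeline);
                    while (!candidates.isEmpty()) {
                        DatanodeStub primary = candidates.get(0);   // first remaining node acts as primary
                        try {
                            primary.recoverBlock(blockId);          // ask the primary to run block recovery
                            return true;                            // recovery succeeded
                        } catch (IOException e) {
                            // The chosen primary itself is bad: mark it bad by dropping it and try the next node.
                            System.err.println("recoverBlock failed on " + primary.name() + "; trying the next candidate");
                            candidates.remove(0);
                        }
                    }
                    return false;                                   // no datanode could run recovery
                }

                public static void main(String[] args) {
                    DatanodeStub deadPrimary = new DatanodeStub() {
                        public void recoverBlock(String blockId) throws IOException { throw new IOException("node is down"); }
                        public String name() { return "A"; }
                    };
                    DatanodeStub livePrimary = new DatanodeStub() {
                        public void recoverBlock(String blockId) { /* succeeds */ }
                        public String name() { return "C"; }
                    };
                    System.out.println("recovered = " + recoverWithFailover(List.of(deadPrimary, livePrimary), "blk_1"));
                }
            }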

          dhruba borthakur added a comment -

          Merged patch with trunk.

          Hudson added a comment -

          Integrated in Hadoop-trunk #635 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/635/).
          Increase debug logging for unit test TestDatanodeDeath. (dhruba)

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12392231/primaryDeath.patch
          against trunk revision 705215.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3475/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3475/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3475/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3475/console

          This message is automatically generated.

          Tsz Wo Nicholas Sze added a comment -

          +1 patch looks good. I ran the test 10 times and all of them passed.

          Tsz Wo Nicholas Sze added a comment -

          I just committed this. Thanks Dhruba!

          Hudson added a comment -

          Integrated in Hadoop-trunk #637 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/637/).
          If the primary datanode fails in DFSClient, remove it from the pipeline. (dhruba via szetszwo)

          Tsz Wo Nicholas Sze added a comment -

          This fix should also be committed to 0.18.

          Tsz Wo Nicholas Sze added a comment -

          primaryDeath_0.18.patch: for 0.18

          Tsz Wo Nicholas Sze added a comment -

          primaryDeath_0.18.patch does not make sense unless the DFSClient.processDatanodeError(..) changes in HADOOP-1700 are checked in to 0.18.

          Dhruba, should the DFSClient.processDatanodeError(..) changes from HADOOP-1700 be committed to 0.18?


            People

            • Assignee:
              dhruba borthakur
            • Reporter:
              Tsz Wo Nicholas Sze
            • Votes:
              0
            • Watchers:
              1

              Dates

              • Created:
              • Updated:
              • Resolved:

                Development