Hadoop HDFS / HDFS-1467

Append pipeline never succeeds with more than one replica

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.22.0
    • Fix Version/s: 0.22.0, 0.23.0
    • Component/s: datanode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      TestPipelines appears to be failing on trunk:

      Should be RBW replica after sequence of calls append()/write()/hflush() expected:<RBW> but was:<FINALIZED>
      junit.framework.AssertionFailedError: Should be RBW replica after sequence of calls append()/write()/hflush() expected:<RBW> but was:<FINALIZED>
      at org.apache.hadoop.hdfs.TestPipelines.pipeline_01(TestPipelines.java:109)

      Attachments

      1. failed-TestPipelines.txt
        213 kB
        Todd Lipcon
      2. hdfs-1467.txt
        2 kB
        Todd Lipcon
      3. hdfs-1467-fixed.txt
        3 kB
        Todd Lipcon

          Activity

          Todd Lipcon added a comment -

          It turns out there's a really big bug in pipeline construction for append... I'm not sure why more tests aren't failing more often!

          Currently in BlockReceiver's constructor, we copy the reference to the provided block, and increment its generation stamp to the newGs. But, this is just a reference to the same Block object used in DataXceiver.opWriteBlock. So, when the DN tries to set up the append pipeline to its downstream mirror, it ends up passing the new generation stamp version of the block, and the downstream mirror will fail to construct a pipeline.

          The end result is that we can never successfully construct a pipeline for append with more than one datanode in it! This test would fail about 2/3 of the time since it would see the old replica on one of the nodes that didn't make it into the append pipeline.

          This patch fixes the BlockReceiver constructor to take a copy of the block, and TestPipelines seems to pass now. I also added some extra debug logging just to illustrate the problem.

          This probably should not be committed until we can add another test which shows the problem 100% of the time.
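          To make the aliasing concrete, here is a minimal, self-contained sketch of the failure mode. The classes are simplified stand-ins for illustration, not the actual Hadoop types; the block id and generation stamps are taken from the attached failure log.

              // Simplified stand-ins for illustration only; not the real HDFS classes.
              class Block {
                long blockId;
                long generationStamp;
                Block(long blockId, long generationStamp) {
                  this.blockId = blockId;
                  this.generationStamp = generationStamp;
                }
              }

              class BlockReceiver {
                private final Block block;
                BlockReceiver(Block block, long newGs) {
                  this.block = block;                  // copies the reference, not the object
                  // ... append to the local replica at the old genstamp (fine on this node) ...
                  this.block.generationStamp = newGs;  // mutates the caller's Block too
                }
              }

              public class AliasingDemo {
                public static void main(String[] args) {
                  Block b = new Block(3950121169366352479L, 1001);
                  new BlockReceiver(b, 1002);
                  // The same object is then forwarded to the downstream mirror, which is asked
                  // for blk_..._1002 and cannot find a _1001 replica to append to.
                  System.out.println("forwarded genstamp = " + b.generationStamp); // prints 1002
                }
              }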

          Todd Lipcon added a comment -

          Somehow I managed to attach a patch for the wrong issue last time around.

          Konstantin Shvachko added a comment -

          I don't quite understand what is wrong. I see we pass the old block and the newGs to
          datanode.data.append(block, newGs, minBytesRcvd);
          and only after that do we set newGs on the old block. Looks good to me so far.

          > we can never successfully construct a pipeline for append with more than one datanode in it!

          I can testify this had been tested. Can we try to trace it back - not to blame anybody, but to restate the historical events?
          Maybe Hairong can shed some light on it.

          Todd Lipcon added a comment -

          Hey Konstantin. Here's the call sequence:

          • DataXceiver.opWriteBlock(in, block) (DataXceiver.java:219)
            • DataXceiver.java:270: call BlockReceiver constructor with block object from opWriteBlock argument
              • BlockReceiver.java:91 - we copy block to member variable
              • BlockReceiver.java:114 - we call datanode.data.append(...) with correct generation stamp block - so this is fine on this node
              • BlockReceiver.java:118 - block.setGenerationStamp(newGs) - affects the same block object from above
            • DataXceiver.java:304 - DataTransferProtocol.Sender.opWriteBlock with same block object, which now has the new generation stamp.

          Thus, the first DN in the pipeline correctly appends to the block, but the second one fails, since it's asked to append to the new GS block, not the old.

          You can see this in the attached test failure log. On line 887, the first datanode in the pipeline correctly opens replica for append:
          2010-10-19 20:09:21,075 INFO datanode.DataNode (FSDataset.java:append(1101)) - Appending to replica FinalizedReplica, blk_3950121169366352479_1001, FINALIZED
          But then when opWriteBlock is called for the second replica on line 901, it has the new genstamp:
          2010-10-19 20:09:21,106 INFO datanode.DataNode (DataXceiver.java:opWriteBlock(231)) - Receiving block blk_3950121169366352479_1002 src: /127.0.0.1:59098 dest: /127.0.0.1:54152

          I'll dig through the version control history and see if I can figure out how this was introduced, if indeed it's a regression.
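          A rough sketch of what the first patch's approach amounts to, reusing the simplified Block stand-in from the earlier example (the real change is in the attached hdfs-1467.txt):

              // Sketch only: BlockReceiver keeps its own copy, so bumping the
              // generation stamp no longer leaks into the caller's Block object.
              class CopyingBlockReceiver {
                private final Block block;

                CopyingBlockReceiver(Block b, long newGs) {
                  this.block = new Block(b.blockId, b.generationStamp); // defensive copy
                  // ... append to the local replica at the old genstamp ...
                  this.block.generationStamp = newGs;                   // caller's 'b' is untouched
                }
              }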

          Todd Lipcon added a comment -

          Looks like HDFS-1407 is at fault here.

          Hairong Kuang added a comment -

          +1. This fix looks good to me.

          Hairong Kuang added a comment -

          I just committed this. Thanks Todd!

          Hairong Kuang added a comment -

          Note that this patch is also a trivial one. I compiled and ran TestPipelines before I committed this.

          Suresh Srinivas added a comment -

          Thanks for catching this bug Todd.

          Todd Lipcon added a comment -

          No problem. Can you guys suggest any more unit tests we might be able to add that would fail more reliably for something like this? It's worrisome that we didn't catch this for two months!

          Todd Lipcon added a comment -

          It looks like this fix actually introduced some other problems... with this patch a lot of tests time out on trunk (e.g. TestReplication). Some code must have been depending on side-effect modifications inside BlockReceiver.

          Todd Lipcon added a comment -

          Reopening, I'm going to ask Eli to revert this patch until we figure out what's going on.

          Eli Collins added a comment -

          I've reverted this from trunk and branch 22. Thanks for finding and debugging this, Todd.

          Todd Lipcon added a comment -

          Take two!

          The issue was that we actually depended on BlockReceiver to mutate the block object. But we want to use the original block object when passing on the request to the next node in the pipeline. This version saves off a copy of the Block at the top of the function and uses that copy for the downstream pipeline node.

          TestReplication and TestPipelines both pass with this new version of the patch. I'm going to run all the tests just to be safe this time.
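          A rough sketch of that "take two" approach, again with the simplified stand-ins from the earlier examples (the actual change is in the attached hdfs-1467-fixed.txt):

              // Sketch only: save a copy of the block as received, let BlockReceiver
              // keep mutating the original (other code depends on that side effect),
              // and forward the saved copy to the downstream mirror.
              class OpWriteBlockSketch {
                void opWriteBlock(Block block, long newGs, boolean hasMirror) {
                  Block originalBlock = new Block(block.blockId, block.generationStamp);

                  new BlockReceiver(block, newGs);  // still mutates 'block' as before

                  if (hasMirror) {
                    // The mirror is asked to append to the replica it actually has,
                    // i.e. the one with the old generation stamp.
                    forwardToMirror(originalBlock);
                  }
                }

                void forwardToMirror(Block b) {
                  System.out.println("forwarding blk_" + b.blockId + "_" + b.generationStamp);
                }
              }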

          Todd Lipcon added a comment -

          Test results:
          [junit] Test org.apache.hadoop.hdfs.TestHDFSTrash FAILED (timeout)
          [junit] Test org.apache.hadoop.hdfs.server.namenode.TestStorageRestore FAILED
          [junit] Test org.apache.hadoop.hdfs.server.balancer.TestBalancer FAILED
          [junit] Test org.apache.hadoop.hdfs.server.namenode.TestBlockTokenWithDFS FAILED
          [junit] Test org.apache.hadoop.hdfs.server.namenode.TestSaveNamespace FAILED
          [junit] Test org.apache.hadoop.hdfs.server.datanode.TestBlockRecovery FAILED

          These are all failing on trunk as well, so I think this can be re-committed after a review.

          Eli Collins added a comment -

          +1 The new patch looks good.

          Eli Collins added a comment -

          I've committed this to trunk and branch 22. Thanks, Todd.


            People

            • Assignee:
              Todd Lipcon
              Reporter:
              Todd Lipcon
            • Votes:
              0
              Watchers:
              4
