Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-168

Block report processing should compare gneration stamp

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      If a reported block has a different generation stamp then the one stored in the NameNode, the reported block will be considered as invalid. This is incorrect since blocks with larger generation stamp are valid.

      1. 5027_20090114.patch
        4 kB
        Tsz Wo Nicholas Sze

        Issue Links

          Activity

          Hide
          Tsz Wo Nicholas Sze added a comment -

          The issue is about the pre HDFS-265 append. Closing as "won't fix".

          Show
          Tsz Wo Nicholas Sze added a comment - The issue is about the pre HDFS-265 append. Closing as "won't fix".
          Hide
          Robert Chansler added a comment -

          Demoted: 20 shouldn't have to wait for this.

          Show
          Robert Chansler added a comment - Demoted: 20 shouldn't have to wait for this.
          Hide
          Nigel Daley added a comment -

          As discussed on core-dev@ (http://www.nabble.com/Hadoop-0.19.1-td21739202.html) we will disable append in 0.19.1. Moving these append related issues to 0.19.2.

          Show
          Nigel Daley added a comment - As discussed on core-dev@ ( http://www.nabble.com/Hadoop-0.19.1-td21739202.html ) we will disable append in 0.19.1. Moving these append related issues to 0.19.2.
          Hide
          dhruba borthakur added a comment -

          I would like to add this to the Blocker list so that we can get a fix for this one in the next release of 0.19.

          Show
          dhruba borthakur added a comment - I would like to add this to the Blocker list so that we can get a fix for this one in the next release of 0.19.
          Hide
          Konstantin Shvachko added a comment -

          I see at least 6 currently open issues, which seem to be different, but actually being about the same problem.
          HADOOP-4663
          HADOOP-5027 (states 2 problems)
          HADOOP-5133
          HADOOP-4379
          HADOOP-4692
          The problem is consistent handling incomplete blocks or another way to say it consistent handling block replication.
          Trying to fix the issues separately is imHo a mistake.
          I am advocating to step back, get a general view, design a proposal, make sure it works and then fix issues.
          Gradually changing the code may destabilize the main hdfs functionality, which is not related to synchs or appends.
          Although I am not personally in favor of gradual changes in this particular case, you can still go this path
          e.g. in a separate branch with a potential to merging it back to trunk when it is done.
          I think people should have a choice to sacrifice certain functionality in favor of reliability if they want to.
          Gradual changes close this alternative.
          I'll link this comment from the related issues.

          Show
          Konstantin Shvachko added a comment - I see at least 6 currently open issues, which seem to be different, but actually being about the same problem. HADOOP-4663 HADOOP-5027 (states 2 problems) HADOOP-5133 HADOOP-4379 HADOOP-4692 The problem is consistent handling incomplete blocks or another way to say it consistent handling block replication . Trying to fix the issues separately is imHo a mistake. I am advocating to step back, get a general view, design a proposal, make sure it works and then fix issues. Gradually changing the code may destabilize the main hdfs functionality, which is not related to synchs or appends. Although I am not personally in favor of gradual changes in this particular case, you can still go this path e.g. in a separate branch with a potential to merging it back to trunk when it is done. I think people should have a choice to sacrifice certain functionality in favor of reliability if they want to. Gradual changes close this alternative. I'll link this comment from the related issues.
          Hide
          Konstantin Shvachko added a comment -

          May be it will work, but I would prefer to avoid complicating name-node logic if it can be handled by data-nodes.
          My comment proposes some alternatives.

          Show
          Konstantin Shvachko added a comment - May be it will work, but I would prefer to avoid complicating name-node logic if it can be handled by data-nodes. My comment proposes some alternatives.
          Hide
          dhruba borthakur added a comment -

          I see your point. Thanks.

          One solution may be to make reportDiff() add a block to the "toRemove" data structure only if it belongs to a file that is not under construction. Will that work?

          Show
          dhruba borthakur added a comment - I see your point. Thanks. One solution may be to make reportDiff() add a block to the "toRemove" data structure only if it belongs to a file that is not under construction. Will that work?
          Hide
          Konstantin Shvachko added a comment -

          > Block reports from this datanode will not have the block blk_10_25

          The block report from the data-node will not have neither blk_10_25 nor blk_10_26 or any block with id 10.
          But the name-node has this block in its block map. It will compare the new block report with what it thinks the data-node should have and will find out that the data-node does not have this replica anymore see reportDiff(), and will remove that replica location. Do I miss anything?
          I am looking at the code: NN calls removeStoredBlock() for this block, which removes it from the DatanodeDescriptor and calls updateNeededReplications().

          Show
          Konstantin Shvachko added a comment - > Block reports from this datanode will not have the block blk_10_25 The block report from the data-node will not have neither blk_10_25 nor blk_10_26 or any block with id 10. But the name-node has this block in its block map. It will compare the new block report with what it thinks the data-node should have and will find out that the data-node does not have this replica anymore see reportDiff(), and will remove that replica location. Do I miss anything? I am looking at the code: NN calls removeStoredBlock() for this block, which removes it from the DatanodeDescriptor and calls updateNeededReplications() .
          Hide
          dhruba borthakur added a comment -

          >When I open a file for append the last block of the file is copied from the main storage to a tmp directory to be tre

          The block also gets a new generation stamp.

          > And the name-node should mark it as under-replicated triggering replication process

          Namenode does not trigger re-replication for blocks that belong to files that are being written (or appended to).

          Let me try to explian this with an example. Suppose a file has one block blk_10_25. The generation stamp of this block is 25. Suppose this block is not full and is the only block in the file. Suppose a client wants to append a few bytes to this file. The client instructs the datanode to stamp this block with a generation stamp of 26. The block now becomes blk_10_26 and is moved from the real block directory to "blockBeingWritten" directory (see HADOOP-4663). The namenode metadat is also updated to record the block for this file to be blk_10_26. Block reports from this datanode will not have the block blk_10_25, but this is not a problem blk_25_10 is not a valid block anymore. These details are explained in th design doc for appends (HADOOP-1700).

          Show
          dhruba borthakur added a comment - >When I open a file for append the last block of the file is copied from the main storage to a tmp directory to be tre The block also gets a new generation stamp. > And the name-node should mark it as under-replicated triggering replication process Namenode does not trigger re-replication for blocks that belong to files that are being written (or appended to). Let me try to explian this with an example. Suppose a file has one block blk_10_25. The generation stamp of this block is 25. Suppose this block is not full and is the only block in the file. Suppose a client wants to append a few bytes to this file. The client instructs the datanode to stamp this block with a generation stamp of 26. The block now becomes blk_10_26 and is moved from the real block directory to "blockBeingWritten" directory (see HADOOP-4663 ). The namenode metadat is also updated to record the block for this file to be blk_10_26. Block reports from this datanode will not have the block blk_10_25, but this is not a problem blk_25_10 is not a valid block anymore. These details are explained in th design doc for appends ( HADOOP-1700 ).
          Hide
          Konstantin Shvachko added a comment -

          Since this is about block reports in the context of appends the following question should be appropriate here. May be also related to "tmp" directories discussion in HADOOP-4663.
          When I open a file for append the last block of the file is copied from the main storage to a tmp directory to be treated as if it is just being created. Then on the next blockReport since the block is not in the main storage it will not be reported to the name-node. And the name-node should mark it as under-replicated triggering replication process. Replication will not be able to proceed as all replicas on other machines are in the same state, that is in tmp directory. Is that a problem?
          The main problem here I think is that the name-node removes the location of the replica if it was not reported during a block report. If all 3 nodes report that they don't have the block, the block will become not readable, that is the name-node cannot give clients any locations of the block.

          Show
          Konstantin Shvachko added a comment - Since this is about block reports in the context of appends the following question should be appropriate here. May be also related to "tmp" directories discussion in HADOOP-4663 . When I open a file for append the last block of the file is copied from the main storage to a tmp directory to be treated as if it is just being created. Then on the next blockReport since the block is not in the main storage it will not be reported to the name-node. And the name-node should mark it as under-replicated triggering replication process. Replication will not be able to proceed as all replicas on other machines are in the same state, that is in tmp directory. Is that a problem? The main problem here I think is that the name-node removes the location of the replica if it was not reported during a block report. If all 3 nodes report that they don't have the block, the block will become not readable, that is the name-node cannot give clients any locations of the block.
          Hide
          dhruba borthakur added a comment -

          When a client is creating a block for the first time, the block is not being replicated by the NN. Only when the client is done writing to that block and the block is finalized does the NN start to replicate the block (if needed).

          The current design is to apply the above design to pre-existing block that is being opened for "append". The NN will not replicate this block until the appending-client is completely done with that block.

          I agree with Rob that the availability guarantees of a block being re-opened for write is not the same as a block that nobody is writing to. But "append" is a new feature and HADOOP-1700 states that this guarantee is acceptable. Rob's request is valid, but we can address that as an improvement over the current design and attack it in a seperate JIRA. Does that sound reasonable?

          This patch is imperative to make "appends" work as designed for 0.19.

          Show
          dhruba borthakur added a comment - When a client is creating a block for the first time, the block is not being replicated by the NN. Only when the client is done writing to that block and the block is finalized does the NN start to replicate the block (if needed). The current design is to apply the above design to pre-existing block that is being opened for "append". The NN will not replicate this block until the appending-client is completely done with that block. I agree with Rob that the availability guarantees of a block being re-opened for write is not the same as a block that nobody is writing to. But "append" is a new feature and HADOOP-1700 states that this guarantee is acceptable. Rob's request is valid, but we can address that as an improvement over the current design and attack it in a seperate JIRA. Does that sound reasonable? This patch is imperative to make "appends" work as designed for 0.19.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          As suggested by Rob, if the reported block is ignored in NN, the data available guarantee will not be as good as before. Specifically, the existing data stored in the last block being reopened may not be replicated. I think we need to work on the design first.

          Show
          Tsz Wo Nicholas Sze added a comment - As suggested by Rob, if the reported block is ignored in NN, the data available guarantee will not be as good as before. Specifically, the existing data stored in the last block being reopened may not be replicated. I think we need to work on the design first.
          Hide
          dhruba borthakur added a comment -

          I like Raghu's comment to make the warning message something like "This is an Unexpected error, please report back to the Hadoop development community" or something like that,

          Show
          dhruba borthakur added a comment - I like Raghu's comment to make the warning message something like "This is an Unexpected error, please report back to the Hadoop development community" or something like that,
          Hide
          Raghu Angadi added a comment -

          > The fix is described in the description and Dhruba's comment.

          all right.

          >> Btw, the patch just ignores such a block with a warning..., should it also delete such a block?
          >This case should not happen. If it happens, it indicates that there are bugs somewhere. Without knowing what is going on, it is better not delete the block.

          adding something like "this is unexpected" to the message might make admins and users report it back, if they notice it.

          Show
          Raghu Angadi added a comment - > The fix is described in the description and Dhruba's comment. all right. >> Btw, the patch just ignores such a block with a warning..., should it also delete such a block? >This case should not happen. If it happens, it indicates that there are bugs somewhere. Without knowing what is going on, it is better not delete the block. adding something like "this is unexpected" to the message might make admins and users report it back, if they notice it.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          > I meant all the changes go through NN.

          It is currently going through NN. BTW, you may want to take a look the javadoc in LeaseManager. It describes how the generation stamps are updated.

          > Note that a patch was attached but the fix is not described anywhere in the jira. Finally I read the patch.

          The fix is described in the description and Dhruba's comment. If you need every detail of the fix, I think reading the patch is the best way. I am glad that you have already done so.

          > Looking at the patch now, it looks like you do (correctly) consider it an error if reported gs is larger than stored gs.From the description of the jira, I thought NN might consider new reported gs as the correct one.

          We are considering the case that the blocks are being written. Sorry that it is not clear.

          > Btw, the patch just ignores such a block with a warning..., should it also delete such a block?

          This case should not happen. If it happens, it indicates that there are bugs somewhere. Without knowing what is going on, it is better not delete the block.

          Show
          Tsz Wo Nicholas Sze added a comment - > I meant all the changes go through NN. It is currently going through NN. BTW, you may want to take a look the javadoc in LeaseManager. It describes how the generation stamps are updated. > Note that a patch was attached but the fix is not described anywhere in the jira. Finally I read the patch. The fix is described in the description and Dhruba's comment . If you need every detail of the fix, I think reading the patch is the best way. I am glad that you have already done so. > Looking at the patch now, it looks like you do (correctly) consider it an error if reported gs is larger than stored gs.From the description of the jira, I thought NN might consider new reported gs as the correct one. We are considering the case that the blocks are being written. Sorry that it is not clear. > Btw, the patch just ignores such a block with a warning..., should it also delete such a block? This case should not happen. If it happens, it indicates that there are bugs somewhere. Without knowing what is going on, it is better not delete the block.
          Hide
          Konstantin Shvachko added a comment -

          > nextGenerationStamp(Block) should not update the block information in the NameNode since the update may fail.

          The update has already failed, that is why you do the recovery in the first place, right? So NN already may have a wrong generation stamp. I still don't understand why nextGenerationStamp(Block) cannot change Block's gen stamp.

          Another thing if you change blockReport to understand generation stamps, then the same should be done for blockReceived().

          I have always been under the impression that generation stamp plays role only during lease recovery, and that blockReport and blockReceived distinguish blocks by their id = <block#> + <genStamp>. I would rather keep it that simple if possible.

          Show
          Konstantin Shvachko added a comment - > nextGenerationStamp(Block) should not update the block information in the NameNode since the update may fail. The update has already failed, that is why you do the recovery in the first place, right? So NN already may have a wrong generation stamp. I still don't understand why nextGenerationStamp(Block) cannot change Block's gen stamp. Another thing if you change blockReport to understand generation stamps, then the same should be done for blockReceived() . I have always been under the impression that generation stamp plays role only during lease recovery, and that blockReport and blockReceived distinguish blocks by their id = <block#> + <genStamp> . I would rather keep it that simple if possible.
          Hide
          Raghu Angadi added a comment -

          > What do you mean "done by NN"?

          I meant all the changes go through NN.

          Note that a patch was attached but the fix is not described anywhere in the jira. Finally I read the patch.

          Looking at the patch now, it looks like you do (correctly) consider it an error if reported gs is larger than stored gs.From the description of the jira, I thought NN might consider new reported gs as the correct one. Btw, the patch just ignores such a block with a warning..., should it also delete such a block?

          Show
          Raghu Angadi added a comment - > What do you mean "done by NN"? I meant all the changes go through NN. Note that a patch was attached but the fix is not described anywhere in the jira. Finally I read the patch. Looking at the patch now, it looks like you do (correctly) consider it an error if reported gs is larger than stored gs.From the description of the jira, I thought NN might consider new reported gs as the correct one. Btw, the patch just ignores such a block with a warning..., should it also delete such a block?
          Hide
          Tsz Wo Nicholas Sze added a comment -

          > I thought all the gen-stamp changes are done by NN.

          What do you mean "done by NN"?

          > It is a bit scary if generation can be changed by clients and datanodes...

          Clients cannot change gs. They only can request a update. Then, the selected primary datanode will request a generation stamp from the namenode.

          Show
          Tsz Wo Nicholas Sze added a comment - > I thought all the gen-stamp changes are done by NN. What do you mean "done by NN"? > It is a bit scary if generation can be changed by clients and datanodes... Clients cannot change gs. They only can request a update. Then, the selected primary datanode will request a generation stamp from the namenode.
          Hide
          Raghu Angadi added a comment -

          > The client data-node should call name-node to obtain a new generation stamp any way, I believe this what Raghu meant?

          Yes. I thought all the gen-stamp changes are done by NN.

          It is a bit scary if generation can be changed by clients and datanodes. That would be very different from rest of HDFS metadata management.

          Show
          Raghu Angadi added a comment - > The client data-node should call name-node to obtain a new generation stamp any way, I believe this what Raghu meant? Yes. I thought all the gen-stamp changes are done by NN. It is a bit scary if generation can be changed by clients and datanodes. That would be very different from rest of HDFS metadata management.
          Hide
          Tsz Wo Nicholas Sze added a comment - - edited

          > May be the problem is not with the block report, but rather with nextGenerationStamp(Block), ...

          nextGenerationStamp(Block) should not update the block information in the NameNode since the update may fail. If the NameNode block gs is updated but the DataNode block gs is not and the update fails, the the block in the DataNode will be considered as invalid.

          > I also would like to see a unit test catching this bug.

          Sure, working on one.

          Show
          Tsz Wo Nicholas Sze added a comment - - edited > May be the problem is not with the block report, but rather with nextGenerationStamp(Block), ... nextGenerationStamp(Block) should not update the block information in the NameNode since the update may fail . If the NameNode block gs is updated but the DataNode block gs is not and the update fails, the the block in the DataNode will be considered as invalid. > I also would like to see a unit test catching this bug. Sure, working on one.
          Hide
          Konstantin Shvachko added a comment - - edited

          May be the problem is not with the block report, but rather with nextGenerationStamp(Block), which does not update generation stamp for the block. The client data-node should call name-node to obtain a new generation stamp any way, I believe this what Raghu meant? So why do we not changing the gen stamp of the block.
          I also would like to see a unit test catching this bug.

          Show
          Konstantin Shvachko added a comment - - edited May be the problem is not with the block report, but rather with nextGenerationStamp(Block) , which does not update generation stamp for the block. The client data-node should call name-node to obtain a new generation stamp any way, I believe this what Raghu meant? So why do we not changing the gen stamp of the block. I also would like to see a unit test catching this bug.
          Hide
          dhruba borthakur added a comment -

          This code looks good to me. There is already a unit test TestFleAppend2 that tests "append". Any clues on why this bug is not caught by that unit test?

          Show
          dhruba borthakur added a comment - This code looks good to me. There is already a unit test TestFleAppend2 that tests "append". Any clues on why this bug is not caught by that unit test?
          Hide
          Tsz Wo Nicholas Sze added a comment -

          5027_20090114.patch: check generation stamps in block report processing.

          Show
          Tsz Wo Nicholas Sze added a comment - 5027_20090114.patch: check generation stamps in block report processing.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          > Doesn't NN already know what the latest genstamp should be?

          When a block is being updated in lease recovery, some replicators might have a gs larger then the one stored in the NameNode. If the update fails, the latest gs may not be committed to the NameNode.

          Show
          Tsz Wo Nicholas Sze added a comment - > Doesn't NN already know what the latest genstamp should be? When a block is being updated in lease recovery, some replicators might have a gs larger then the one stored in the NameNode. If the update fails, the latest gs may not be committed to the NameNode.
          Hide
          Raghu Angadi added a comment -

          Doesn't NN already know what the latest genstamp should be? All the genstamp changes are made by NN.

          Show
          Raghu Angadi added a comment - Doesn't NN already know what the latest genstamp should be? All the genstamp changes are made by NN.
          Hide
          dhruba borthakur added a comment -

          A block-report processing should ignore blocks if it is the last block in an file that is under construction.

          Show
          dhruba borthakur added a comment - A block-report processing should ignore blocks if it is the last block in an file that is under construction.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          This is not a problem if both sync and append are not used.

          Show
          Tsz Wo Nicholas Sze added a comment - This is not a problem if both sync and append are not used.

            People

            • Assignee:
              Unassigned
              Reporter:
              Tsz Wo Nicholas Sze
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development