Hadoop HDFS
  HDFS-142

In 0.20, move blocks being written into a blocksBeingWritten directory

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.20-append
    • Fix Version/s: 0.20-append, 0.20.205.0
    • Component/s: None
    • Labels:
      None
    • Tags:
      hbase

      Description

      Before 0.18, when the Datanode restarts, it deletes files under the data-dir/tmp directory since these files are not valid anymore. But in 0.18 it moves these files to the normal directory, incorrectly making them valid blocks. One of the following would work:

      • remove the tmp files during upgrade, or
      • if the files under /tmp are in pre-18 format (i.e. no generation), delete them.

      Currently, the effect of this bug is that these files end up failing block verification and eventually get deleted, but they cause incorrect over-replication at the namenode before that.

      Also, it looks like our policy regarding the treatment of files under tmp needs to be defined better. Right now there are probably one or two more bugs with it. Dhruba, please file them if you remember.

      1. appendFile-recheck-lease.txt
        6 kB
        Todd Lipcon
      2. appendQuestions.txt
        8 kB
        dhruba borthakur
      3. deleteTmp_0.18.patch
        2 kB
        dhruba borthakur
      4. deleteTmp.patch
        6 kB
        dhruba borthakur
      5. deleteTmp2.patch
        13 kB
        dhruba borthakur
      6. deleteTmp5_20.txt
        32 kB
        dhruba borthakur
      7. deleteTmp5_20.txt
        35 kB
        dhruba borthakur
      8. dont-recover-rwr-when-rbw-available.txt
        24 kB
        Todd Lipcon
      9. handleTmp1.patch
        9 kB
        dhruba borthakur
      10. HDFS-142_20.patch
        54 kB
        Nicolas Spiegelberg
      11. HDFS-142_20-append2.patch
        84 kB
        Nicolas Spiegelberg
      12. HDFS-142.20-security.1.patch
        88 kB
        Jitendra Nath Pandey
      13. HDFS-142.20-security.2.patch
        89 kB
        Jitendra Nath Pandey
      14. hdfs-142-commitBlockSynchronization-unknown-datanode.txt
        2 kB
        Todd Lipcon
      15. HDFS-142-deaddn-fix.patch
        3 kB
        Nicolas Spiegelberg
      16. HDFS-142-finalize-fix.txt
        7 kB
        sam rash
      17. hdfs-142-minidfs-fix-from-409.txt
        2 kB
        Todd Lipcon
      18. HDFS-142-multiple-blocks-datanode-exception.patch
        3 kB
        Karthik Ranganathan
      19. hdfs-142-recovery-reassignment-and-bbw-cleanup.txt
        19 kB
        Todd Lipcon
      20. hdfs-142-testcases.txt
        8 kB
        Todd Lipcon
      21. hdfs-142-testleaserecovery-fix.txt
        5 kB
        Todd Lipcon
      22. recentInvalidateSets-assertion-fix.txt
        1 kB
        Todd Lipcon
      23. recover-rbw-v2.txt
        27 kB
        Todd Lipcon
      24. testfileappend4-deaddn.txt
        2 kB
        Todd Lipcon
      25. validateBlockMetaData-synchronized.txt
        0.9 kB
        Todd Lipcon

        Issue Links

          Activity

          dhruba borthakur added a comment -

          I would probably take the first approach of deleting the tmp files during the upgrade. Related JIRAs are HADOOP-3677 and HADOOP-2656

          dhruba borthakur added a comment -

          recoverTransitionRead invokes doUpgrade(). I plan to put in a check in doUpgrade() to remove the tmp directories from all storage directories. It is not very elegant, because doUpgrade() now needs to know that each storage dir contains a "tmp" directory. If anybody has any other suggestion for the code layout, please let me know.

          Raghu Angadi added a comment -

          Why not simply delete the block files that are not in 'genstamp' format? This does not depend on upgrade at all. This is how the DN handles assigning the default generation number initially.

          dhruba borthakur added a comment -

          Actually, on second thought, this does not seem to be a bug. For example, let's say that we have a cluster that is running the 0.20 release of Hadoop. When you restart a datanode, all files in the "tmp" directory need to be moved to the real block directory, the reason being that a FileSystem.sync() call demands that data once written to the datanode should persist. Thus, the problem you are seeing is not related to cluster upgrade.

          > these files end up failing block verification and eventually get deleted. But cause incorrect over-replication at the namenode before that.

          Maybe we should try to improve the situation here. Is it possible to put the blocks that got moved from "tmp" to the real block directory at the head of the block-to-verify list so that their CRC validation occurs first?

          Another optimization (that I do not particularly like) is that the client can tell the datanode if the client has invoked FileSystem.sync(). The Datanode can persist this information. At datanode startup time, the Datanode moves only those blocks that were marked as "synced" earlier.

          Raghu Angadi added a comment -

          Dhruba, these files are invalid at least on 0.17, and should not be used after the next restart, no?

          Raghu Angadi added a comment -

          Block verification failure was actually a fortunate side effect that detected this bug. It is not a fix. Marking invalid blocks valid looks to me like a real data corruption and should not be allowed at all.

          dhruba borthakur added a comment -

          Yeah, I agree. In the general case, the generation stamp incompatibility will cause these invalid blocks to get deleted. But the 0.18 upgrade allocates a generation stamp of 0 for all old blocks. This is a bug that should be fixed.

          dhruba borthakur added a comment -

          Here is the first version of the patch.

          Raghu Angadi added a comment -

          Some description of what the patch does would be helpful for review.

          dhruba borthakur added a comment -

          When a Datanode restarts, it attempts to move all files from the "tmp" directory to the block directory.

          What this patch does: If a file does not have a generation stamp associated with it, then it is not moved. The generation stamp is extracted from the corresponding metafile. The metafile could be in the tmp directory itself or it could already be in the block directory.
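
          A rough, self-contained sketch of that check (not the code from the attached patch): a block file under tmp would only be eligible for promotion if a meta file carrying a generation stamp can be found for it in either directory. The file-name patterns blk_<id> and blk_<id>_<genstamp>.meta, and the data/tmp and data/current paths, are assumptions about the on-disk layout.

          {code:java}
          import java.io.File;

          /** Sketch only: decide whether a block file found under tmp may be promoted. */
          public class TmpBlockCheck {

            /** True if a meta file with a generation stamp (blk_<id>_<gs>.meta) exists in any of the dirs. */
            static boolean hasGenerationStamp(String blockFileName, File... dirs) {
              for (File dir : dirs) {
                File[] files = dir.listFiles();
                if (files == null) {
                  continue;
                }
                for (File f : files) {
                  String name = f.getName();
                  if (name.startsWith(blockFileName + "_") && name.endsWith(".meta")) {
                    return true; // e.g. blk_123_1001.meta carries genstamp 1001
                  }
                }
              }
              return false;
            }

            public static void main(String[] args) {
              File tmp = new File("data/tmp");         // hypothetical paths
              File current = new File("data/current");
              File[] candidates = tmp.listFiles();
              if (candidates == null) {
                return;
              }
              for (File blk : candidates) {
                String name = blk.getName();
                if (!name.matches("blk_-?\\d+")) {
                  continue; // skip meta files and anything unexpected
                }
                boolean promote = hasGenerationStamp(name, tmp, current);
                System.out.println((promote ? "promote " : "skip (no genstamp): ") + name);
              }
            }
          }
          {code}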

          dhruba borthakur added a comment -

          Hi Raghu, do you have any feedback on the patch I posted earlier? Thanks

          Raghu Angadi added a comment -

          I will review this today. Did you get a chance to file the bug on the rest of the issues related to files under the tmp directory? Could you link that bug to this one? I think these are serious issues since these are essentially data corruptions.

          dhruba borthakur added a comment -

          The other bug is HADOOP-4702.

          Raghu Angadi added a comment -

          Hmm, these are all side effects of the basic problem: the Datanode does not know which files under /tmp are good and which are not. Before 0.18, all the files were bad.

          My preference would be to fix this problem (in addition to HADOOP-4702), since it affects other things. There is another case where, after an error, lease recovery succeeds one hour later on the datanode since the DN thinks its temporary files are good, resulting in another corrupt block. There could be more.

          dhruba borthakur added a comment -

          I agree that we need to solve HADOOP-4702.

          > There is another case where, after an error, lease recovery succeeds one hour later on the datanode since the DN thinks its temporary files are good, resulting in another corrupt block.

          Can you please explain this scenario in greater detail? If the client encountered an error and exited, then the write was never completed. How did it create a corrupt block?

          Raghu Angadi added a comment -

          Nicholas has more details on the problem and thinks HADOOP-3574 would fix it. But I am not sure this approach of fixing each of the side effects as they are detected is a good one.

          It looks very error-prone and wrong for the datanode not to know which files under tmp are invalid since 0.18. Do you not see that as a real problem?

          Tsz Wo Nicholas Sze added a comment -

          Hi Raghu, I have not been following this issue closely enough. Why does the "datanode not know which files under tmp are invalid since 0.18"?

          Raghu Angadi added a comment -

          > Why does the "datanode not know which files under tmp are invalid since 0.18"?
          It just does not. That's why it moves them to the normal directory (this jira and HADOOP-4702), and incorrectly 'recovers' them (HADOOP-3574), etc.

          Why? Because it does not have anything to distinguish between valid and invalid. Note that before 0.18 everything there was invalid. Since you guys implemented it, it is better for you to explain, I guess.

          Raghu Angadi added a comment -

          Regarding the patch:

          • Did you test the patch?
          • It looks like it still moves meta files to the data directory (removes the actual block file).
          • Not sure why we check the datanode's data directory for the metadata file; even if it exists, it would mostly be in some subdirectory under datadir.
            • Related to the above: from the logic it looks like this might leave more than one metadata file in
          dhruba borthakur added a comment -

          Thanks Raghu. I will restart work on this one.

          dhruba borthakur added a comment -

          When a datanode receives a replication request, it starts the block in a separate directory. These blocks will be cleaned up when the datanode restarts. The general algorithm is as follows:

          1. When a datanode receives a block from a client, it starts the block in the datadir/tmp directory.
          2. When a datanode receives a block as part of a replication request, it starts the block in the datadir/tmp_replication directory.
          3. When a block is finalized, it moves from either of the above two temporary directories to its final location.
          4. When a datanode restarts, it moves all the blocks from the datadir/tmp directory to their final location. These blocks were part of writes by clients, and the client-namenode genstamp protocol will ensure that the most up-to-date replicas of the block are maintained.
          5. When a datanode restarts, the blocks in the datadir/tmp_replication directory are removed. No recovery is needed for these blocks.

          The question that remains is what we need to do for blocks that are moved as part of the rebalancer. I think the blocks that the rebalancer moves should go into tmp_replication as well.
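
          A minimal sketch of the restart step in this proposal, assuming flat tmp, tmp_replication and current directories directly under the data dir (the real code would also have to handle meta files and the nested block-directory layout):

          {code:java}
          import java.io.File;
          import java.io.IOException;

          /** Sketch only: what the datanode would do with the two temporary directories on restart. */
          public class RestartRecoverySketch {

            static void recoverOnStartup(File dataDir) throws IOException {
              File tmp = new File(dataDir, "tmp");                        // client writes
              File tmpReplication = new File(dataDir, "tmp_replication"); // replication / rebalancing
              File current = new File(dataDir, "current");                // finalized blocks (placeholder name)

              // Client-written blocks may have been sync()-ed, so they are promoted; the
              // client-namenode genstamp protocol is expected to weed out stale replicas later.
              File[] writes = tmp.listFiles();
              if (writes != null) {
                for (File f : writes) {
                  if (!f.renameTo(new File(current, f.getName()))) {
                    throw new IOException("could not promote " + f);
                  }
                }
              }

              // Blocks that were only being replicated can be fetched again from another
              // datanode, so they are simply discarded.
              File[] replications = tmpReplication.listFiles();
              if (replications != null) {
                for (File f : replications) {
                  f.delete();
                }
              }
            }
          }
          {code}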

          Raghu Angadi added a comment -

          A few comments:

          • Since only the blocks that were part of writes could be valid, is it better to create them in a "tmp_valid" or "tmp_writes" directory?
          • The rest of the blocks in tmp will be deleted automatically. This fixes the rebalancer issue as well. We cannot leave any possible corruptions.
          • It is very crucial that any files left over should not cause the kind of corruption that we have already seen.
            • A simpler and better alternative for 0.18.3 might be to just say sync() is not supported and remove everything under tmp (same as 0.17).
            • This way it gives this jira more time to make sure there are no cases where it could cause corruption.
          • The patch moves all the files to the top-level directory. But datanode data directories are organized to limit each directory to 64 blocks and 64 subdirectories. The current policy violates that. Over time datanodes could end up having a very large number of files in the top-level directory.
            • This is not very important to fix in 0.18, but it should be fixed in 0.19 or another line. A separate jira might help.
          Konstantin Shvachko added a comment -

          Creating multiple "tmp" directories seems to be an indication of a design flaw.
          The question is: why do we need to place blocks that will be recovered to the main storage no matter what in a temporary directory, instead of storing them in the main storage from the very beginning?
          Also, I don't see any "proof of correctness" here. How do we know this will not lead to the same or other problems the way the initial code did?

          dhruba borthakur added a comment -

          When a client is writing to a block on the datanode, it is good to create it in the "tmp" directory instead of the real block directory. This means that the datanode has a record that these blocks are not yet finalized and might need some recovery. In the typical case, existence of a block in the "tmp" directory means that the block is not yet confirmed to the namenode. In the current code, it is the client that actually triggers the recovery.

          Like Raghu proposed, it might be a good idea to name the original "tmp" directory as "tmp_writes". In fact, the blocks here are not temporary any more. It can be named as "blocks_inflight_dir" or something like that. However, if we change it now, won't we need an upgrade that can handle the case when the "tmp" directory existed?

          @Raghu: if we disable "sync" on 0.18.3, that will be an API incompatibility from 0.18.2. Is that acceptable?

          Raghu Angadi added a comment -

          > @Raghu: if we disable "sync" on 0.18.3, that will be an API incompatibility from 0.18.2. Is that acceptable?

          I think that is ok. Of course we don't want to do such a thing often, but I am pretty sure most users would agree it is better to be safe w.r.t. data. At Y! no one is advised to use sync() and we are running with such a fix.

          Konstantin Shvachko added a comment -

          > In fact, the blocks here are not temporary any more.

          This is the key. They are not temporary, so why do we keep them in the tmp directory? Why can't DNs report them to the name-node? It is just a matter of reporting the right length, right?

          > disable "sync" on 0.18.3, will be an API incompatibility from 0.18.2.

          This is not an incompatibility. The feature does not work correctly in 0.18.2, and it will still not work correctly in 0.18.3, with the advantage of not causing all the problems.
          I think removing everything under tmp as Raghu proposes is the right solution for 0.18.3.
          We should talk about the "real" fix in terms of 0.19 and up. This seems to be a consensus among colleagues around me.

          dhruba borthakur added a comment -

          The namenode always discards replicas that are smaller in size than other replicas. It always keeps the largest-size replica as the valid one. So, can this bug affect data integrity?

          dhruba borthakur added a comment -

          I will make a patch for 0.18, as Raghu/Konstantin suggested, that disables the sync API. Will port it shortly.

          dhruba borthakur added a comment -

          Patch for 0.18. It throws an exception from the fsync() call and deletes all blocks from the "tmp" directory of the datadir.
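
          A minimal sketch of the behaviour described for this 0.18 patch, with placeholder names (the real change lives in the DFS client and datanode code, not in a standalone class like this):

          {code:java}
          import java.io.IOException;

          /** Sketch only: sync is rejected outright while tmp blocks are simply discarded on restart. */
          public class DisabledSyncSketch {

            /** Stand-in for the client-side sync entry point. */
            public void fsync() throws IOException {
              throw new IOException("fsync is not supported in this release; "
                  + "blocks under tmp are deleted when the datanode restarts");
            }
          }
          {code}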

          Raghu Angadi added a comment -

          +1 from me for 0.18.3 fix. We might need to disable one or two unit tests.

          dhruba borthakur added a comment -

          Yes, some unit tests have to be disabled for 0.18.

          I am still worried that the fix for 0.18.3 causes an API change. Maybe hbase has to use it. Although I made the patch for 0.18, I am -1 for putting it into the general 0.18 release. I can put in config variables to disable "fsync" if needed.

          Konstantin Shvachko added a comment -

          If you don't throw an exception from fsync() then there is no API change. In this case fsync() will work as long as data-nodes/clients don't fail; it's just that some sync-ed data may not survive cluster restarts. So hbase people will be able to write their programs with fsync(), but it will only be guaranteed to work when they upgrade to newer versions where this issue is going to be fixed.

          Konstantin Shvachko added a comment -

          Maybe it makes sense to create a separate jira for the fix for 0.18.3 so that we can close it when it's done and continue the discussion about the real fix here. Or vice versa.

          Raghu Angadi added a comment -

          Filed HADOOP-4997 for a temporary workaround.

          Owen O'Malley added a comment -

          For 0.18.3, we absolutely cannot throw on fsync. It would be much better to silently do nothing. There are too many applications that throwing would break.

          Raghu Angadi added a comment -


          Michael Stack from HBase confirmed that HBase on 0.18 does not use sync.

          dhruba borthakur added a comment -

          Now that the 0.18.3 issues are handled by HADOOP-4997, let's discuss what we need to do for 0.19 and above. I propose that blocks that are created by client writes be created in the "tmp" directory, whereas blocks created by replication requests be created in the tmp_replication directory. On datanode restart, the blocks in the "tmp" directory are reclaimed whereas the blocks in the tmp_replication directory are discarded.

          The reason I propose to start client-generated blocks in the "tmp" directory (instead of the real block directory) is that these blocks are not yet confirmed to the namenode. They are still being written and ideally should not be included in any block report(s).

          dhruba borthakur added a comment -

          I had an offline discussion with Sanjay, Rob Chansler, Nicholas and partly with Konstantin. Here is the summary:

          This JIRA will go into 0.19. (For 0.18.3, the equivalent work will be done via HADOOP-4997).

          The proposal is that blocks that are created by client writes be created in the "blocks_being_written" directory, whereas blocks created by replication requests be created in the "blocks_being_replicated" directory. On datanode restart, the blocks in the "tmp" directory and the "blocks_being_replicated" directory are deleted, whereas the blocks in the "blocks_being_written" directory are recovered and promoted to the real block directory.

          Konstantin Shvachko added a comment -

          I think the proposed approach does not solve the problem.
          It improves the current state in a sense, but does not eliminate the problem completely.

          The idea of promoting blocks from "tmp" to the real storage is necessary to support sync(). We are trying to cover the following sequence of events:

          1. a client starts writing to a block and says sync();
          2. the data-node does the write and the sync() and then fails;
          3. the data-node restarts and the sync-ed block should appear as a valid block on this node, because the semantics of sync demand that the sync-ed data survive failures of clients, data-nodes and the name-node.

          It seems natural to promote blocks from tmp to the real storage during startup, but this caused problems because, together with the sync-ed blocks, the data-node also promotes all other potentially incomplete blocks from tmp.

          One source of incomplete blocks in tmp is the internal block replication initiated by the name-node but not completed on the data-node due to a failure.

          (I) The proposal is to divide tmp into 2 directories
          "blocks_being_replicated" and "blocks_being_written". This excludes partially replicated
          blocks from being promoted to the real storage during data-node restarts.

          But this does not cover another source of incomplete blocks, which is the regular block writing by a client. It also can fail, remain incomplete, and will be promoted to the main storage during data-node restart.
          The question is why do we not care about these blocks?
          Why they cannot cause the same problems as the ones that are being replicated?

          Suppose I do not use sync-s. The incomplete blocks will still be promoted to the real storage, will be reported to the name-node, and the name-node will have to process them and finally remove most of them. Isn't that a degradation in performance?

          (II) I'd rather consider dividing into two directories, one having "transient" and another having "persistent" blocks. The transient blocks should be removed during startup, and the persistent ones should be promoted into the real storage. Once sync-ed, a block should be moved into the persistent directory.

          (III) Another variation is to promote sync-ed (persistent) blocks directly to the main storage. The problem here is to deal with blockReceived and blockReports, which I feel can be solved.

          (IV) Yet another approach is to keep the storage structure unchanged and write each sync-ed (finalized) portion of the block into the main storage by appending that portion to the real block file. This will require extra disk IO for merging files (instead of renaming) but will let us discard everything that is in tmp as we do now.

          What we are trying to do with the directories is to assign properties to the block replicas and make them survive crashes. Previously there were just 2 properties, final and transient. So we had 2 directories: tmp and main. Now we need a third property, which says the block is persistent but not finalized yet. So we tend to introduce yet another directory. I think this is too straightforward, because there are plenty of other approaches to implementing boolean properties on entities.
          Why are we not considering them?

          Also worth noting this issue directly affects appends, because a replica being appended is first copied to the tmp directory and then treated as described above.

          dhruba borthakur added a comment -

          I think the issue that Konstantin is raising is based on the assumption that data that is not "synced" by a client should not appear in a file. This assumption is not true as specified in the design document for Appends. The guarantee is that data that is written prior to a "sync" call will be seen by all new readers of the file.

          Just like in any UNIX-y system, data that is not synced can still be seen by other concurrent readers.

          dhruba borthakur added a comment -

          This patch causes all blocks that are part of a client write request to be stored in the blocksBeingWritten directory. Blocks that are part of replication requests are created in the blocksBeingReplicated directory.

          A datanode restart removes all blocks from the "tmp" and blocksBeingReplicated directories and moves all blocks from the blocksBeingWritten directory into the real block directory.
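
          The write-path half of that description could look roughly like the following helper, which picks the staging directory based on why the replica is being received; only the two directory names come from the comment, the rest is illustrative:

          {code:java}
          import java.io.File;

          /** Sketch only: choose where a newly received replica file is created. */
          public class ReplicaPlacementSketch {

            enum WriteSource { CLIENT_WRITE, REPLICATION_REQUEST }

            static File stagingDirFor(File dataDir, WriteSource source) {
              switch (source) {
                case CLIENT_WRITE:
                  // May be recovered and promoted if the datanode restarts.
                  return new File(dataDir, "blocksBeingWritten");
                case REPLICATION_REQUEST:
                default:
                  // Safe to discard on restart; the namenode will re-replicate.
                  return new File(dataDir, "blocksBeingReplicated");
              }
            }

            public static void main(String[] args) {
              File dataDir = new File("data");
              System.out.println(stagingDirFor(dataDir, WriteSource.CLIENT_WRITE));
              System.out.println(stagingDirFor(dataDir, WriteSource.REPLICATION_REQUEST));
            }
          }
          {code}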

          Konstantin Shvachko added a comment -

          I think the assumption Dhruba is making about my motivation is not true.
          The problem is that the proposed solution does not work.
          And I said that in the first line of that long comment above.

          dhruba borthakur added a comment -

          Hi Konstantin, can you please explain (sorry, I was not able to understand it from your description even after reading it many times) why the proposed solution and patch do not work?

          Robert Chansler added a comment -

          There is broad consensus (Dhruba, Nicholas, Sanjay, Konstantin, Rob) that this is not appropriate for 0.18.3. HADOOP-4997 will substitute for this in 0.18.3.

          Konstantin Shvachko added a comment -

          Hi Dhruba. There were several issues (we actually had a data loss) caused by promoting blocks from tmp to main storage. One of them is HADOOP-4702, which was probably filed when you were on vacation.
          The problem was that partial blocks were promoted to the main storage even though they were transient.
          I am arguing that your solution does not eliminate this condition. blocksBeingWritten can still contain transient blocks, and they will still be promoted to the main storage. A simple example is when you write a block (without using sync()) and the DN fails in the middle, leaving an incomplete (transient) block in the blocksBeingWritten directory. This block will be moved to the main storage upon DN restart, although it should be treated exactly as the blocks in the blocksBeingReplicated directory are, because neither the block nor a part of it was ever finalized.
          Does it make more sense?

          dhruba borthakur added a comment -

          In the "append world" (i.e 0.19), the semantics that we provide is that data written by a writer may be seen by other readers even if the writer has not invoked "sync". HDFS makes every effort (within reasonable cost limits) to persist data that is written to a file. Moreover, the "append" protocol uses the generation stamp of the block to accurately determine stale blocks, thus it is safe to promote blocks from the "blocksBeingWritten" directory to the real block directory.

          The data corruption you have seen occured because the generation-stamp-update-procotol is not triggered during a block transfer request. This patch correctly handles block-trasfer-requests and should prevent the data corruption issue from occuring.

          Raghu Angadi added a comment -

          > The data corruption you have seen occurred because the generation-stamp-update protocol is not triggered
          > during a block transfer request. This patch correctly handles block-transfer requests and should prevent the data
          > corruption issue from occurring.

          Many different types of data corruption occurred recently with 0.18, mainly because of a combination of bugs.

          The corruption caused by the issue in this jira has little to do with the generation stamp for transfers. The primary cause is this:

          1. The DN promotes all files created in 0.17 from the /tmp directory, which it should never have done.
          2. When it moved the files, it did not generate a gen stamp for the metadata files.
          3. The DN reports those blocks as valid to the NN.
          4. Later the DN marks these files as corrupt since there is no metadata.

          The above is one of the biggest sources of corruption. There were various other bugs that contributed; many of these were fixed in 0.18.3.

          Konstantin Shvachko added a comment -

          Append or not, if we want files to be visible to other clients there should be logic which reads files in the tmp directory. And this is not related to promoting incomplete files. We do not have guarantees for the data to survive crashes if it has not been sync-ed, even if another client has seen it.

          > the "append" protocol uses the generation stamp to accurately determine stale blocks
          > it is safe to promote blocks from the "blocksBeingWritten" directory

          If the protocol worked correctly, it would be safe to promote any incomplete blocks, including those in blocksBeingReplicated, but it wasn't.

          I would rather not promote unsynced files because

          1. it turned out to be error-prone
          2. it adds performance overhead when sync or append is not used.
          dhruba borthakur added a comment -

          The append design explicitly states that the system should make every effort to persist data written to a file even if sync is not invoked. "sync" has some cost associated with it, and an application might not want to incur the cost at every write, but it would definitely like HDFS to make a best effort to persist the data it has written.

          > it turned to be error-prone

          The design is not error-prone; it was the implementation that had some shortcomings. In particular, blocks that are part of replication requests should not have been promoted. This is fixed by this patch. If you are aware of other bugs in this area that can cause problems, it would be really nice if you could list them or create a unit test to trigger that bug. I understand that the "append" protocol is complex, but when implemented correctly it should be fool-proof.

          >it adds performance overhead, when sync or append is not used.

          Are you concerned that renaming the block file from the blocksBeingWritten directory to the main block directory adds overhead to the write pipeline?

          Konstantin Shvachko added a comment -

          > The append design explicitly states that the system should make every effort to persist data

          This is listed under section "The non-goals of this design are:"
          Which design are we talking about anyway? The document attached to 1700 is 8 months behind the patch.

          > In particular, blocks that are part of replication requests should not have been promoted.

          Why? What makes them different from incomplete blocks that are a part of client creates? Same blocks.

          > it adds performance overhead, when sync or append is not used.

          I am concerned that incomplete blocks will be promoted, then sent (reported) to the name-node, then processed there, and finally most of them will be removed. It's the name-node overhead which is a concern, not the data-node.

          dhruba borthakur added a comment -

          > Which design are we talking about anyway? The document attached to 1700 is 8 months behind the patch.

          The design document for Appends is still a valid document. It is true that the patch took a long time to develop.

          > I am concerned that incomplete blocks will be promoted, then sent (reported) to the name-node, then processed there and finally most of them will be removed. It's the name-node overhead which is a concern not the data-node.

          Ok, so it is not about correctness, but rather a performance question. I will run some tests on how much this can add to the performance overhead and will report my findings soon. The reason I like promoting blocks to the real directory (only when the datanode crashes) is that this is data that an application has written, and I would rather save it than delete it. From my viewpoint, the system should make every effort to persist this data, rather than saying "ok, you did not invoke sync, so you lose your data". (I remember a discussion with Sameer saying that it would be nice to have every new block allocation at the namenode be persisted, and persisting the block list at the namenode is useless if the datanode anyway deletes blocks that were not closed.)

          Konstantin Shvachko added a comment -

          > Ok, so it is not about correctness, but rather a performance question.

          It is about both. You did not answer my questions.
          My point is that if the processing were correct, then incomplete blocks, whether generated by clients or during block replication, would be handled correctly; but they were not.
          So fixing the problem for a part of them (those generated during replication) will not solve the problem completely, because the client-generated incomplete blocks are still there.

          dhruba borthakur added a comment -

          >will not solve the problem completely because the client generated incomplete blocks are still there

          @Konstantin: it appears that you feel that applying this patch might still leave a hole. I am unable to visualize a scenario where this could cause a bug. I think the generation-stamp-protocol is adequate. I would really appreciate it if you can write a unit test that can demonstrate this bug (on trunk + the patch associated with this JIRA).

          Hairong Kuang added a comment -

          Is it possible that the DataNode leaves the blocks under tmp untouched at startup time? Instead it leaves those blocks for the lease recovery process to promote them. When a DataNode starts up, it reads the blocks under tmp and puts them into the ongoingCreates data structure. It then reports them to the NN. If the NN sees a tmp block that is not the last block of an under-construction file, it marks it as corrupt; otherwise, this is really an under-construction block and the NN adds it to the target set of the file. Later, when the file's lease expires, the NN will close the file and those blocks will be finalized.

          The idea is to restart the DataNode in the same state it was in when it went down. Promoting blocks at startup time opens the possibility of polluting dfs data.
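          (For illustration only: a hypothetical sketch of the name-node side of this proposal; the method and helper names below are assumed, not the actual 0.20 NameNode code.)

          // Hypothetical sketch: how the NN could treat a block reported from a DN's tmp directory.
          void processTmpBlock(DatanodeDescriptor dn, Block reported) {
            INodeFileUnderConstruction file = lookupUnderConstructionFile(reported); // assumed helper
            if (file == null || !isLastBlockOf(file, reported)) {
              markCorrupt(dn, reported);   // not a valid under-construction replica; schedule deletion
            } else {
              file.addTarget(dn);          // wait for lease expiry/recovery to finalize the block
            }
          }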

          Konstantin Shvachko added a comment -

          See related comment here
          dhruba borthakur added a comment -

          @Hairong: Your proposal should mostly work.

          But, I still do not understand the problem that can result while promoting blocks from "blocksBeingWritten" directory when datanode restarts. The generation-stamp-protocol takes care of distinguishing bad replicas from good replicas.

          The design is that the NN block report processing (i.e. addStoredBlock) should completely ignore blocks that are under construction.

          Hairong Kuang added a comment -

          I do believe that the generation-stamp protocol should work. But I do not like the idea of promoting blocks under the tmp directory. Blocks under tmp are under construction. Blindly finalizing them explicitly introduces polluted blocks into the system, and it depends entirely on the NN to clean them up. This design does not seem clean to me. Any minor error on the NN side might cost a lot. Bugs like HADOOP-4810 caused Yahoo to lose quite an amount of data and introduced problems like HADOOP-4692 and other problems whose cause we could not identify yet. I think it would be better if DataNodes did not introduce pollution in the first place.

          Raghu Angadi added a comment -

          @Dhruba, could you check whether the following analysis of a possible corruption is correct?

          Say a block is being written and the gen stamp is consistent across all three datanodes (the common case):

          • Say the block sizes after a cluster restart are x+5, x+10, and x+15 (on the three datanodes respectively). But this does not mean the checksum file is correct, since a DataNode could be killed at any time; even the OS could restart.
          • When the datanodes join the cluster, the blocks on D1 and D2 will be deleted since they are smaller than the block on D3. Later, the block on D3 will be reported as corrupt since the checksums don't match. This is hard corruption.
          • Note that we will lose any data on the block that was synced earlier.

          > I am unable to visualize a scenario where this could cause a bug.

          That does not imply it is correct. The most important job of HDFS is to keep the data intact; that should take priority over new features or schedules. IMHO the current approach of "it is correct until proven wrong" does not really suit critical parts. For example, a couple of months back there were no known corruption issues, but we later saw many such issues.

          I am not saying we can prove everything. But we should be conservative and do only what is already known to be correct. In this case, the block files are valid only up to the last sync, so truncating to the last sync (or to some known-good length) is better.

          dhruba borthakur added a comment -

          Hairong, I understand your point of view. A bug in the generation-stamp protocol could cause data corruption.

          The alternative is to not promote blocks to the main storage when the datanode restarts. In this case, the data that was written by the client to that datanode is lost. This is data written by an application, and it would be nice if HDFS can make every effort to recover this data instead of deleting it. This is true with most other file systems. For example, if an application writes some data to an ext3 file and then dies before closing/fsync it, the OS/FS does not delete the data that was cached in the OS pages. It makes every effort to persist it. If we can have similar semantics for HDFS, it will be a more powerful system, isn't it? An HDFS application that does not issue a "fsync", cannot rely on the fact that all the data it has written will be persisted, but as long as HDFS makes a good effort to keep all the data, that will be nice, isn't it?

          So, this issue all boils down to the tradeoff between a "file system that never persists data unless the writer explicitly invoked a fsync" and "complexity in the namenode, thereby introducing buggy code".

          I can vouch for one application that we run in-house that would definitely like the behaviour where HDFS makes every effort to persist data (rather than invoking sync frequently). HBase can use this feature too (in the future) to recover HBase transactions that lie beyond the sync point (good to have, not a hard requirement).

          dhruba borthakur added a comment -

          @Raghu. In your test case, the NN knows that the file is under construction. addStoredBlock() as well as a block report should not touch this block at all. The existence of a lease record implies that the NN has relinquished control of this file to the writer/datanode pair. Hence, the NN cannot make any decisions (corrupt block? replicate block?) etc. for this block.

          Raghu Angadi added a comment -

          > In your test case, the NN knows that the file is under construction.

          Assuming client is also restarted, when does the lease expire? Does the corruption happen after the expiry?

          thanks.

          Hairong Kuang added a comment -

          Dhruba, with the idea that I proposed, we still preserve the data under tmp. Block finalization is delayed until the file's lease expires; the NN will initiate lease recovery and close the file. The proposal also tries to restore the data structures on both the NN and DN sides to their state from before dfs went down. So any synced data should be available to a reader if it was available to a reader before dfs went down.

          dhruba borthakur added a comment -

          @Raghu: The hard-limit lease timeout is currently one hour. The NN will start lease recovery after 1 hour. The lease recovery process locates the right replicas (with the same size), bumps up the generation stamp on all these datanode(s), and persists the new block id (that has the new generation stamp) into the inode. Since the inode map now has a new block id (because of the change in gen stamp), all old replicas that have the old generation stamp do not belong to any inode. If such an old block checks in with the NN via a block report, it will get deleted. Does that sound right?

          @Hairong: I get your proposal. I am starting to like it a lot!
          >When a DataNode starts up, it reads blocks under tmp and put them to OngoingCreates data structure
          Do we really need to do this? Even if we leave the blocks in the tmp directory, it will be ok, won't it? A block report won't contain this block, but that should not matter, because the NN block-report processing always ignores the last block of a file that is under construction. Then, when lease recovery occurs, this block can move from the tmp directory to the real block directory.

          dhruba borthakur added a comment -

          Hairong's proposal:

          The DataNode leaves the blocks under tmp untouched at startup time. Instead it leaves those blocks for the lease recovery process to promote them. When a DataNode starts up, it reads the blocks under tmp and puts them into the ongoingCreates data structure. A block report does not contain these blocks, but that is ok: the NN block report processing always ignores the last block of a file under construction.

          Raghu Angadi added a comment -

          @Raghu: The hard-limit-lease-timeout is currently one hour. The NN will start lease recovery after 1 hour. The lease recovery process locates the right replicas (with the same size) bumps up the generation stamp on all these datanode(s), and persists the new block id (that has the new generation stamp) into the Inode. Since the inode map now has a new block id (because of the change in gen stamp), all old replicas that have the old generation stamp do not belong to any inode. If such a old block checks in with the NN via a block report, it will get deleted. Does it sound right?

          For the scenario I outlined above, will that work? All the blocks have the same gen-stamp and have (slightly) different lengths. For that scenario, could you tell how it ends up with a valid block? Sorry, you might have already answered this in the above comment, but I could not easily see it.

          Raghu Angadi added a comment -

          > If such a old block checks in with the NN via a block report, it will get deleted. Does it sound right?
          So we do lose that block. Isn't that a problem?

          dhruba borthakur added a comment -

          > From the scenario I outlined above, will it work? All the blocks have same gen-stamp and have (slightly) different lengths. For the given scenario above, could you tell how it ends up with a valid block?

          This is the core of the generation-stamp recovery logic. The primary datanode contacts each of the datanode(s) in the pipeline and retrieves the generation stamp and the length of each of the replicas. It picks as valid only those replicas whose generation stamp equals or exceeds the generation stamp stored in the NN. Among these "valid" replicas, it picks only the replica(s) that have the smallest size. These are the "real valid" replicas of the block. The other replicas can be deleted.

          The primary datanode then stamps all those "valid" replicas with a new generation stamp and updates the namenode inode list with this new generation stamp for this block.
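          To make the selection rule concrete, a small self-contained sketch (illustrative class and field names, not the actual 0.20 code):

          import java.util.ArrayList;
          import java.util.List;

          // Keep replicas whose generation stamp is >= the NN's stamp, then among those keep
          // only the ones with the smallest length; these are the "real valid" replicas.
          class ReplicaChooser {
            static class Replica {
              final String datanode; final long genStamp; final long length;
              Replica(String dn, long gs, long len) { datanode = dn; genStamp = gs; length = len; }
            }

            static List<Replica> chooseValid(List<Replica> reported, long nnGenStamp) {
              long smallest = Long.MAX_VALUE;
              for (Replica r : reported) {
                if (r.genStamp >= nnGenStamp) smallest = Math.min(smallest, r.length);
              }
              List<Replica> valid = new ArrayList<Replica>();
              for (Replica r : reported) {
                if (r.genStamp >= nnGenStamp && r.length == smallest) valid.add(r);
              }
              return valid;   // these get restamped with a new generation stamp; the rest are deleted
            }
          }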

          Raghu Angadi added a comment -

          > The primary datanode then stamps all those "valid" replicas with a new generation stamp and updates the namenode inode list with this new generation stamp for this block.

          Thanks, that is useful. I am still trying to deduce whether you are implying there will be a corruption or a valid replica.

          As I see it, in this case there will be one replica with 'x+5' bytes and the new gen-stamp. But given that the checksum does not match (no sync to the native filesystem was done, and all three datanodes were killed while they were in the write() system call), it looks like we end up with corruption, right? Am I missing something?

          dhruba borthakur added a comment - - edited

          An offline discussion with Raghu resulted in this proposal (slight modification to Hairong's proposal):

          The Datanode, on startup, verifies the length of each block in the "blocksBeingWritten" directory against its corresponding meta file's length (and truncates the block file if necessary to match the meta file). It inserts these blocks into ongoingCreates. It then leaves those blocks for the lease recovery process to promote them. A block report does not contain these blocks, but that is ok: the NN block report processing always ignores the last block of a file under construction.

          Hairong Kuang added a comment -

          This sounds good to me. In addition, the DN still needs to report to the NN all blocks in the tmp directory, and the NN either instructs the DN to delete a reported tmp block if it is not under construction or adds it to the target set of the file inode it belongs to. I assume that the NN does not persist the "targets" of an under-construction file.

          dhruba borthakur added a comment -

          An offline discussion with Hairong resulted in detecting that a special purpose block report (for blocks in blocksBeingWritten directory) needs to be sent by the datanode at start-up time. The Namenode has to process this report specially: it should not insert these blocks into blocksMap, instead it should update the targets of the last blocks for filesUnderConstruction.

          Given the "special" needs of the above, I think it is better if we promote the blocks from "blocksBingWriten" directory to the main directory (after matching/truncating sizes of block files and their crc files) into the main data directory.

          So, I propose that we do the following:
          At start-up, the DN matches the blocks in the "blocksBeingWritten" directory against their meta files. If the sizes do not match, then the data file is truncated to the length described by the CRCs in the meta file. This ensures that the block is likely to be a valid one. Then, these blocks are promoted to the main block directory. (The generation-stamp protocol will detect inconsistent replicas during lease recovery.)
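          For illustration, a minimal self-contained sketch of the length check, assuming an illustrative meta-file header size and 4-byte CRC32 checksums (and ignoring a possible partial last chunk); not the actual meta-file format:

          class BbwLengthCheck {
            static final int META_HEADER_LEN = 7;   // assumed header size, for illustration only
            static final int CHECKSUM_SIZE = 4;     // CRC32 is 4 bytes per data chunk

            static long validDataLength(long metaFileLen, long blockFileLen, int bytesPerChecksum) {
              long chunks = Math.max(0L, (metaFileLen - META_HEADER_LEN) / CHECKSUM_SIZE);
              long coveredByCrc = chunks * bytesPerChecksum;
              // the block file is truncated to the largest length the stored CRCs actually cover
              return Math.min(blockFileLen, coveredByCrc);
            }
          }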

          Hairong Kuang added a comment -

          I still feel very uncomfortable about promoting blocks under "blocksBeingWritten". Basically this changes the state of these blocks: the DataNode no longer knows that they are being written. In the most recent patch to HADOOP-4692, block replication behaves differently depending on the state of a block. It may wrongly delete an on-disk block whose length does not match the NN-recorded length if a being-written block has been promoted to a permanent block.

          Sanjay Radia added a comment -

          I really like Hairong's suggestion ( https://issues.apache.org/jira/browse/HADOOP-4663?focusedCommentId=12668127#action_12668127 ) to keep the DN tmp blocks as ongoingCreates and send the special BR to the NN. This is symmetric to the inodes-under-construction and lease recovery on the NN.

          The main issue I had with the older approaches was the inconsistency:

          • we added the notion of tmp because we didn't want to send these blocks as part of a BR to the NN,
          • but if the DN restarted, all blocks in tmp were moved to the main directory and included in the BR anyway.

          Hairong's suggestion keeps the semantics of tmp the same across reboots of DN.
          This is very clean even though it adds additional code and a new "block under cons" BR.
          Furthermore it allows us to verify that these blocks match those under construction on the NN side.
          Our past attempts at sync/append and at fixing this bug have been unsuccessful because I think we were trying to
          be too clever.

          The problem I have with Dhruba's suggestion is that it retains the inconsistency I mention above and somehow appears
          to be trying to avoid the special BR. If, on a reboot, the DN
          moves some blocks from tmp to main (after doing the validations Dhruba suggested), why have them in tmp in the first place?

          One could consider never sending this special BR at all. This does not work because the blocks in tmp may not
          ever get cleaned in some circumstances. For example, if an NN is restarted from
          an older fsimage, the tmp files in the DNs will never be removed.

          So +1 for Hairong's suggestion.

          dhruba borthakur added a comment -

          I have been thinking of Hairong's suggestion as well. I agree that it makes sense for the blocks inside "blocksBeingWritten" directory to not be auto-promoted, but instead make lease recovery promote them on demand. The only problem I had with this approach is that a "special" block report is needed.

          What if every block report had two counters up front? The first counter would list the number of normal blocks in the block report; the second counter would give the number of blocks in the block report that are picked up from the "blocksBeingWritten" directory. The NN, while processing a block report, would first look at the first counter and process that many blocks from the block report as usual. Then it would look at the second counter and special-process that many blocks from the block report.
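          A hypothetical sketch of how the NN could consume such a two-counter report (helper names assumed, and the report simplified to a flat array of block ids; not an actual 0.20 API):

          // Finalized blocks come first in 'report', followed by blocks from blocksBeingWritten;
          // the two counters say how many of each kind the report carries.
          void processReport(long[] report, int numFinalized, int numBeingWritten) {
            for (int i = 0; i < numFinalized; i++) {
              addStoredBlock(report[i]);                  // normal processing into blocksMap
            }
            for (int i = numFinalized; i < numFinalized + numBeingWritten; i++) {
              updateUnderConstructionTarget(report[i]);   // special processing: targets only, not blocksMap
            }
          }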

          dhruba borthakur added a comment -

          This document appendQuestions.txt attempts to answer some questions related to the "Append" feature for HDFS.

          Nigel Daley added a comment -

          As discussed on core-dev@ (http://www.nabble.com/Hadoop-0.19.1-td21739202.html) we will disable append in 0.19.1. Moving these append related issues to 0.19.2.

          dhruba borthakur added a comment -

          An offline discussion with Hairong, Sanjay, Rob Chansler, Raghu and Konstantin resulted in these observations. "bbw" refers to the "blocksBeingWritten" directory.

          1. Leave blocks in the bbw directory even when the datanode restarts. Only when the block is finalized (when the writer closes the block or lease recovery occurs) does the block move to the real block directory.
          2. The first block report (following a datanode registration) sends all blocks (including blocks in bbw).
          3. The block report processing on the namenode ignores blocks that are under construction.
          4. Lease recovery should verify the crc of a block before it gets promoted from bbw to the real block directory.
          5. When lease recovery occurs, the datanode should terminate writer threads before returning the length of the block.

          dhruba borthakur added a comment -

          This patch is needed to make Hbase work correctly on Hadoop 0.20.

          dhruba borthakur added a comment -

          Patch for hadoop 0.20 release.

          dhruba borthakur added a comment -

          This patch passes all unit tests.

          stack added a comment -

          Disclaimer: I'm not up on the background to this patch but taking a look anyways.

          Patch looks good. I like the TestAppend4 addition.

          Do you want to leave xxxtestAppendWithReplication and xxxtestAppendSyncHalfBlock in there and the commented out code?

          Should "+ DataNode.LOG.info("XXX createTmpFile failed for file " + f + " Block " + b);" be logged at WARN level?

          Nicolas Spiegelberg added a comment -

          Although related to HDFS-142, TestAppend4 is an ongoing suite of unit tests we're trying to tailor to the HBase use case. TestAppend4 is currently revealing other bugs, so some of the code is commented out. We're working on fixing the other bugs and will update the Test file when we fix them.

          stack added a comment -

          @Nicolas Sounds good.

          Nicolas Spiegelberg added a comment -

          In many of the TestFileAppend4 tests, we need to sequentially shut down datanodes in a cluster. As each datanode is killed, the DFSClient sees its pipeline dying and signals a recoverBlock. To avoid having the namenode bump the sequence number of an open file on recoverBlock, we need to put it in safe mode. HDFS-988 fixes a number of bugs, including recoverBlock, where the namenode will modify the file even though it is in safemode.

          Nicolas Spiegelberg added a comment -

          Added a checksum check of the last chunk on restart. Added TestFileAppend4 to test HDFS-200 + HDFS-142. Passes all tests.

          Todd Lipcon added a comment -

          Hi Nicolas. I needed to add this fix to MiniDFSCluster (backported from part of HDFS-409) for the tests to pass. Otherwise, sometimes the MiniDFS cluster wouldn't have received heartbeats from all of its DNs, and the replication factor would be too low.

          Karthik Ranganathan added a comment -

          Hey guys,

          If we had 2 or more files in the blocksBeingWritten directory, the data node would not be able to start up, because the code internally tries to add the BlockAndFile objects to a TreeSet, but BlockAndFile does not implement Comparable. The first addition goes through since the TreeSet does not have to compare anything. The data node dies on restart with the following exception:

          2010-03-23 15:50:23,152 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.lang.ClassCastException: org.apache.hadoop.hdfs.server.datanode.FSDataset$BlockAndFile cannot be cast to java.lang.Comparable
          at java.util.TreeMap.put(TreeMap.java:542)
          at java.util.TreeSet.add(TreeSet.java:238)
          at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.getBlockAndFileInfo(FSDataset.java:247)
          at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolume.recoverBlocksBeingWritten(FSDataset.java:539)
          at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolume.<init>(FSDataset.java:381)
          at org.apache.hadoop.hdfs.server.datanode.FSDataset.<init>(FSDataset.java:895)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:305)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:219)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1337)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1292)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1300)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1422)

          This patch makes the BlockAndFile class implement Comparable, and adds a unit test (thanks Nick) that verifies this case.
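          A simplified, self-contained illustration of the failure mode and the fix (not the actual FSDataset inner class):

          import java.util.TreeSet;

          // A TreeSet requires its elements to be Comparable (or an explicit Comparator), so the
          // second add() below would throw ClassCastException without the compareTo implementation.
          class BlockAndFile implements Comparable<BlockAndFile> {
            final long blockId;
            final String file;

            BlockAndFile(long blockId, String file) { this.blockId = blockId; this.file = file; }

            public int compareTo(BlockAndFile other) {
              // order by block id, the same way a set of blocks would be ordered
              return blockId < other.blockId ? -1 : (blockId == other.blockId ? 0 : 1);
            }

            public static void main(String[] args) {
              TreeSet<BlockAndFile> set = new TreeSet<BlockAndFile>();
              set.add(new BlockAndFile(1L, "blk_1"));
              set.add(new BlockAndFile(2L, "blk_2"));
              System.out.println(set.size());   // 2
            }
          }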

          Todd Lipcon added a comment -

          I found a bug in the append code where it doesn't work properly with the following sequence:

          • open a file for write
          • write some data
          • close it
          • the DN with the lowest name dies, but is not yet marked dead on the NN
          • a client calls append() to try to recover the lease (not knowing that the file isn't currently under construction)

          In this case, the client ends up thinking it has opened the file for append, and there's a new lease on the NN side, but on the client side it's in an error state where close() will throw IOE (and not close the new lease).

          Attaching a new case for TestFileAppend4 for this situation.

          Todd Lipcon added a comment -

          Renaming JIRA to reflect the actual scope of this issue in the branch-20 sync work

          Nicolas Spiegelberg added a comment -

          Added patch to fix Todd's deaddn problem. The main problem: DFSOutputStream called processDatanodeError() but then ignored the return value. This meant that any slew of pipeline creation exceptions would be ignored and the client would think that append() passed. Good catch!

          sam rash added a comment -

          Fixes an issue with lease recovery failing on blocks that are already finalized.

          Todd Lipcon added a comment -

          Uploading two more patches for 0.20 append:

          • hdfs-142-commitBlockSynchronization-unknown-datanode.txt fixes a case where FSN.getDatanode was throwing an UnregisteredDatanodeException since one of the original recovery targets had departed the cluster (in this case been replaced by a new DN with the same storage but a different port). This exception was causing the commitBlockSynchronization to fail after removing the old block from blocksMap but before putting in the new one, making both old and new blocks inaccessible, and causing any further nextGenerationStamp calls to fail.
          • hdfs-142-testcases.txt includes two new test cases:
            • testRecoverFinalizedBlock stops a writer just before it calls completeFile() and then has another client recover the file
            • testDatanodeFailsToCommit() injects an IOE when the DN calls commitBlockSynchronization for the first time, to make sure that the retry succeeds even though updateBlocks() was already called during the first synchronization attempt.
            • These tests pass after applying Sam's patch to fix refinalization of a finalized block.
          Todd Lipcon added a comment -

          Attaching a patch with two more fixes:

          • If a block is received that is a part of a file that no longer exists, remove it.
            This prevents blocks from getting orphaned in the blocksBeingWritten directory forever
          • File recovery happens after reassigning lease to an NN_Recovery client
            This also includes safeguards and tests to ensure that straggling commitBlockSynchronization
            calls cannot incorrectly overwrite the last block of a file with an old generation stamp
            or a different block ID.
          Todd Lipcon added a comment -

          The last patch seems to break TestLeaseRecovery, since that test abuses updateBlock() to truncate a block for testing purposes. I'll post an update soon.

          Todd Lipcon added a comment -

          Here's a fix for TestLeaseRecovery so that it passes even with the new safeguards in FSDataset.updateBlock

          sam rash added a comment -

          todd: one question about the latest 2 patches. Is clientName accessed from multiple threads via setClientName/getClientName?
          If so, shouldn't it be volatile?

          (I think that object, INodeFileUnderConstruction can be accessed from other threads, in particular if the NN starts recovery, sets it, and another RPC thread tries to handle a close or sync call)

          Todd Lipcon added a comment -

          It's accessed from multiple threads, but those threads are always synchronized on the FSNamesystem lock anyway - none of the lease-related stuff has any actual concurrency. Marking it volatile wouldn't cause any harm, but I think we assume pretty much everywhere that the leases aren't mucked with without holding the coarse grain lock.

          BTW, on an unrelated note, I found that TestNodeCount times out due to the MiniDFSCluster changes imported by HDFS-409 - will upload a fix for that here soon.

          Todd Lipcon added a comment -

          Posted a patch to HDFS-606 which is important to not lose replicas (TestFileAppend2 was failing occasionally for me with this CME)

          Todd Lipcon added a comment -

          Had a test failure of TestFileAppend2 today with:

          [junit] 2010-05-12 12:20:46,249 WARN protocol.InterDatanodeProtocol (DataNode.java:recoverBlock(1537)) - Failed to getBlockMetaDataInfo for block (=blk_7206139570868165957_1054) from datanode (=127.0.0.1:42179)
          [junit] java.io.IOException: Block blk_7206139570868165957_1054 does not exist in volumeMap.
          [junit] at org.apache.hadoop.hdfs.server.datanode.FSDataset.validateBlockMetadata(FSDataset.java:1250)
          [junit] at org.apache.hadoop.hdfs.server.datanode.DataNode.getBlockMetaDataInfo(DataNode.java:1425)
          [junit] at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1521)
          [junit] at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1616)

          This failure was actually on our vanilla 0.20 Hudson, not on the append branch.

          In investigating this I noticed that validateBlockMetadata is not marked synchronized in FSDataset, and thus accesses the volumeMap HashMap in an unsynchronized manner. If this races with, e.g., a rehash of the hashmap, it can give false non-existence.

          Doesn't seem to be a problem in trunk append (this function is gone)
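          A sketch of the shape of the fix carried in validateBlockMetaData-synchronized.txt (signature assumed, body elided):

          // Marking the method synchronized makes the volumeMap lookup happen under the FSDataset
          // lock, like the other accessors, so a concurrent rehash cannot produce a false miss.
          public synchronized void validateBlockMetadata(Block b) throws IOException {
            // ... existing checks against volumeMap, unchanged ...
          }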

          sam rash added a comment -

          This reminds me: in the test I posted on HDFS-1057, I had two tests that fail doing concurrent reads and appends. The tests actually fail without the concurrent read patch. I haven't had time to look into it. You can run the tests in TestFileConcurrentReader. I can file a separate JIRA for this as well, but it seems append-related.

          // fails due to issue w/append, disable
          public void _testUnfinishedBlockCRCErrorTransferToAppend() throws IOException {
            runTestUnfinishedBlockCRCError(true, SyncType.APPEND, DEFAULT_WRITE_SIZE);
          }

          // fails due to issue w/append, disable
          public void _testUnfinishedBlockCRCErrorNormalTransferAppend() throws IOException {
            runTestUnfinishedBlockCRCError(false, SyncType.APPEND, DEFAULT_WRITE_SIZE);
          }
          Todd Lipcon added a comment -

          Small fix for another test failure exposed by TestFileAppend2.testComplexAppend (when run with Java assertions enabled). When we removed blocks from recentInvalidateSets, we didn't remove the collections when they became empty, which triggered an assertion at the top of DatanodeDescriptor.addBlocksToBeInvalidated.

          (This is not an issue in trunk; trunk already has the same fix.)
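
          A minimal sketch of the idea, assuming recentInvalidateSets maps a storage ID to the blocks pending invalidation on that datanode (names here are assumptions, not the exact patch):

          Collection<Block> invalidateSet = recentInvalidateSets.get(storageID);
          if (invalidateSet != null && invalidateSet.remove(block)) {
            // Drop the per-datanode set once it becomes empty, so the assertion at the
            // top of DatanodeDescriptor.addBlocksToBeInvalidated never sees an empty
            // collection.
            if (invalidateSet.isEmpty()) {
              recentInvalidateSets.remove(storageID);
            }
          }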

          Todd Lipcon added a comment -

          appendFile() is made up of two synchronized blocks, and there is no re-check of file existence (or the lease) when entering the second one. Uploading a test case and fix.
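
          A minimal sketch of the re-check on re-entering the FSNamesystem lock (method names are illustrative, not the exact patch):

          synchronized (this) {
            // Re-verify the file and the lease: another client may have deleted the
            // file or taken over the lease while the lock was not held.
            INodeFile file = dir.getFileINode(src);
            if (file == null) {
              throw new FileNotFoundException("File " + src + " not found during append()");
            }
            checkLease(src, holder);  // assumed helper that throws if the lease changed hands
            // ... continue setting up the append under the same lock ...
          }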

          Todd Lipcon added a comment -

          Attached patch treats replicas recovered during DN startup as possibly truncated, and thus prefers replicas from still-running DNs during block recovery whenever such replicas are available. (The included test case explains this better.)
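
          A hedged sketch of the selection logic that implies (BlockRecord and wasRecoveredOnStartup are illustrative names, not the patch's actual identifiers):

          boolean haveLiveReplica = false;
          for (BlockRecord r : replicas) {
            if (!r.wasRecoveredOnStartup()) {
              haveLiveReplica = true;
              break;
            }
          }
          if (haveLiveReplica) {
            // Replicas recovered during a datanode restart may be truncated, so skip
            // them whenever a replica from a still-running datanode is available.
            for (Iterator<BlockRecord> it = replicas.iterator(); it.hasNext(); ) {
              if (it.next().wasRecoveredOnStartup()) {
                it.remove();
              }
            }
          }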

          Todd Lipcon added a comment -

          New version of the dont-recover-rbw patch (this is what I've been testing against)

          Nicolas Spiegelberg added a comment -

          The TestFileAppend4 tests depend on a number of JIRAs already being applied. Explicitly specifying them here so it's easier to add this to the 0.20-append branch:

          HDFS-826 : Used to determine which DNs to kill and which are still valid.
          HDFS-101 : For when we try killing DNs further down the pipeline.
          HDFS-793 : Because we verify data size after all DNs have sent an ACK.
          HDFS-988 : Fixes safemode. Before this, only the last DN had the highest seqnum on test cluster restart.

          Nicolas Spiegelberg added a comment -

          Patch for the 0.20-append branch. Starts with HDFS-142_20.patch and includes all patches up to appendFile-recheck-lease.txt. I had trouble adding the relatively new recover-rbw-v2.txt, so I left that for Todd. Assumes that the 0.20-append patches in HDFS-826, HDFS-988, & HDFS-101 have been applied previously.

          dhruba borthakur added a comment -

          I have committed this. Thanks Sam, Nicolas and Todd.

          Jitendra Nath Pandey added a comment -

          Patch for the 0.20-security branch uploaded.

          Suresh Srinivas added a comment -

          Can you please add a license banner to TestFileAppend4.java?

          Jitendra Nath Pandey added a comment -

          Added Apache License header.

          Suresh Srinivas added a comment -

          +1 for the patch.

          Suresh Srinivas added a comment -

          I committed the patch to 0.20-security.

          Matt Foley added a comment -

          Closed upon release of 0.20.205.0


            People

            • Assignee:
              dhruba borthakur
            • Reporter:
              Raghu Angadi
            • Votes:
              0
            • Watchers:
              8
