Hadoop HDFS / HDFS-8193

Add the ability to delay replica deletion for a period of time

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.7.0
    • Fix Version/s: None
    • Component/s: namenode
    • Labels: None
    • Target Version/s:

      Description

      When doing maintenance on an HDFS cluster, users may be concerned about the possibility of administrative mistakes or software bugs deleting replicas of blocks that cannot easily be restored. It would be handy if HDFS could be made to optionally not delete any replicas for a configurable period of time.
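
      As a sketch of what such a knob could look like (the property name, default, and class below are hypothetical and would be settled in a patch, not taken from existing HDFS code), the DataNode would read a single delay setting:

        import org.apache.hadoop.conf.Configuration;

        /** Sketch only: a hypothetical DN-side setting for delaying replica deletion. */
        public class ReplicaDeletionDelayConfig {
          // Hypothetical key and default; 0 means delete immediately (current behavior).
          public static final String KEY = "dfs.datanode.block.deletion.delay.ms";
          public static final long DEFAULT = 0L;

          public static long getDelayMs(Configuration conf) {
            return conf.getLong(KEY, DEFAULT);
          }
        }

      Setting the value to, say, 86400000 (24 hours) would keep deleted replicas on disk for a day before they are physically removed.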

        Issue Links

          Activity

          Chris Nauroth added a comment -

          Hello Aaron T. Myers and Zhe Zhang. HDFS-6186 added the dfs.namenode.startup.delay.block.deletion.ms configuration property. Can you please describe how this new feature is different from that, and if applicable, how the two are related? HDFS-6186 only applies at NameNode startup. Is the new feature something that could be triggered at any time on a running NameNode, such as right before a manual HA failover?

          Zhe Zhang added a comment -

          Thanks Chris for bringing up the questions.

          HDFS-6186 only applies at NameNode startup. Is the new feature something that could be triggered at any time on a running NameNode, such as right before a manual HA failover?

          The short answer is yes. One can think of it as a "trash" for block replicas, fully controlled by the DN hosting them. This should shelter block replicas from most administrative mistakes and NN bugs (which are more likely than DN bugs, given the NN's complexity) for a period of time. A rough sketch of the mechanics appears at the end of this comment.

          To answer the question from Suresh Srinivas under HDFS-6186:

          One problem with not deleting the blocks for a deleted file is, how does one restore it? Can we address in this jira pausing deletion after startup and address the suggestion you have made, along with other changes that might be necessary, in another jira.

          First, NN bugs could cause block replicas to be deleted without the file being deleted. Second, it's rather easy to back up NN metadata before performing maintenance, but extremely difficult to back up actual DN data. This JIRA aims to address that discrepancy.

          As future work, we plan to investigate an even more radical retention policy, where block replicas are never deleted until the DN is actually running out of space. At that point, victims are selected from among the pending-deletion replicas using a smart algorithm and are overwritten by incoming replicas. We'll file a separate JIRA for that, after this JIRA builds the basic DN-side replica retention machinery.
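
          For concreteness, here is a rough sketch of the basic DN-side trash machinery described above (all class, method, and directory names are hypothetical, not from an actual patch): instead of unlinking a replica's block and meta files, the DN moves them into a per-volume trash directory stamped with the deletion time, and a periodic purge removes entries older than the configured delay.

            import java.io.File;
            import java.io.IOException;
            import java.nio.file.Files;
            import java.nio.file.Path;
            import java.nio.file.StandardCopyOption;
            import java.util.Comparator;
            import java.util.stream.Stream;

            /** Sketch only: hold deleted replicas in a per-volume trash for a grace period. */
            public class ReplicaTrash {
              private final Path trashRoot; // hypothetical, e.g. <volume>/current/<bpid>/deletion-trash
              private final long delayMs;   // configured retention window

              public ReplicaTrash(Path trashRoot, long delayMs) {
                this.trashRoot = trashRoot;
                this.delayMs = delayMs;
              }

              /** Called where the DN would otherwise unlink the block and meta files. */
              public void moveToTrash(File blockFile, File metaFile) throws IOException {
                // Trash entries are directories named by the deletion timestamp.
                Path dir = trashRoot.resolve(Long.toString(System.currentTimeMillis()));
                Files.createDirectories(dir);
                Files.move(blockFile.toPath(), dir.resolve(blockFile.getName()),
                    StandardCopyOption.ATOMIC_MOVE);
                Files.move(metaFile.toPath(), dir.resolve(metaFile.getName()),
                    StandardCopyOption.ATOMIC_MOVE);
              }

              /** Periodic purge: physically delete trash entries older than the delay window. */
              public void purgeExpired() throws IOException {
                if (!Files.isDirectory(trashRoot)) {
                  return;
                }
                long cutoff = System.currentTimeMillis() - delayMs;
                try (Stream<Path> entries = Files.list(trashRoot)) {
                  entries.filter(p -> Long.parseLong(p.getFileName().toString()) < cutoff)
                         .forEach(ReplicaTrash::deleteRecursively);
                }
              }

              private static void deleteRecursively(Path dir) {
                try (Stream<Path> files = Files.walk(dir)) {
                  files.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
                } catch (IOException e) {
                  // Best effort; a real implementation would log and retry.
                }
              }
            }

          In principle, restoring a replica would then amount to moving its files back into the volume's finalized directory so that the next block report picks them up.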

          Chris Nauroth added a comment -

          Thank you for the response. That clarifies it for me.

          If possible, would you please see if there is a way to make the delay visible through metrics and the web UI? Perhaps you could even just populate the same fields that were added in HDFS-5986 and HDFS-6385.

          Zhe Zhang added a comment -

          If possible, would you please see if there is a way to make the delay visible through metrics and the web UI?

          That's a great point. I believe admins will want to monitor both the delay and the number of pending deletions. We will add this either in this JIRA or in a follow-on.

          Perhaps you could even just populate the same fields that were added in HDFS-5986 and HDFS-6385.

          Seems to me these metrics differ for each DN. Maybe we should add them to the DN web UI / metrics? We could sum up the number of pending-deletion replicas and show the total on the NN, but the per-DN delays are hard to summarize.
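
          To make the delay and the pending-deletion count visible, a DN-side metrics2 source along the following lines could work (the class and metric names are placeholders, not existing HDFS metrics):

            import org.apache.hadoop.metrics2.annotation.Metric;
            import org.apache.hadoop.metrics2.annotation.Metrics;
            import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
            import org.apache.hadoop.metrics2.lib.MutableGaugeLong;

            /** Sketch only: per-DataNode metrics for deletion-delayed replicas. */
            @Metrics(name = "ReplicaTrashMetrics",
                     about = "Deletion-delayed replica statistics", context = "dfs")
            public class ReplicaTrashMetrics {
              @Metric("Number of replicas currently held in the deletion trash")
              MutableGaugeLong pendingDeletionDelayedReplicas;

              @Metric("Bytes consumed by deletion-delayed replicas")
              MutableGaugeLong deletionDelayedBytes;

              public static ReplicaTrashMetrics create() {
                return DefaultMetricsSystem.instance().register(
                    "ReplicaTrashMetrics", "Deletion-delayed replica statistics",
                    new ReplicaTrashMetrics());
              }
            }

          The gauges would be updated (incr()/decr() or set()) as replicas enter and leave the trash; the NN could additionally aggregate the counts if we decide to surface a cluster-wide total.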

          Chris Nauroth added a comment -

          Seems to me these metrics differ for each DN.

          Ah yes, I missed the point that you were aiming for per-DN granularity. In that case, yes, DN metrics would make sense. You also could potentially take the approach done in HDFS-7604 to publish the counters back to the NN in heartbeats, and that would enable the NameNode to display per-DN stats on the Datanodes tab. It's probably worth doing a quick UI mock-up to check if that really makes sense though. Those tables can get crowded quickly.

          Thanks again.

          Zhe Zhang added a comment -

          Thanks for the pointers Chris! A mock-up is a very good idea; HDFS-5986 and HDFS-6385 are good examples to follow.

          Suresh Srinivas added a comment -

          Zhe Zhang, I am not clear on what use case this is solving.
          We now have a mechanism to delay block deletion after namenode startup. It precisely targets the issue of an administrator copying the wrong (older) fsimage, which could result in deletion of blocks and loss of data.

          First, NN bugs could cause block replicas to be deleted without the file being deleted. Second, it's rather easy to back up NN metadata before performing maintenance, but extremely difficult to back up actual DN data. This JIRA aims to address that discrepancy.

          Second use case: the NN deleted a file and the admin wants to restore it (the case of NN metadata backup). Going back to an older fsimage is not that straightforward, and it is a solution to be used only in a desperate situation. It can cause corruption for other applications running on HDFS, and it also results in loss of newly created data across the file system. Snapshots and trash are the solutions for this.

          First use case: the NN deletes blocks without deleting files. Have you seen an instance of this? It would be great to get a one-pager on how one handles this condition. Does the NN keep deleting the blocks until it is hot-fixed? Also, completing deletion of blocks in a timely manner is important for a running cluster. Not all files require the same reliability. Intermediate data and tmp files need to be deleted immediately to free up cluster storage and avoid the risk of running out of space. At the datanode level, there is no notion of whether files are temporary or important ones that need to be preserved. So a trash such as this can result in retaining a lot of tmp files, and deletes not being able to free up storage within the cluster fast enough.

          Can you please talk about any other administrative mistakes that you are targeting with this functionality?

          Zhe Zhang added a comment -

          Thanks Suresh Srinivas for the helpful comments!

          Second use case: the NN deleted a file and the admin wants to restore it (the case of NN metadata backup). Going back to an older fsimage is not that straightforward, and it is a solution to be used only in a desperate situation. It can cause corruption for other applications running on HDFS, and it also results in loss of newly created data across the file system. Snapshots and trash are the solutions for this.

          You are absolutely right that it's always preferable to protect data at the file level rather than the block level. This JIRA is indeed intended as a last resort for desperate situations. It's similar to recovering data directly from hard disk drives when the file system is corrupted beyond repair. It's fully controlled by the DN and is the last layer of protection when all the layers above have failed (trash mistakenly emptied, snapshots not correctly set up, etc.).

          First use case: the NN deletes blocks without deleting files. Have you seen an instance of this? It would be great to get a one-pager on how one handles this condition.

          One possible situation (recently fixed by HDFS-7960) is that the NN mistakenly considers some blocks over-replicated because of zombie storages. Even though HDFS-7960 is already fixed, we should do something to protect against possible future NN bugs. This is the crux of why file-level protections, although always desirable, are not always sufficient: the NN may get something wrong, and then we are left with irrecoverable data loss.

          Does the NN keep deleting the blocks until it is hot-fixed?

          In the above case, the NN will keep deleting all replicas it considers over-replicated until it is hot-fixed.

          Also, completing deletion of blocks in a timely manner is important for a running cluster.

          Yes, this is a valid concern. Empirically, most customer clusters do not run even close to full disk capacity. Therefore, adding a reasonable grace period shouldn't delay allocating new blocks. The configured delay window should also be enforced under the constraint of available space (e.g., don't delay deletion when available disk space < 10%); a rough sketch of such a check follows at the end of this comment. We will also add Web UI and metrics support to clearly show the space consumed by deletion-delayed replicas.

          Not all files require the same reliability. Intermediate data and tmp files need to be deleted immediately to free up cluster storage and avoid the risk of running out of space. At the datanode level, there is no notion of whether files are temporary or important ones that need to be preserved. So a trash such as this can result in retaining a lot of tmp files, and deletes not being able to free up storage within the cluster fast enough.

          This is a great point. The proposed work (at least in the first phase) is intended as a best-effort optimization and will always yield to foreground workloads. The goal is to statistically reduce the likelihood and severity of data loss under typical storage consumption conditions. It's certainly still possible for a wave of tmp data to flush more important data out of the DN trashes. We can design smarter eviction algorithms as future work.

          As I commented above, we are considering a more radical approach as a potential next phase of this work, in which deletion-delayed replicas would simply be overwritten by incoming replicas. In that case we might not even need to count deletion-delayed replicas against the space quota, making the feature more transparent to admins.
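
          To illustrate the available-space constraint mentioned earlier in this comment, here is a minimal sketch of the kind of check the DN could apply before delaying a deletion (the class name, 10% threshold, and parameters are illustrative only):

            /** Sketch only: decide whether a replica deletion should be delayed. */
            public final class DeletionDelayPolicy {
              private final long delayMs;              // configured grace period; 0 disables the feature
              private final double minFreeSpaceRatio;  // e.g. 0.10: below this, delete immediately

              public DeletionDelayPolicy(long delayMs, double minFreeSpaceRatio) {
                this.delayMs = delayMs;
                this.minFreeSpaceRatio = minFreeSpaceRatio;
              }

              /**
               * @param volumeCapacity  total bytes on the volume
               * @param volumeAvailable currently available bytes on the volume
               * @return true to move the replica to the trash, false to delete it
               *         immediately (feature disabled or disk too full)
               */
              public boolean shouldDelay(long volumeCapacity, long volumeAvailable) {
                if (delayMs <= 0) {
                  return false; // feature off: keep current behavior
                }
                double freeRatio = (double) volumeAvailable / volumeCapacity;
                return freeRatio >= minFreeSpaceRatio;
              }
            }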

          Zhe Zhang added a comment -

          Can you please talk about any other administrative mistakes that you are targeting with this functionality?

          Sorry, I forgot to address this question. One common admin mistake is simply careless file deletion. The existing HDFS Trash should protect against this, but empirically such a mistake is often noticed only from a sudden drop in block count or total space usage, which the admin team usually monitors closely (in contrast, file count is typically watched less).

          Suresh Srinivas added a comment -

          Empirically, most customer clusters do not run even close to full disk capacity

          You would be surprised what you find in the field. I have seen many customers running at 90% plus and scrambling to find unnecessary files and deleting them to free up space.

          The configured delay window should also be enforced under the constraint of available space (e.g., don't delay deletion when available disk space < 10%)

          The problem with this approach is that the intended protection mechanism works or does not work depending on available space and on whatever other factors we may add in the future. That means that when this feature is really needed, the data may not be there. The approach you describe of overwriting delayed-deletion replicas will run into the same set of issues.

          A user would need more consistent behavior than that.

          How does one restore the blocks or expedite deletion of blocks to free up the storage?

          Zhe Zhang added a comment -

          I have seen many customers running at 90% plus and scrambling to find unnecessary files and deleting them to free up space.

          Thanks for the insights, Suresh! I agree that cases like this are fundamentally hard to handle with trash-based safety methods (this is also a good motivation for erasure coding). I think an empirical study of production clusters should help assess the effectiveness of the proposed feature. I will try to collect some simple usage data first.

          lpstudy added a comment -

          Thanks for your great work on making erasure coding native in HDFS.
          I am working on proactive data protection in HDFS: incorporating a hard-drive failure detection method based on collected SMART attributes into the HDFS core, and scheduling the disk-warning process in advance. I would like erasure coding to be supported natively by HDFS instead of through HDFS-RAID.

          I have a few questions, but I don't know where to ask them, so I will list them here and hope it is not too much of a bother.

          1. I am wondering whether and where I can download the project source code you are working on.
          2. When will this project be completed? Will it take a long time?
          3. Can people like me join your group?

          Zhe Zhang added a comment -

          lpstudy, thanks for the interest! I'll copy your comments over to HDFS-7285 and we can continue the discussion there.

          Zhe Zhang added a comment -

          HDFS-8486 is another use case for the proposed block-level protection.


            People

            • Assignee: Zhe Zhang
            • Reporter: Aaron T. Myers
            • Votes: 0
            • Watchers: 18
