Thanks Suresh Srinivas for the helpful comments!
Second use case: the NN deleted a file and the admin wants to restore it (the NN metadata backup case). Going back to an older fsimage is not that straightforward and is a solution to be used only in desperate situations. It can cause corruption for other applications running on HDFS. It also results in loss of newly created data across the file system. Snapshots and trash are the solutions for this.
You are absolutely right that it's always preferable to protect data at the file level instead of the block level. This JIRA is indeed intended as a last resort for desperate situations, similar to recovering data directly from hard disk drives when the file system is corrupt beyond recovery. It is fully controlled by the DN and serves as the last layer of protection when all layers above have failed (trash mistakenly emptied, snapshots not correctly set up, etc.).
First use case: the NN deletes blocks without deleting files. Have you seen an instance of this? It would be great to get a one-pager on how one handles this condition.
One possible situation (recently fixed by HDFS-7960) is that the NN mistakenly considers some blocks over-replicated, caused by zombie storages. Even though HDFS-7960 has been fixed, we should do something to protect against possible future NN bugs. This is the crux of why file-level protections, although always desirable, are not always sufficient: the NN may get something wrong, and then we are left with irrecoverable data loss.
Does NN keep deleting the blocks until it is hot fixed?
In the above case, the NN will delete all replicas it considers over-replicated until it is hotfixed.
Also, completing the deletion of blocks in a timely manner is important for a running cluster.
Yes, this is a valid concern. Empirically, most customer clusters do not run anywhere near full disk capacity, so adding a reasonable grace period shouldn't delay the allocation of new blocks. The configured delay window should also be enforced under the constraint of available space (e.g., don't delay deletion when available disk space drops below 10%). We will also add Web UI and metrics support to clearly show the space consumed by deletion-delayed replicas.
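To make the space-constrained grace period concrete, here is a minimal sketch of the delay decision. All names (DeletionDelayPolicy, gracePeriodMs, minFreeSpaceRatio) are illustrative assumptions, not actual HDFS classes or configuration keys:

```java
/**
 * Hypothetical sketch of a DN-side deletion-delay policy: a replica deletion
 * is delayed only while the volume has enough free space, and trashed
 * replicas are purged once their grace period expires.
 */
public class DeletionDelayPolicy {
  private final long gracePeriodMs;       // how long a deleted replica lingers in trash
  private final double minFreeSpaceRatio; // below this free-space ratio, delete immediately

  public DeletionDelayPolicy(long gracePeriodMs, double minFreeSpaceRatio) {
    this.gracePeriodMs = gracePeriodMs;
    this.minFreeSpaceRatio = minFreeSpaceRatio;
  }

  /** True if the replica should be moved to trash instead of deleted now. */
  public boolean shouldDelay(long freeBytes, long capacityBytes) {
    double freeRatio = (double) freeBytes / capacityBytes;
    return gracePeriodMs > 0 && freeRatio >= minFreeSpaceRatio;
  }

  /** True if a trashed replica's grace period has elapsed and it can be purged. */
  public boolean isExpired(long trashedAtMs, long nowMs) {
    return nowMs - trashedAtMs >= gracePeriodMs;
  }
}
```

With a one-hour grace period and a 10% free-space floor, a volume at 50% free would delay deletions, while one at 5% free would delete immediately.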
All files don't require the same reliability. Intermediate data and tmp files need to be deleted immediately to free up cluster storage and avoid the risk of running out of space. At the datanode level, there is no notion of whether files are temporary or important ones that need to be preserved. So a trash such as this can result in retaining a lot of tmp files, and deletes may not be able to free up storage within the cluster fast enough.
This is a great point. The proposed work (at least in the first phase) is intended as a best-effort optimization and will always yield to foreground workloads. The goal is to statistically reduce the likelihood and severity of data loss under typical storage consumption conditions. It's certainly still possible for a wave of tmp data to flush more important data out of DN trashes; we can design smarter eviction algorithms as future work.
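As a rough illustration of what a best-effort eviction policy could look like, the sketch below reclaims space by evicting the oldest trashed replicas first. All names (TrashedReplica, evictOldestFirst) are hypothetical; a real implementation would also delete the replica files from disk and could weight eviction by replica importance rather than age alone:

```java
import java.util.PriorityQueue;

/** Hypothetical sketch: oldest-first eviction of trashed replicas under space pressure. */
public class TrashEviction {

  /** A deleted replica lingering in the DN trash. */
  public static class TrashedReplica {
    public final String blockId;
    public final long trashedAtMs;
    public final long sizeBytes;

    public TrashedReplica(String blockId, long trashedAtMs, long sizeBytes) {
      this.blockId = blockId;
      this.trashedAtMs = trashedAtMs;
      this.sizeBytes = sizeBytes;
    }
  }

  /**
   * Evicts oldest replicas (smallest trashedAtMs first, per the queue's ordering)
   * until at least bytesNeeded have been reclaimed or the trash is empty.
   * Returns the number of bytes actually reclaimed.
   */
  public static long evictOldestFirst(PriorityQueue<TrashedReplica> trash, long bytesNeeded) {
    long reclaimed = 0;
    while (reclaimed < bytesNeeded && !trash.isEmpty()) {
      reclaimed += trash.poll().sizeBytes;
    }
    return reclaimed;
  }
}
```

Oldest-first is the simplest choice; a smarter variant could evict large tmp-heavy replicas first so a burst of intermediate data cannot flush out long-lived, important data.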
As I commented above, we are considering a more radical approach as a potential next phase of this work, where deletion-delayed replicas would simply be overwritten by incoming replicas. In that case we might not even need to count deletion-delayed replicas against the space quota, making the feature more transparent to admins.