Type: New Feature
Affects Version/s: None
Fix Version/s: None
There are a number of problems with Trash that continue to result in permanent data loss for users. The primary reasons trash is not used:
- Trash is configured client-side and not enabled by default.
- Trash is shell-only. FileSystem, WebHDFS, HttpFs, etc never use trash.
- If trash fails, for example, because we can't create the trash directory or the move itself fails, trash is bypassed and the data is deleted.
Trash was designed as a feature to help end users via the shell, however in my experience the primary use of trash is to help administrators implement data retention policies (this was also the motivation for
HADOOP-7460). One could argue that (periodic read-only) snapshots are a better solution to this problem, however snapshots are not slated for Hadoop 2.x and trash is complimentary to snapshots (and backup) - eg you may create and delete data within your snapshot or backup window - so it makes sense to revisit trash's design. I think it's worth bringing trash's functionality in line with what users need.
I propose we enable trash on a per-filesystem basis and implement it server-side. Ie trash becomes an HDFS feature enabled by administrators. Because the trash emptier lives in HDFS and users already have a per-filesystem trash directory we're mostly there already. The design preference from
HADOOP-2514 was for trash to be implemented in "user code" however (a) in light of these problems, (b) we have a lot more user-facing APIs than the shell and (c) clients increasingly span file systems (via federation and symlinks) this design choice makes less sense. This is why we already use a per-filesystem trash/home directory instead of the user's client-configured one - otherwise trash would not work because renames can't span file systems.
In short, HDFS trash would work similarly to how it does today, the difference is that client delete APIs would result in a rename into trash (ala TrashPolicyDefault#moveToTrash) if trash is enabled. Like today it would be renamed to the trash directory on the file system where the file being removed resides. The primary difference is that enablement and policy are configured server-side by adminstrators and is used regardless of the API used to access the filesytem. The one execption to this is that I think we should continue to support the explict skipTrash shell option. The rationale for skipTrash (
HADOOP-6080) is that a move to trash may fail in cases where a rm may not, if a user has a home directory quota and does a rmr /tonsOfData, for example. Without a way to bypass this the user has no way (unless we revisit quotas, permissions or trash paths) to remove a directory they have permissions to remove without getting their quota adjusted by an admin. The skip trash API can be implemented by adding an explicit FileSystem API that bypasses trash and modifying the shell to use it when skipTrash is enabled. Given that users must explicitly specify skipTrash the API is less error prone. We could have the shell ask confirmation and annotate the API private to FsShell to discourage programatic use. This is not ideal but can be done compatibly (unlike redefining quotas, permissions or trash paths).
In terms of compatibility, while this proposal is technically an incompatible change (client side configuration that disables trash and uses skipTrash with a previous FsShell release will now both be ignored if server-side trash is enabled, and non-HDFS file systems would need to make similar changes) I think it's worth targeting for Hadoop 2.x given that the new semantics preserve the current semantics. In 2.x I think we should preserve FsShell based trash and support both it and server-side trash (defaults to disabled). For trunk/3.x I think we should remove the FsShell based trash entirely and enable server-side trash by default.