Details

    • Type: New Feature
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Target Version/s:

      Description

      There are a number of problems with Trash that continue to result in permanent data loss for users. The primary reasons trash is not used:

      • Trash is configured client-side and not enabled by default.
      • Trash is shell-only. FileSystem, WebHDFS, HttpFs, etc. never use trash.
      • If trash fails, for example, because we can't create the trash directory or the move itself fails, trash is bypassed and the data is deleted.

      Trash was designed as a feature to help end users via the shell; however, in my experience the primary use of trash is to help administrators implement data retention policies (this was also the motivation for HADOOP-7460). One could argue that (periodic read-only) snapshots are a better solution to this problem; however, snapshots are not slated for Hadoop 2.x, and trash is complementary to snapshots (and backup), e.g. you may create and delete data within your snapshot or backup window, so it makes sense to revisit trash's design. I think it's worth bringing trash's functionality in line with what users need.

      I propose we enable trash on a per-filesystem basis and implement it server-side, i.e. trash becomes an HDFS feature enabled by administrators. Because the trash emptier lives in HDFS and users already have a per-filesystem trash directory, we're mostly there already. The design preference from HADOOP-2514 was for trash to be implemented in "user code"; however, given that (a) these problems persist, (b) we have a lot more user-facing APIs than the shell and (c) clients increasingly span file systems (via federation and symlinks), this design choice makes less sense. This is why we already use a per-filesystem trash/home directory instead of the user's client-configured one: otherwise trash would not work, because renames can't span file systems.

      In short, HDFS trash would work similarly to how it does today; the difference is that client delete APIs would result in a rename into trash (à la TrashPolicyDefault#moveToTrash) if trash is enabled. Like today, the file would be renamed into the trash directory on the file system where the file being removed resides. The primary difference is that enablement and policy are configured server-side by administrators and are applied regardless of the API used to access the filesystem.

      The one exception to this is that I think we should continue to support the explicit skipTrash shell option. The rationale for skipTrash (HADOOP-6080) is that a move to trash may fail in cases where an rm would not, for example if a user has a home directory quota and does a rmr /tonsOfData. Without a way to bypass trash, the user has no way (unless we revisit quotas, permissions or trash paths) to remove a directory they have permission to remove without getting their quota adjusted by an admin. The skip trash API can be implemented by adding an explicit FileSystem API that bypasses trash and modifying the shell to use it when skipTrash is enabled. Given that users must explicitly specify skipTrash, the API is less error prone. We could have the shell ask for confirmation and annotate the API private to FsShell to discourage programmatic use. This is not ideal but can be done compatibly (unlike redefining quotas, permissions or trash paths).
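
The delete semantics described above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not Hadoop code: the class ServerSideTrashSketch and the methods create, exists, delete and deleteSkipTrash are hypothetical names, and the trash location /user/<owner>/.Trash/Current is assumed to mirror the layout the shell-based trash uses today.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory model of a server-side delete; not Hadoop code. */
public class ServerSideTrashSketch {
    // path -> owner; stands in for the file system namespace
    private final Map<String, String> files = new HashMap<>();
    // Server-side switch set by an administrator; clients cannot override it
    private final boolean trashEnabled;

    public ServerSideTrashSketch(boolean trashEnabled) {
        this.trashEnabled = trashEnabled;
    }

    public void create(String path, String owner) {
        files.put(path, owner);
    }

    public boolean exists(String path) {
        return files.containsKey(path);
    }

    /** Regular delete, reached by every client API (shell, FileSystem, WebHDFS, ...). */
    public boolean delete(String path) {
        if (!files.containsKey(path)) {
            return false;
        }
        String owner = files.remove(path);
        if (trashEnabled) {
            // Rename into the owner's trash directory on the same file system
            files.put(trashPath(owner, path), owner);
        }
        return true;
    }

    /** Explicit bypass, intended only for the shell's skipTrash option. */
    public boolean deleteSkipTrash(String path) {
        return files.remove(path) != null;
    }

    /** Assumed layout, mirroring today's per-user trash directory. */
    static String trashPath(String owner, String path) {
        return "/user/" + owner + "/.Trash/Current" + path;
    }
}
```

In this model the caller never decides whether trash applies; only the explicit deleteSkipTrash call removes data permanently, which matches the intent of keeping the bypass visible and deliberate.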

      In terms of compatibility, while this proposal is technically an incompatible change (client-side configuration that disables trash, and use of skipTrash with a previous FsShell release, will now both be ignored if server-side trash is enabled, and non-HDFS file systems would need to make similar changes), I think it's worth targeting for Hadoop 2.x given that the new semantics preserve the current semantics. In 2.x I think we should preserve FsShell-based trash and support both it and server-side trash (disabled by default). For trunk/3.x I think we should remove the FsShell-based trash entirely and enable server-side trash by default.

        Issue Links

          Activity

          Eli Collins added a comment -

          Linking in some related issues.

          Eli Collins added a comment -

          Forgot to mention that the pluggable trash policy makes less sense server-side; it should probably be replaced with a delete hook in FsShell, since reasonable policies might want to do things that can't run in the NN.

          Aaron T. Myers added a comment -

          I think a quick and easy option for implementing "server-side trash" would be to just not require any configuration client-side. This could be done by adding an "isTrashEnabled" field to FsServerDefaults and changing the TrashPolicyDefault#isEnabled method to call FileSystem#getServerDefaults and return the value of isTrashEnabled. This won't result in a ton of extra RPCs to the NN, since the HDFS implementation of getServerDefaults is cached in the DFSClient and rate-limited to one call per hour. Though it obviously wouldn't cover the issue of APIs besides the FsShell, I think this would cover a very large portion of the user problems we currently see with trash.
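
The caching behaviour this suggestion relies on can be sketched as follows. This is an illustrative stand-in, not the DFSClient implementation: the class CachedTrashEnabled, its constructor parameters and the injected clock are all hypothetical, and the server lookup is modelled as a plain BooleanSupplier rather than a real getServerDefaults RPC.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

/** Hypothetical client-side cache for a server-provided "trash enabled" flag. */
public class CachedTrashEnabled {
    private final BooleanSupplier serverLookup; // stands in for an RPC to the server
    private final LongSupplier clockMillis;     // injected so the sketch is testable
    private final long validityMillis;          // e.g. one hour

    private boolean hasValue = false;
    private boolean cachedValue;
    private long fetchedAtMillis;

    public CachedTrashEnabled(BooleanSupplier serverLookup,
                              LongSupplier clockMillis,
                              long validityMillis) {
        this.serverLookup = serverLookup;
        this.clockMillis = clockMillis;
        this.validityMillis = validityMillis;
    }

    /** Returns the cached flag, refreshing it at most once per validity period. */
    public synchronized boolean isEnabled() {
        long now = clockMillis.getAsLong();
        if (!hasValue || now - fetchedAtMillis >= validityMillis) {
            cachedValue = serverLookup.getAsBoolean();
            fetchedAtMillis = now;
            hasValue = true;
        }
        return cachedValue;
    }
}
```

One consequence of this design is that a change to the server-side setting may take up to one validity period to reach clients that already hold a cached value.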

          Daryn Sharp added a comment -

          I haven't digested this jira, but be sure to keep viewfs in mind, i.e. FileSystem#getServerDefaults(Path) would be required to get the info for the right mount point.
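
To make the mount point concern concrete, here is a minimal sketch of a path-based lookup. It is not viewfs code: the class MountAwareTrashLookup, its methods and the longest-prefix resolution rule are assumptions made for illustration, with each mount point carrying its own trash setting.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical per-mount-point lookup of a trash setting; not viewfs code. */
public class MountAwareTrashLookup {
    // mount point -> whether the file system behind it has trash enabled
    private final Map<String, Boolean> trashEnabledByMount = new LinkedHashMap<>();

    public void addMount(String mountPoint, boolean trashEnabled) {
        trashEnabledByMount.put(mountPoint, trashEnabled);
    }

    /** Resolves the path to its longest matching mount point and returns that setting. */
    public boolean isTrashEnabled(String path) {
        String best = null;
        for (String mount : trashEnabledByMount.keySet()) {
            if (covers(mount, path) && (best == null || mount.length() > best.length())) {
                best = mount;
            }
        }
        if (best == null) {
            throw new IllegalArgumentException("No mount point for " + path);
        }
        return trashEnabledByMount.get(best);
    }

    /** True if path is the mount point itself or lies beneath it. */
    static boolean covers(String mount, String path) {
        String prefix = mount.endsWith("/") ? mount : mount + "/";
        return path.equals(mount) || path.startsWith(prefix);
    }
}
```

A lookup without a path argument could only answer for one file system, which is why the setting has to be resolved per path when a client view spans several.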

          Eli Collins added a comment -

          @ATM, I like your idea. Though it will only work with new clients (modified to check the server config) and FsShell, it seems like a reasonable approach for branch-2.

          Eli Collins added a comment -

          Filed HADOOP-8689 for v2 per ATM's suggestion, so I'm re-targeting this change for trunk/v3.


            People

            • Assignee:
              Eli Collins
              Reporter:
              Eli Collins
            • Votes:
              0
              Watchers:
              23
