Uploaded image for project: 'Hadoop Common'
  1. Hadoop Common
  2. HADOOP-12352

Delay in checkpointing Trash can leave trash for 2 intervals before deleting

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 2.7.1
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: trash
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Fixes an Trash related issue wherein a delay in the periodic checkpointing of one user's directory causes the subsequent user directory checkpoints to carry a newer timestamp, thereby delaying their eventual deletion.

      Description

      When the fs.trash.checkpoint.interval and the fs.trash.interval are set non-zero and the same, it is possible for trash to be left for two intervals.

      The TrashPolicyDefault will use a floor and ceiling function to ensure that the Trash will be checkpointed every "interval" of minutes.

      Each user's trash is checkpointed individually. The time resolution of the checkpoint timestamp is to the second.

      If the seconds switch while one user is checkpointing, then the next user's timestamp will be later.

      This will cause the next user's checkpoint to not be deleted at the next interval.

      I have recreated this in a lab cluster

      I also have a suggestion for a patch that I can upload later tonight after testing it further.

      1. HDFS-8118.patch
        2 kB
        Casey Brotherton
      2. HDFS-8118.001.patch
        2 kB
        Casey Brotherton

        Activity

        Hide
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #295 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/295/)
        HADOOP-12352. Delay in checkpointing Trash can leave trash for 2 intervals before deleting. Contributed by Casey Brotherton. (harsh: rev af78767870b8296886c03f8be24cf13a4e2bd4b0)

        • hadoop-common-project/hadoop-common/CHANGES.txt
        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/TrashPolicyDefault.java
        Show
        hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #295 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/295/ ) HADOOP-12352 . Delay in checkpointing Trash can leave trash for 2 intervals before deleting. Contributed by Casey Brotherton. (harsh: rev af78767870b8296886c03f8be24cf13a4e2bd4b0) hadoop-common-project/hadoop-common/CHANGES.txt hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/TrashPolicyDefault.java
        Hide
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #303 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/303/)
        HADOOP-12352. Delay in checkpointing Trash can leave trash for 2 intervals before deleting. Contributed by Casey Brotherton. (harsh: rev af78767870b8296886c03f8be24cf13a4e2bd4b0)

        • hadoop-common-project/hadoop-common/CHANGES.txt
        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/TrashPolicyDefault.java
        Show
        hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #303 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/303/ ) HADOOP-12352 . Delay in checkpointing Trash can leave trash for 2 intervals before deleting. Contributed by Casey Brotherton. (harsh: rev af78767870b8296886c03f8be24cf13a4e2bd4b0) hadoop-common-project/hadoop-common/CHANGES.txt hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/TrashPolicyDefault.java
        Hide
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Mapreduce-trunk #2252 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2252/)
        HADOOP-12352. Delay in checkpointing Trash can leave trash for 2 intervals before deleting. Contributed by Casey Brotherton. (harsh: rev af78767870b8296886c03f8be24cf13a4e2bd4b0)

        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/TrashPolicyDefault.java
        • hadoop-common-project/hadoop-common/CHANGES.txt
        Show
        hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Mapreduce-trunk #2252 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2252/ ) HADOOP-12352 . Delay in checkpointing Trash can leave trash for 2 intervals before deleting. Contributed by Casey Brotherton. (harsh: rev af78767870b8296886c03f8be24cf13a4e2bd4b0) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/TrashPolicyDefault.java hadoop-common-project/hadoop-common/CHANGES.txt
        Hide
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #308 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/308/)
        HADOOP-12352. Delay in checkpointing Trash can leave trash for 2 intervals before deleting. Contributed by Casey Brotherton. (harsh: rev af78767870b8296886c03f8be24cf13a4e2bd4b0)

        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/TrashPolicyDefault.java
        • hadoop-common-project/hadoop-common/CHANGES.txt
        Show
        hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #308 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/308/ ) HADOOP-12352 . Delay in checkpointing Trash can leave trash for 2 intervals before deleting. Contributed by Casey Brotherton. (harsh: rev af78767870b8296886c03f8be24cf13a4e2bd4b0) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/TrashPolicyDefault.java hadoop-common-project/hadoop-common/CHANGES.txt
        Hide
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Hdfs-trunk #2233 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2233/)
        HADOOP-12352. Delay in checkpointing Trash can leave trash for 2 intervals before deleting. Contributed by Casey Brotherton. (harsh: rev af78767870b8296886c03f8be24cf13a4e2bd4b0)

        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/TrashPolicyDefault.java
        • hadoop-common-project/hadoop-common/CHANGES.txt
        Show
        hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Hdfs-trunk #2233 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2233/ ) HADOOP-12352 . Delay in checkpointing Trash can leave trash for 2 intervals before deleting. Contributed by Casey Brotherton. (harsh: rev af78767870b8296886c03f8be24cf13a4e2bd4b0) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/TrashPolicyDefault.java hadoop-common-project/hadoop-common/CHANGES.txt
        Hide
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Yarn-trunk #1036 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/1036/)
        HADOOP-12352. Delay in checkpointing Trash can leave trash for 2 intervals before deleting. Contributed by Casey Brotherton. (harsh: rev af78767870b8296886c03f8be24cf13a4e2bd4b0)

        • hadoop-common-project/hadoop-common/CHANGES.txt
        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/TrashPolicyDefault.java
        Show
        hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Yarn-trunk #1036 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/1036/ ) HADOOP-12352 . Delay in checkpointing Trash can leave trash for 2 intervals before deleting. Contributed by Casey Brotherton. (harsh: rev af78767870b8296886c03f8be24cf13a4e2bd4b0) hadoop-common-project/hadoop-common/CHANGES.txt hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/TrashPolicyDefault.java
        Hide
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-trunk-Commit #8347 (See https://builds.apache.org/job/Hadoop-trunk-Commit/8347/)
        HADOOP-12352. Delay in checkpointing Trash can leave trash for 2 intervals before deleting. Contributed by Casey Brotherton. (harsh: rev af78767870b8296886c03f8be24cf13a4e2bd4b0)

        • hadoop-common-project/hadoop-common/CHANGES.txt
        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/TrashPolicyDefault.java
        Show
        hudson Hudson added a comment - FAILURE: Integrated in Hadoop-trunk-Commit #8347 (See https://builds.apache.org/job/Hadoop-trunk-Commit/8347/ ) HADOOP-12352 . Delay in checkpointing Trash can leave trash for 2 intervals before deleting. Contributed by Casey Brotherton. (harsh: rev af78767870b8296886c03f8be24cf13a4e2bd4b0) hadoop-common-project/hadoop-common/CHANGES.txt hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/TrashPolicyDefault.java
        Hide
        qwertymaniac Harsh J added a comment -

        Committed to branch-2 and trunk. Thank you for the find and fix Casey! Hope to see many more.

        Show
        qwertymaniac Harsh J added a comment - Committed to branch-2 and trunk. Thank you for the find and fix Casey! Hope to see many more.
        Hide
        qwertymaniac Harsh J added a comment -

        Thanks Casey,

        You can run individual tests locally via mvn test -Dtest=TestWebDelegationToken for example. The jenkins build cannot retrigger specific tests, but you can always check past/future builds to also inspect if the test has been generally flaky, and search JIRA/emails to see if this has/is already been reported/being worked on.

        It doesn't appear related to the behaviour fix we're making here, and the test does pass locally for me, so I'm committing this shortly.

        Show
        qwertymaniac Harsh J added a comment - Thanks Casey, You can run individual tests locally via mvn test -Dtest=TestWebDelegationToken for example. The jenkins build cannot retrigger specific tests, but you can always check past/future builds to also inspect if the test has been generally flaky, and search JIRA/emails to see if this has/is already been reported/being worked on. It doesn't appear related to the behaviour fix we're making here, and the test does pass locally for me, so I'm committing this shortly.
        Hide
        caseyjbrotherton Casey Brotherton added a comment -

        Issue in the common tests is a bind issue in the unit test: TestWebDelegationToken was a bind issue:

        java.net.BindException: Address already in use

        Is there a way to trigger the tests again to see if they clear independently? That would provide some indication whether there is a timing issue with the TestWebDelegationToken test.

        Thanks,
        Casey

        Show
        caseyjbrotherton Casey Brotherton added a comment - Issue in the common tests is a bind issue in the unit test: TestWebDelegationToken was a bind issue: java.net.BindException: Address already in use Is there a way to trigger the tests again to see if they clear independently? That would provide some indication whether there is a timing issue with the TestWebDelegationToken test. Thanks, Casey
        Hide
        hadoopqa Hadoop QA added a comment -



        -1 overall



        Vote Subsystem Runtime Comment
        0 pre-patch 17m 9s Pre-patch trunk compilation is healthy.
        +1 @author 0m 0s The patch does not contain any @author tags.
        -1 tests included 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
        +1 javac 7m 52s There were no new javac warning messages.
        +1 javadoc 9m 45s There were no new javadoc warning messages.
        +1 release audit 0m 23s The applied patch does not increase the total number of release audit warnings.
        +1 checkstyle 1m 7s There were no new checkstyle issues.
        +1 whitespace 0m 0s The patch has no lines that end in whitespace.
        +1 install 1m 28s mvn install still works.
        +1 eclipse:eclipse 0m 34s The patch built with eclipse:eclipse.
        +1 findbugs 1m 56s The patch does not introduce any new Findbugs (version 3.0.0) warnings.
        -1 common tests 22m 16s Tests failed in hadoop-common.
            62m 33s  



        Reason Tests
        Failed unit tests hadoop.security.token.delegation.web.TestWebDelegationToken



        Subsystem Report/Notes
        Patch URL http://issues.apache.org/jira/secure/attachment/12750026/HDFS-8118.001.patch
        Optional Tests javadoc javac unit findbugs checkstyle
        git revision trunk / feaf034
        hadoop-common test log https://builds.apache.org/job/PreCommit-HDFS-Build/12088/artifact/patchprocess/testrun_hadoop-common.txt
        Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/12088/testReport/
        Java 1.7.0_55
        uname Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Console output https://builds.apache.org/job/PreCommit-HDFS-Build/12088/console

        This message was automatically generated.

        Show
        hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 pre-patch 17m 9s Pre-patch trunk compilation is healthy. +1 @author 0m 0s The patch does not contain any @author tags. -1 tests included 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac 7m 52s There were no new javac warning messages. +1 javadoc 9m 45s There were no new javadoc warning messages. +1 release audit 0m 23s The applied patch does not increase the total number of release audit warnings. +1 checkstyle 1m 7s There were no new checkstyle issues. +1 whitespace 0m 0s The patch has no lines that end in whitespace. +1 install 1m 28s mvn install still works. +1 eclipse:eclipse 0m 34s The patch built with eclipse:eclipse. +1 findbugs 1m 56s The patch does not introduce any new Findbugs (version 3.0.0) warnings. -1 common tests 22m 16s Tests failed in hadoop-common.     62m 33s   Reason Tests Failed unit tests hadoop.security.token.delegation.web.TestWebDelegationToken Subsystem Report/Notes Patch URL http://issues.apache.org/jira/secure/attachment/12750026/HDFS-8118.001.patch Optional Tests javadoc javac unit findbugs checkstyle git revision trunk / feaf034 hadoop-common test log https://builds.apache.org/job/PreCommit-HDFS-Build/12088/artifact/patchprocess/testrun_hadoop-common.txt Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/12088/testReport/ Java 1.7.0_55 uname Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Console output https://builds.apache.org/job/PreCommit-HDFS-Build/12088/console This message was automatically generated.
        Hide
        qwertymaniac Harsh J added a comment -

        I re-looked at the change and the problem, and although this can be difficult to test, the change does certainly fix the described changing-timestamp behaviour.

        +1, will commit after verifying Jenkins result on the newer patch. Marking as Patch Available to trigger the build.

        Show
        qwertymaniac Harsh J added a comment - I re-looked at the change and the problem, and although this can be difficult to test, the change does certainly fix the described changing-timestamp behaviour. +1, will commit after verifying Jenkins result on the newer patch. Marking as Patch Available to trigger the build.
        Hide
        caseyjbrotherton Casey Brotherton added a comment -

        This is a simplified patch addressing only the defect, and not the testcases.

        Show
        caseyjbrotherton Casey Brotherton added a comment - This is a simplified patch addressing only the defect, and not the testcases.
        Hide
        caseyjbrotherton Casey Brotherton added a comment -

        Talked with Harsh about this offline.

        Still working through a testcase. We have talked about a method for delaying between the different checkpoints.
        There are still difficulties, though , as the problem occurs when two different users delete files, and there is a delay between checkpointing each user's deletes.

        However, TestTrash operates under the assumption that there is only one user that is creating files, and removing them for the trash.
        ( For example, the shell is used to getCurrentTrashDirectory, and that will only return one location. For a test, I will either need to
        break Object Oriented walls, and create a path with users/d/.Trash/Current, or create a test that doesn't really test for the issue. )

        Still trying to work on this.

        Show
        caseyjbrotherton Casey Brotherton added a comment - Talked with Harsh about this offline. Still working through a testcase. We have talked about a method for delaying between the different checkpoints. There are still difficulties, though , as the problem occurs when two different users delete files, and there is a delay between checkpointing each user's deletes. However, TestTrash operates under the assumption that there is only one user that is creating files, and removing them for the trash. ( For example, the shell is used to getCurrentTrashDirectory, and that will only return one location. For a test, I will either need to break Object Oriented walls, and create a path with users/d/.Trash/Current, or create a test that doesn't really test for the issue. ) Still trying to work on this.
        Hide
        qwertymaniac Harsh J added a comment -

        Thanks for explaining that Casey. It makes sense to constant-ise the checkpoint date for uniformity - and the fix for this looks alright to me.

        It also may make sense that people want to set checkpoint intervals equal to the trash intervals. I think we can remove the change in the patch of capping it to 1/2 the value of intervals, but just add a small doc note in hdfs-default.xml to the trash checkpoint period property on what the behaviour could end up being if its set to equal of the trash clearing interval.

        Would it also be possible to come up with a test-case for this? For example, load some files into trash such that multiple dirs need to be checkpointed, and issue a checkpoint (or await its lowered interval) and ensure only one date is observed before clearing occurs? It would help avoid regressions in future, just in case.

        Show
        qwertymaniac Harsh J added a comment - Thanks for explaining that Casey. It makes sense to constant-ise the checkpoint date for uniformity - and the fix for this looks alright to me. It also may make sense that people want to set checkpoint intervals equal to the trash intervals. I think we can remove the change in the patch of capping it to 1/2 the value of intervals, but just add a small doc note in hdfs-default.xml to the trash checkpoint period property on what the behaviour could end up being if its set to equal of the trash clearing interval. Would it also be possible to come up with a test-case for this? For example, load some files into trash such that multiple dirs need to be checkpointed, and issue a checkpoint (or await its lowered interval) and ensure only one date is observed before clearing occurs? It would help avoid regressions in future, just in case.
        Hide
        caseyjbrotherton Casey Brotherton added a comment -

        These are changes to fix the checkpoint for all users to the same checkpoint time.

        I have tested those changes locally. There are also changes to change the checkpoint to be a maximum of 1/2 of the deletion interval.

        I have not yet tested those changes, but will be doing so over the weekend.
        Let me know if that is what you were thinking about.

        Thanks,
        Casey

        Show
        caseyjbrotherton Casey Brotherton added a comment - These are changes to fix the checkpoint for all users to the same checkpoint time. I have tested those changes locally. There are also changes to change the checkpoint to be a maximum of 1/2 of the deletion interval. I have not yet tested those changes, but will be doing so over the weekend. Let me know if that is what you were thinking about. Thanks, Casey
        Hide
        caseyjbrotherton Casey Brotherton added a comment -

        Hello Harsh,

        Thanks for the comment.

        The sleeping time is based on fs.trash.checkpoint.interval.

        There is code to change the checkpoint interval to not be larger than the deletion interval in the constructor.
        It could be changed to ensure that the checkpoint is no more than half the deletion interval.

        One thing to consider is that currently the maximum time that you can recover a deleted file from trash is a fs.trash.checkpoint.interval + fs.trash.interval.

        ( With both trash defaults set to 24 hours, deleting a file one minute after midnight, that file gets placed into a checkpoint 24 hours later,
        and the checkpoint is deleted 24 hours later. )

        Thanks,
        Casey

        Show
        caseyjbrotherton Casey Brotherton added a comment - Hello Harsh, Thanks for the comment. The sleeping time is based on fs.trash.checkpoint.interval. There is code to change the checkpoint interval to not be larger than the deletion interval in the constructor. It could be changed to ensure that the checkpoint is no more than half the deletion interval. One thing to consider is that currently the maximum time that you can recover a deleted file from trash is a fs.trash.checkpoint.interval + fs.trash.interval. ( With both trash defaults set to 24 hours, deleting a file one minute after midnight, that file gets placed into a checkpoint 24 hours later, and the checkpoint is deleted 24 hours later. ) Thanks, Casey
        Hide
        qwertymaniac Harsh J added a comment -

        Does it make sense to not let the checkpoint time be equal to the clearing interval though? I'm not certain what benefit comes out of it anyway.

        Show
        qwertymaniac Harsh J added a comment - Does it make sense to not let the checkpoint time be equal to the clearing interval though? I'm not certain what benefit comes out of it anyway.

          People

          • Assignee:
            caseyjbrotherton Casey Brotherton
            Reporter:
            caseyjbrotherton Casey Brotherton
          • Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development