  1. Hadoop Common
  2. HADOOP-6761

Improve Trash Emptier


    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.22.0
    • Component/s: None
    • Labels: None


      There are two inefficiencies in the Trash functionality right now that have caused some problems for us.
      First, if you configure your trash interval to be one day (24 hours), you eventually store up to two days' worth of data: the Current directory plus the previous timestamped checkpoint, which is not deleted until the end of the following interval.
      The second problem is that a lot of data accumulates in Trash before the Emptier wakes up. If a couple of million files have been trashed and the Emptier deletes them on HDFS, the NameNode freezes until everything is removed. (This particular problem will hopefully be addressed by HDFS-1143.)

      My proposal is to have two configuration intervals: one for deleting the trashed data and another for checkpointing. For example, with intervals of one day and one hour, we would store at most 25 hours of data instead of the current 48, and deletions would happen in smaller chunks every hour of the day instead of as one huge deletion at the end of the day.
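
      The arithmetic behind the 48-hour and 25-hour figures can be sketched as follows. This is only an illustrative model, not code from the patch: the class `TrashRetention` and its methods are hypothetical names, and the model assumes the worst case is simply the deletion interval plus one checkpoint interval.

```java
/** Hypothetical helper (not part of Hadoop): models worst-case Trash retention. */
final class TrashRetention {
    private TrashRetention() {}

    /**
     * Worst-case amount of trashed data kept, in minutes.
     * Assumption: a checkpoint is kept for one full deletion interval, and
     * Current accumulates for up to one checkpoint interval on top of that.
     */
    static long maxRetainedMinutes(long deletionIntervalMin, long checkpointIntervalMin) {
        return deletionIntervalMin + checkpointIntervalMin;
    }

    /** Number of deletion batches the Emptier performs per deletion interval. */
    static long deletionBatches(long deletionIntervalMin, long checkpointIntervalMin) {
        return deletionIntervalMin / checkpointIntervalMin;
    }

    public static void main(String[] args) {
        long day = 24 * 60;  // minutes
        long hour = 60;      // minutes

        // Single interval (today's behavior): checkpoint and delete once a day.
        System.out.println(maxRetainedMinutes(day, day) / 60);   // 48 hours
        // Proposed: delete after one day, checkpoint every hour.
        System.out.println(maxRetainedMinutes(day, hour) / 60);  // 25 hours
        // Deletions are spread over this many smaller batches per day.
        System.out.println(deletionBatches(day, hour));          // 24
    }
}
```

      Under this model, shortening the checkpoint interval both lowers the peak amount of retained data and splits one large deletion into many smaller ones.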


        1. HADOOP-6761.patch
          4 kB
          Dmytro Molkov
        2. HADOOP-6761.5.patch
          11 kB
          Dmytro Molkov
        3. HADOOP-6761.4.patch
          11 kB
          Dmytro Molkov
        4. HADOOP-6761.3.patch
          11 kB
          Dmytro Molkov
        5. HADOOP-6761.2.patch
          11 kB
          Dmytro Molkov




            Assignee: Dmytro Molkov (dms)
            Reporter: Dmytro Molkov (dms)
            Votes: 0
            Watchers: 6



