Hadoop HDFS / HDFS-12866

Recursive delete of a large directory or snapshot makes namenode unresponsive


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Component: hdfs

    Description

      Currently, file/directory deletion happens in two steps (see FSNamesystem#delete(String src, boolean recursive, boolean logRetryCache)):

      1. Do the following under the fsn write lock and release the lock afterwards:
        • 1.1 recursively traverse the target, collect INodes and all blocks to be deleted
        • 1.2 delete all INodes
      2. Delete the blocks to be deleted incrementally, chunk by chunk. That is, in a loop, do:
        • acquire fsn write lock,
        • delete a chunk of blocks,
        • release fsn write lock

      The deletion is broken into two steps so that the fsn write lock is not held for too long, which would make the NN unresponsive. However, even with this, when deleting a large directory, or a snapshot with a lot of content, step 1 itself can take a long time, thus still holding the fsn write lock for too long and making the NN unresponsive.

      A possible solution would be to add one more sub-step to step 1, and only hold the fsn write lock in sub-step 1.1:

      • 1.1 hold the fsn write lock, disconnect the target to be deleted from its parent dir, release the lock
      • 1.2 recursively traverse the target, collect INodes and all blocks to be deleted
      • 1.3 delete all INodes

      Then do step 2.
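
      The restructured step 1 can be sketched like this. It is only an illustration of the idea under simplified assumptions: the INode type here is a bare stand-in for the real inode structures, a ReentrantReadWriteLock stands in for the fsn lock, and "disconnect" is modeled as unlinking the subtree root from its parent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the detach-first delete: only sub-step 1.1 (unlinking the
// subtree from its parent) runs under the write lock; the expensive
// recursive traversal (1.2) runs after the lock is released.
public class DetachFirstDeleter {
    static class INode {                      // simplified stand-in type
        final String name;
        INode parent;
        final List<INode> children = new ArrayList<>();
        INode(String name) { this.name = name; }
    }

    private final ReentrantReadWriteLock fsnLock = new ReentrantReadWriteLock();

    public List<INode> delete(INode target) {
        // 1.1: under the write lock, just unlink the subtree root.
        fsnLock.writeLock().lock();
        try {
            target.parent.children.remove(target);
            target.parent = null;             // the subtree is now unreachable
        } finally {
            fsnLock.writeLock().unlock();
        }
        // 1.2: traverse the detached subtree without holding the lock.
        List<INode> collected = new ArrayList<>();
        collect(target, collected);
        return collected;                     // 1.3 and step 2 would free these
    }

    private void collect(INode node, List<INode> out) {
        out.add(node);
        for (INode child : node.children) {
            collect(child, out);
        }
    }
}
```

      Because the unlink is O(1), the lock hold time in sub-step 1.1 no longer depends on the size of the deleted tree.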

      This means any operation on any file/dir needs to check whether one of its ancestors has been deleted (i.e., disconnected), similar to what's done in the FSNamesystem#isFileDeleted method.
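
      A hypothetical version of that ancestor check, assuming the "disconnected" state is represented by a null parent pointer (the Node type and isDeleted method are illustrative, not Hadoop's actual API):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative check in the spirit of FSNamesystem#isFileDeleted: walk up
// the parent chain; hitting a null parent before reaching the root means
// some ancestor was disconnected by a concurrent delete.
public class AncestorCheck {
    static class Node {                       // simplified stand-in type
        Node parent;
        final List<Node> children = new ArrayList<>();
        Node addChild() {
            Node c = new Node();
            c.parent = this;
            children.add(c);
            return c;
        }
    }

    static boolean isDeleted(Node node, Node root) {
        for (Node cur = node; cur != root; cur = cur.parent) {
            if (cur.parent == null) {
                return true;                  // an ancestor has been detached
            }
        }
        return false;                         // reached the root: still connected
    }
}
```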

      I'm throwing this thought out here for further discussion. Comments and inputs are welcome.

    Attachments

    Issue Links

    Activity

    People

      Assignee: Unassigned
      Reporter: Yongjun Zhang (yzhangal)
      Votes: 1
      Watchers: 20

    Dates

      Created:
      Updated:
      Resolved: