Hadoop HDFS / HDFS-173

Recursively deleting a directory with millions of files makes NameNode unresponsive for other commands until the deletion completes

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: namenode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Delete a directory with millions of files. This could take several minutes (12 minutes observed for 9 million files). While the operation is in progress, the FSNamesystem lock is held and requests from other clients are not handled until the deletion completes.

      Attachments

  1. HDFS-173.1.patch
        14 kB
        Suresh Srinivas
      2. HDFS-173.2.patch
        17 kB
        Suresh Srinivas
      3. HDFS-173.3.patch
        17 kB
        Suresh Srinivas
      4. HDFS-173.4.patch
        17 kB
        Suresh Srinivas
      5. HDFS-173.patch
        16 kB
        Suresh Srinivas


          Activity

          dhruba borthakur added a comment -

          Is it possible that FSNamesystem.removePathAndBlocks() is the major bottleneck? If so, we could perhaps rearrange the code so that the "freeing up resources" part runs outside the FSNamesystem lock.
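
          A minimal, self-contained sketch of that pattern (generic stand-in names and types, not the actual FSNamesystem code): collect what has to be freed while holding the lock, then do the expensive per-block cleanup after releasing it.

          import java.util.ArrayList;
          import java.util.List;
          import java.util.concurrent.locks.ReentrantLock;

          class FreeOutsideLockSketch {
              private final ReentrantLock namesystemLock = new ReentrantLock(); // stand-in for the FSNamesystem lock
              private final List<Long> subtreeBlocks = new ArrayList<>();       // stand-in for blocks under the deleted path

              void delete() {
                  List<Long> toFree;
                  namesystemLock.lock();
                  try {
                      // Cheap part under the lock: detach the subtree and grab its block list.
                      toFree = new ArrayList<>(subtreeBlocks);
                      subtreeBlocks.clear();
                  } finally {
                      namesystemLock.unlock();
                  }
                  // Expensive part outside the lock: other NameNode requests can proceed meanwhile.
                  for (Long block : toFree) {
                      // invalidate replicas, update maps, log, etc.
                  }
              }
          }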

          Suresh Srinivas added a comment -

          Here is the approach I am planning to take:
          Currently all the files and subdirectories are recursively deleted while holding the FSNamesystem lock. Deletion of a directory with a large number of files could take significant time. This results in other operations having to wait until the delete completes before performing any namenode operation.

          The proposed change is to perform the deletion incrementally. First the target directory to be deleted will be removed from the directory tree, to prevent further changes in that directory. Then the files and directories under it are deleted 1000 at a time, holding the FSNamesystem lock. Between two deletes, the FSNamesystem lock is relinquished to allow namenode operations from other clients. There is no sleep between deletes, as it could add up to a long deletion time when deleting a large directory. The client that issues the delete would block until the deletion is complete.
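
          A rough sketch of this incremental scheme, using hypothetical helper names rather than the real FSDirectory/FSNamesystem methods: unlink the target under the lock, then remove the collected inodes 1000 at a time, reacquiring the lock for each batch.

          import java.util.Collections;
          import java.util.List;

          class IncrementalDeleteSketch {
              private static final int BATCH = 1000;                 // batch size from the proposal
              private final Object fsNamesystemLock = new Object();  // stand-in for the FSNamesystem lock

              void delete(String target) {
                  List<String> inodes;
                  synchronized (fsNamesystemLock) {
                      // Unlink the target so no further changes can happen under it.
                      inodes = unlinkSubtree(target);
                  }
                  // Remove the collected inodes in batches; the lock is released between
                  // batches so other clients' operations can run. No sleep in between.
                  for (int i = 0; i < inodes.size(); i += BATCH) {
                      int end = Math.min(i + BATCH, inodes.size());
                      synchronized (fsNamesystemLock) {
                          removeInodes(inodes.subList(i, end));
                      }
                  }
                  // The caller (the client's delete RPC) blocks until this returns.
              }

              private List<String> unlinkSubtree(String target) { return Collections.emptyList(); } // hypothetical
              private void removeInodes(List<String> batch) { }                                     // hypothetical
          }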

          dhruba borthakur added a comment -

          The proposal sounds like a feasible idea.

          A related problem is that a Namenode server handler thread will remain busy during all this time. Is that a problem that needs to be solved too?

          Suresh Srinivas added a comment -

          Do we use a single thread or a thread pool for the Namenode server handlers? Given that we want to keep the client blocked on the delete call, even if we were to use a separate thread for deletion, I cannot think of a way to unblock the Namenode server handler to handle other requests.

          Koji Noguchi added a comment -

          A related problem is that a Namenode server handler thread will remain busy during all this time. Is that a problem that needs to be solved too?

          In our clusters, this isn't a problem, since it's either the Namenode's expunge thread or ops deleting the Trash.

          Another approach could be to split the large directory on the client side and delete a chunk at a time.
          (I'm assuming there's no contract for delete to be atomic.)
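
          For illustration, a client-side version of this could be built on the public FileSystem API alone (a sketch; it assumes the delete need not be atomic):

          import java.io.IOException;

          import org.apache.hadoop.fs.FileStatus;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          public class ChunkedClientDelete {
              // Delete each immediate child as a separate, smaller namenode operation,
              // then remove the (by now much smaller) directory itself.
              public static void deleteInChunks(FileSystem fs, Path dir) throws IOException {
                  for (FileStatus child : fs.listStatus(dir)) {
                      fs.delete(child.getPath(), true); // recursive delete of one child at a time
                  }
                  fs.delete(dir, true);
              }
          }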

          Suresh Srinivas added a comment -

          Currently it looks like the default number of threads used for Namenode RPC is 10. So one of the threads will block on the client's request.

          Raghu Angadi added a comment -

          > First the target directory to be deleted will be removed from the directory tree, to prevent further changes in that directory.

          Just trying to see if there are some consistency issues:

          What about other references to files under the tree? E.g. if a file is still being written, there are references to it from the lease manager; that could be an inconsistent view of the namesystem. I think this one can be handled.

          What about the log entry? Is the deletion logged at the beginning or when the whole deletion is completed? (At the beginning seems more accurate.) Can recursive deletes currently fail for any reason in the middle (and are those rolled back)?

          What is the suspect for taking so long? Is it mainly the memory allocations?

          Suresh Srinivas added a comment -

          The patch makes the following changes that need closer review:

          1. When a large directory is deleted, the target directory is unlinked from the directory structure first. The deletion is recorded immediately in the edits log (sketched below).
          2. The modification time (not sure if this is important) for deletion of subdirectories and files is the time when the delete operation was received.
          3. The number of inodes to delete at a time is arbitrarily chosen as 1000.
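
          A sketch of the ordering in item 1, with made-up method names (not the actual FSDirectory/FSEditLog signatures): the whole delete is recorded in the edits log as soon as the target is unlinked, before the incremental in-memory cleanup runs.

          class UnlinkAndLogSketch {
              private final Object fsNamesystemLock = new Object();       // stand-in for the FSNamesystem lock

              void deleteInternal(String src) {
                  synchronized (fsNamesystemLock) {
                      long mtime = System.currentTimeMillis();    // item 2: mtime = time the delete was received
                      unlinkSubtree(src, mtime);                  // item 1: detach the target from the directory tree
                      logDeleteToEditLog(src, mtime);             // item 1: record the whole delete in the edits log immediately
                  }
                  // Item 3: inodes and blocks are then cleaned up incrementally, 1000 at a time.
              }

              private void unlinkSubtree(String src, long mtime) { }      // hypothetical
              private void logDeleteToEditLog(String src, long mtime) { } // hypothetical
          }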

          What about other references to files under the tree? E.g. if a file is still being written, there are references to it from the lease manager; that could be an inconsistent view of the namesystem. I think this one can be handled.

          I need more information about this. I do not think this patch is handling it right, currently. I will talk to you.

          What about the log entry? Is the deletion logged at the beginning or when the whole deletion is completed?

          At the beginning

          Can recursive deletes currently fail for any reason in the middle (and are those rolled back)?

          The current code does not seem to roll back the change if a deletion fails for some reason in the middle. This may not be a problem, because deletion of blocks is done towards the end and hence no data is lost on failures.

          What is the suspect for taking so long? Is it mainly the memory allocations?

          On my test setup, deleting files takes a much shorter time than what was observed on the production cluster: 20 seconds for deleting 1 million files.

          Suresh Srinivas added a comment -

          Currently the following steps are taken for recursive deletion of directories:

          1. Remove the target inode from its parent. This recursively traverses the subtree to compute the quota used by the subtree, in order to update the parent node's quota.
          2. Traverse the subtree of the target node again, delete the inodes and collect the blocks.
          3. Remove leases for the target node and the subtree.
          4. Delete the blocks from the block map and corrupt map, and schedule each block for deletion at the datanodes by adding it to BlockManager.recentInvalidateSets.

          While adding to BlockManager.recentInvalidateSets in step 4 above, an info-level log is printed for every replica. In my testing, the deletion time for 1 million files with a block each came down from 15 seconds to 3 seconds when this log was commented out.

          In the previous patch there was concern that during incremental deletion, the file inodes from the target node are still referenced by the leases and by the blocks.

          Here is the proposed change:
          Step 1 - the FSNamesystem lock is held for a longer time here

          1. Remove the target inode from its parent. The two tree traversals from the previous implementation will be combined into one: computing the quota used by the subtree, deleting inodes and collecting blocks will be done in a single pass. While collecting blocks, the inode referenced by each block will be set to null.

          Step 2 - the lock is held for a shorter duration for incremental block deletion

          1. Over multiple iterations, the collected blocks will be deleted from the block map and corrupt map, and added to BlockManager.recentInvalidateSets.
          2. Logging of additions to the invalidate sets will be done outside the lock. Instead of printing a log line for every block replica, a single log line will be printed for all the replicas of a block (sketched below).
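
          A sketch of Step 2 and the logging change, again with made-up helper names rather than the real BlockManager API: blocks are removed from the maps in bounded batches under the lock, and one aggregated log line per block (listing its replicas) is emitted outside the lock.

          import java.util.List;
          import org.apache.commons.logging.Log;
          import org.apache.commons.logging.LogFactory;

          class IncrementalBlockRemovalSketch {
              private static final Log LOG = LogFactory.getLog(IncrementalBlockRemovalSketch.class);
              private static final int BATCH = 1000;                 // illustrative batch size
              private final Object fsNamesystemLock = new Object();  // stand-in for the FSNamesystem lock

              void removeCollectedBlocks(List<String> blocks) {
                  for (int i = 0; i < blocks.size(); i += BATCH) {
                      List<String> batch = blocks.subList(i, Math.min(i + BATCH, blocks.size()));
                      synchronized (fsNamesystemLock) {
                          for (String b : batch) {
                              removeFromBlockAndCorruptMaps(b);      // hypothetical
                              addToRecentInvalidateSets(b);          // hypothetical
                          }
                      }
                      // Logging happens outside the lock: one line per block, not one per replica.
                      for (String b : batch) {
                          LOG.info("delete: block " + b + " added to invalidate sets on replicas "
                                  + replicasOf(b));                  // hypothetical, made-up message format
                      }
                  }
              }

              private void removeFromBlockAndCorruptMaps(String b) { }
              private void addToRecentInvalidateSets(String b) { }
              private List<String> replicasOf(String b) { return java.util.Collections.emptyList(); }
          }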
          Raghu Angadi added a comment -

          The above looks better - simpler to show correctness.

          The key observation is that logging of each block deletion is the main culprit (80-90% of the time spent). The above method moves the block deletion outside the main deletion.

          With the patch, when someone deletes a directory tree with 10M files, it might still lock the NN on the order of a minute.

          Another approach could be to emulate a client's recursive deletion; this would not involve any semantic changes to NN internals.
          It would be something like:

          delete(dir) {
              lastDeleted = "";
              while (lastDeleted != dir) {
                  subTree = findSubTree(dir, max_children = 1000); // under lock
                  delete(subTree);                                 // under lock
                  lastDeleted = subTree;
              }
          }

          This essentially breaks one deletion into many deletions for large trees. It also implies there would be many deletion entries in the edit log, which is not an issue.

          findSubTree() would not iterate through the entire tree each time. It stops the depth-first traversal at an inner node if the number of files (or blocks) under it is larger than the given limit.
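
          A sketch of one plausible reading of findSubTree(), with a hypothetical Node type rather than the HDFS INode classes: descend while the subtree is still larger than the limit and return the first subtree that fits, so each delete(subTree) stays small.

          import java.util.ArrayList;
          import java.util.List;

          class FindSubTreeSketch {
              static class Node {
                  List<Node> children = new ArrayList<>();
                  long filesUnder;   // cached count of files (or blocks) in this subtree
              }

              // Return a subtree small enough to delete under one lock acquisition.
              static Node findSubTree(Node dir, long maxChildren) {
                  if (dir.filesUnder <= maxChildren || dir.children.isEmpty()) {
                      return dir;    // small enough (or a leaf): delete this whole subtree at once
                  }
                  // Too big: descend depth-first instead of expanding the whole tree.
                  return findSubTree(dir.children.get(0), maxChildren);
              }
          }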

          Suresh Srinivas added a comment -

          Additionally, the approach Raghu is proposing will have the following side effects compared to the previous solution:

          1. After a client issues a delete for a directory, the parts of the directory that are not yet deleted are still available for other clients to create files, modify files and perform other operations.
          2. If the namenode dies in the middle of the delete, the deletion of the directory is partial. When the namenode comes up, it will come up with partial directories. Not sure how a client can recover from this, other than by remembering that it had tried the delete and needs to delete the directory again.
          Suresh Srinivas added a comment -

          Attaching a patch as described in my proposal above. Two changes from the proposal:

          • Logging is still done with the lock held, to reduce the changes.
          • Added a sleep of 1 msec between freeing batches of 1000 blocks. Without this, in my test, other threads performing namenode operations could not get the lock frequently enough during deletion. The sleep time adds up to 10 sec for 10 million files.
          Raghu Angadi added a comment -

          A couple of comments:

          1. Deleting a file and deleting a directory take different code paths. Earlier it was the same. I think it is better to keep one code path; it will simplify the patch. We need only one of FSDirectory.unlinkDirectory() or FSDirectory.delete().
          2. The sleep is invoked before rather than after the block deletion. This results in one redundant sleep and limits the deletion rate.
            • How badly were the other threads starved without the sleep? Hopefully we don't need the sleep.
          Suresh Srinivas added a comment -

          Attaching a patch that incorporates the changes from the comments. The changes:

          1. Deleting a file and a directory now use the same code path.
          2. No sleep during block deletion, and related changes in the test to create more blocks during the test.
          Suresh Srinivas added a comment -

          Minor change - reused FSNamesystem.removePathAndBlock() and removed the newly added removePath(). Added some comments to clarify the code.

          Raghu Angadi added a comment -

          +1. Looks good. Thanks for the multiple iterations.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12418318/HDFS-173.3.patch
          against trunk revision 810337.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/5/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/5/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/5/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/5/console

          This message is automatically generated.

          Suresh Srinivas added a comment -

          In the test, dfs.block.size was not a multiple of io.bytes.per.checksum. This resulted in the cluster not starting up. Setting io.bytes.per.checksum appropriately.
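
          For illustration, the constraint the test tripped over (property names as in the comment, values arbitrary): dfs.block.size must be a multiple of io.bytes.per.checksum, otherwise the test cluster fails to start.

          import org.apache.hadoop.conf.Configuration;

          class ChecksumConfigExample {
              static Configuration testConf() {
                  Configuration conf = new Configuration();
                  int bytesPerChecksum = 512;
                  conf.setInt("io.bytes.per.checksum", bytesPerChecksum);
                  conf.setLong("dfs.block.size", 16L * bytesPerChecksum); // must be a multiple of io.bytes.per.checksum
                  return conf;
              }
          }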

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12418406/HDFS-173.4.patch
          against trunk revision 810504.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/7/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/7/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/7/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/7/console

          This message is automatically generated.

          Suresh Srinivas added a comment -

          Failing test TestBlocksWithNotEnoughRack is not related to this patch.

          Suresh Srinivas added a comment -

          I have committed the change.

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #13 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/13/)
          HDFS-173. Namenode will not block until a large directory deletion completes. It allows other operations while the deletion is in progress.

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #72 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk/72/)
          HDFS-173. Namenode will not block until a large directory deletion completes. It allows other operations while the deletion is in progress.

          Hudson added a comment -

          Integrated in Hdfs-Patch-h2.grid.sp2.yahoo.net #4 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/4/)

          Hudson added a comment -

          Integrated in Hdfs-Patch-h5.grid.sp2.yahoo.net #21 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/21/)


            People

            • Assignee: Suresh Srinivas
            • Reporter: Suresh Srinivas
            • Votes: 0
            • Watchers: 9
