Hadoop HDFS / HDFS-5790

LeaseManager.findPath is very slow when many leases need recovery

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.4.0
    • Component/s: namenode, performance
    • Labels: None
    • Hadoop Flags: Reviewed
    • Release Note: Committed to branch-2 and trunk.

      Description

      We recently saw an issue where the NN restarted while tens of thousands of files were open. The NN then ended up spending multiple seconds for each commitBlockSynchronization() call, spending most of its time inside LeaseManager.findPath(). findPath currently works by looping over all files held for a given writer, and traversing the filesystem for each one. This takes way too long when tens of thousands of files are open by a single writer.
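
       For context, here is a minimal sketch of the loop described above, condensed from the pre-patch LeaseManager code quoted later in this thread; it is illustrative rather than the exact source:

             // Pre-patch behavior (simplified): for each file needing lease recovery,
             // walk every path held by the writer and re-resolve it from the root.
             // Cost per call: O(paths held by the writer * directory depth).
             private String findPath(INodeFile pendingFile) {
               for (String src : paths) {                      // every open file of this writer
                 INode node = fsnamesystem.dir.getINode(src);  // full path resolution from "/"
                 if (node == pendingFile) {                    // found the matching lease path
                   return src;
                 }
               }
               return null;
             }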

      Attachments

      1. hdfs-5790.txt (4 kB) - Todd Lipcon
      2. hdfs-5790.txt (5 kB) - Todd Lipcon


          Activity

          Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Hdfs-trunk #1659 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1659/)
          HDFS-5790. LeaseManager.findPath is very slow when many leases need recovery. Contributed by Todd Lipcon. (todd: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1562970)

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/LeaseManager.java
          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk #1684 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1684/)
          HDFS-5790. LeaseManager.findPath is very slow when many leases need recovery. Contributed by Todd Lipcon. (todd: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1562970)

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/LeaseManager.java
          Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Yarn-trunk #467 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/467/)
          HDFS-5790. LeaseManager.findPath is very slow when many leases need recovery. Contributed by Todd Lipcon. (todd: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1562970)

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/LeaseManager.java
          Suresh Srinivas added a comment -

           I know that many HDFS restarts with running jobs that have many open files run into this issue. In the past I fixed a bug where the namenode did an editlog sync while holding the lock. Even with that fix, this issue slows down lease recovery, and the namenode becomes unresponsive during such restarts. That said, I am okay with not putting this into 2.3.

          Todd Lipcon added a comment -

          I was trying to follow the "blockers only" rule for 2.3. This is a bad problem when it happens, but it has been around for many years. So, given there is some risk associated with changing lease management, I figured I'd put it only in branch-2 and not 2.3.0. But, if you think it's worth it, I'm +0 on cherry-picking.

          Hudson added a comment -

          SUCCESS: Integrated in Hadoop-trunk-Commit #5075 (See https://builds.apache.org/job/Hadoop-trunk-Commit/5075/)
          HDFS-5790. LeaseManager.findPath is very slow when many leases need recovery. Contributed by Todd Lipcon. (todd: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1562970)

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/LeaseManager.java
          Suresh Srinivas added a comment -

          Todd Lipcon, should this go into 2.3 as well?

          Todd Lipcon added a comment -

           Thanks for the analysis, Kihwal. My logic was basically the same - glad to have it confirmed.

           Also, you're right - I'm pretty sure the "single writer" was NN_Recovery in the production case we saw as well, though it wasn't easy to verify (we don't appear to have any way to dump the LeaseManager state at runtime, which is a shame).

          I'll commit this in a day or two if no one has further comments.

          Kihwal Lee added a comment -

           I wondered why commitBlockSynchronization() sometimes takes so long, and this jira explains why. When the original lease holders disappear, the lease holders are changed to the namenode for block recovery. So if a lot of files get abandoned at around the same time, the NN will be the writer with a large number of open files.

           The patch looks good. The paths managed by LeaseManager are supposed to be updated on deletions and renames, so there is no point in searching them when the reference to the inode is already known. For all user-initiated calls, the inode is obtained using the user-supplied path and checkLease() is called before findPath(). So if something were to fail in findPath(), it would have failed earlier in the code path. The patch seems fine in terms of both consistency and correctness.

          +1
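
           To make the ordering Kihwal describes concrete, here is a rough sketch of a user-initiated code path; the method outline and field names are illustrative, not the actual FSNamesystem source:

                 // Hypothetical outline of a user-initiated call, per the reasoning above.
                 // The inode is resolved from the caller-supplied path and the lease is
                 // checked against that inode before findPath() ever runs, so searching
                 // the lease's path set again only repeats already-validated work.
                 void completeFileForClient(String src, String holder) throws IOException {
                   INodeFile pendingFile = (INodeFile) dir.getINode(src);  // resolve user-supplied path
                   checkLease(src, holder, pendingFile);                   // lease vs. inode already validated
                   // Pre-patch:  String path = leaseManager.findPath(pendingFile); // O(open files * depth)
                   // Post-patch: the known inode yields the path directly:
                   String path = pendingFile.getFullPathName();            // O(directory depth)
                   // ... continue with 'path' ...
                 }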

          Todd Lipcon added a comment -

           This patch has no new tests because it's just re-implementing an existing code path that is already covered by tests.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12624258/hdfs-5790.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/5933//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5933//console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12624229/hdfs-5790.txt
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5932//console

          This message is automatically generated.

          Todd Lipcon added a comment -

           The attached patch gets rid of the slow method and just calls INode.getFullPathName() everywhere it was used. My local test results above indicate this ought to work, and I don't see why it wouldn't, but I would appreciate some reviews.
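
           For intuition on why this replacement is cheaper, here is a self-contained sketch of the kind of parent-chain walk getFullPathName() performs; it is illustrative and not the HDFS implementation:

                 // Illustrative parent-chain walk: building the full path from an inode
                 // touches only its ancestors, so the cost is O(directory depth),
                 // independent of how many files the writer has open.
                 import java.util.ArrayDeque;
                 import java.util.Deque;

                 final class PathSketch {
                   interface Node {
                     Node getParent();       // null for the root
                     String getLocalName();  // name of this path component
                   }

                   static String fullPathOf(Node node) {
                     Deque<String> parts = new ArrayDeque<>();
                     for (Node cur = node; cur != null && cur.getParent() != null; cur = cur.getParent()) {
                       parts.addFirst(cur.getLocalName());  // collect names leaf-to-root
                     }
                     StringBuilder sb = new StringBuilder();
                     for (String p : parts) {
                       sb.append('/').append(p);
                     }
                     return sb.length() == 0 ? "/" : sb.toString();
                   }
                 }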

          Todd Lipcon added a comment -

           As a quick check of the above, I tried the following patch:

          diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/LeaseManager.java b/hadoop-hdfs-p
          index 8b5fb81..62e60da 100644
          --- a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/LeaseManager.java
          +++ b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/LeaseManager.java
          @@ -40,6 +40,7 @@
           import org.apache.hadoop.util.Daemon;
           
           import com.google.common.annotations.VisibleForTesting;
          +import com.google.common.base.Objects;
           import com.google.common.base.Preconditions;
           
           /**
          @@ -256,6 +257,16 @@ public boolean expiredSoftLimit() {
                * @return the path associated with the pendingFile and null if not found.
                */
               private String findPath(INodeFile pendingFile) {
          +      String retOrig = findPathOrig(pendingFile);
          +      String retNew = pendingFile.getFullPathName();
          +      if (!Objects.equal(retOrig, retNew)) {
          +        throw new AssertionError("orig implementation found: " + retOrig +
          +                                 " new implementation found: " + retNew);
          +      }
          +      return retNew;
          +    }
          +
          +    private String findPathOrig(INodeFile pendingFile) {
                 try {
                   for (String src : paths) {
                     INode node = fsnamesystem.dir.getINode(src);
          

           That is to say, it runs the suggested optimization alongside the original implementation and verifies that they return the same results. I ran all the HDFS tests and they all passed, indicating that this optimization likely wouldn't break anything. And it should be much faster, since it's O(directory depth) instead of O(number of leases held by the client * those leases' directory depths).

           Anyone have opinions on this? Kihwal Lee or Daryn Sharp maybe? (I seem to recall both of you working in this area a few months back.)

          Todd Lipcon added a comment -

          Looking at the code, it's not obvious why we need to do:

                src = leaseManager.findPath(pendingFile);
          

          as opposed to something like:

                src = pendingFile.getFullPathName();
          

           since in theory the inode path and the lease path should always be kept in sync. The same is true of the corresponding call in commitOrCompleteLastBlock().

          Todd Lipcon added a comment -

          The handler threads were spending most of their time in this stack trace:

          "IPC Server handler 1 on 8055" daemon prio=10 tid=0x00002ab5c87bc800 nid=0x71dc runnable [0x0000000056e93000]
             java.lang.Thread.State: RUNNABLE
                  at java.util.Collections.indexedBinarySearch(Collections.java:215)
                  at java.util.Collections.binarySearch(Collections.java:201)
                  at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.getChildINode(INodeDirectory.java:107)
                  at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.getExistingPathINodes(INodeDirectory.java:211)
                  at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.getNode(INodeDirectory.java:121)
                  at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.getNode(INodeDirectory.java:130)
                  at org.apache.hadoop.hdfs.server.namenode.FSDirectory.getINode(FSDirectory.java:1247)
                  at org.apache.hadoop.hdfs.server.namenode.FSDirectory.getFileINode(FSDirectory.java:1234)
                  at org.apache.hadoop.hdfs.server.namenode.LeaseManager$Lease.findPath(LeaseManager.java:256)
                  at org.apache.hadoop.hdfs.server.namenode.LeaseManager$Lease.access$300(LeaseManager.java:225)
                  at org.apache.hadoop.hdfs.server.namenode.LeaseManager.findPath(LeaseManager.java:186)
                  - locked <0x00002aaac5a38b48> (a org.apache.hadoop.hdfs.server.namenode.LeaseManager)
                  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchronization(FSNamesystem.java:3229)
                  at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.commitBlockSynchronization(NameNodeRpcServer.java:560)
          

           (Line numbers may be slightly off, since this is from an older release, but the code appears to still be structured approximately the same way in trunk.)


            People

             • Assignee: Todd Lipcon
             • Reporter: Todd Lipcon
             • Votes: 0
             • Watchers: 8
