Uploaded image for project: 'Hadoop HDFS'
  1. Hadoop HDFS
  2. HDFS-13904

ContentSummary does not always respect processing limit, resulting in long lock acquisitions

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • hdfs, namenode
    • None

    Description

      HDFS-4995 added a config dfs.content-summary.limit which allows for an administrator to set a limit on the number of entries processed during a single acquisition of the FSNamesystemLock during the creation of a content summary. This is useful to prevent very long (multiple seconds) pauses on the NameNode when getContentSummary is called on large directories.

      However, even on versions with HDFS-4995, we have seen warnings like:

      INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem read lock held for 9398 ms via
      java.lang.Thread.getStackTrace(Thread.java:1552)
      org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:950)
      org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.readUnlock(FSNamesystemLock.java:188)
      org.apache.hadoop.hdfs.server.namenode.FSNamesystem.readUnlock(FSNamesystem.java:1486)
      org.apache.hadoop.hdfs.server.namenode.ContentSummaryComputationContext.yield(ContentSummaryComputationContext.java:109)
      org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeDirectoryContentSummary(INodeDirectory.java:679)
      org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeContentSummary(INodeDirectory.java:642)
      org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeDirectoryContentSummary(INodeDirectory.java:656)
      

      happen quite consistently when getContentSummary was called on a large directory on a heavily-loaded NameNode. Such long pauses completely destroy the performance of the NameNode. We have the limit set to its default of 5000; if it was respected, clearly there would not be a 10-second pause.

      The current yield() code within ContentSummaryComputationContext looks like:

        public boolean yield() {
          // Are we set up to do this?
          if (limitPerRun <= 0 || dir == null || fsn == null) {
            return false;
          }
      
          // Have we reached the limit?
          long currentCount = counts.getFileCount() +
              counts.getSymlinkCount() +
              counts.getDirectoryCount() +
              counts.getSnapshotableDirectoryCount();
          if (currentCount <= nextCountLimit) {
            return false;
          }
      
          // Update the next limit
          nextCountLimit = currentCount + limitPerRun;
      
          boolean hadDirReadLock = dir.hasReadLock();
          boolean hadDirWriteLock = dir.hasWriteLock();
          boolean hadFsnReadLock = fsn.hasReadLock();
          boolean hadFsnWriteLock = fsn.hasWriteLock();
      
          // sanity check.
          if (!hadDirReadLock || !hadFsnReadLock || hadDirWriteLock ||
              hadFsnWriteLock || dir.getReadHoldCount() != 1 ||
              fsn.getReadHoldCount() != 1) {
            // cannot relinquish
            return false;
          }
      
          // unlock
          dir.readUnlock();
          fsn.readUnlock("contentSummary");
      
          try {
            Thread.sleep(sleepMilliSec, sleepNanoSec);
          } catch (InterruptedException ie) {
          } finally {
            // reacquire
            fsn.readLock();
            dir.readLock();
          }
          yieldCount++;
          return true;
        }
      

      We believe that this check in particular is the culprit:

          if (!hadDirReadLock || !hadFsnReadLock || hadDirWriteLock ||
              hadFsnWriteLock || dir.getReadHoldCount() != 1 ||
              fsn.getReadHoldCount() != 1) {
            // cannot relinquish
            return false;
          }
      

      The content summary computation will only relinquish the lock if it is currently the only holder of the lock. Given the high volume of read requests on a heavily loaded NameNode, especially when unfair locking is enabled, it is likely there may be another holder of the read lock performing some short-lived operation. By refusing to give up the lock in this case, the content summary computation ends up never relinquishing the lock.

      We propose to simply remove the readHoldCount checks from this yield(). This should alleviate the case described above by giving up the read lock and allowing other short-lived operations to complete (while the content summary thread sleeps) so that the lock can finally be given up completely. This has the drawback that sometimes, the content summary may give up the lock unnecessarily, if the read lock is never actually released by the time the thread continues again. The only negative impact from this is to make some large content summary operations slightly slower, with the tradeoff of reducing NameNode-wide performance impact.

      Attachments

        Activity

          People

            xkrogen Erik Krogen
            xkrogen Erik Krogen
            Votes:
            1 Vote for this issue
            Watchers:
            16 Start watching this issue

            Dates

              Created:
              Updated: