Hadoop HDFS / HDFS-2053

Bug in INodeDirectory#computeContentSummary warning


Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.20.3, 0.20.204.0, 0.20.205.0
    • Fix Version/s: 0.20.205.0, 0.23.0
    • Component/s: namenode
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      How to reproduce

      # create test directories
      $ hadoop fs -mkdir /hdfs-1377/A
      $ hadoop fs -mkdir /hdfs-1377/B
      $ hadoop fs -mkdir /hdfs-1377/C
      
      # ...add some test data (few kB or MB) to all three dirs...
      
      # set space quota for subdir C only
      $ hadoop dfsadmin -setSpaceQuota 1g /hdfs-1377/C
      
      # the following two commands _on the parent dir_ trigger the warning
      $ hadoop fs -dus /hdfs-1377
      $ hadoop fs -count -q /hdfs-1377
      

      Warning message in the namenode logs:

      2011-06-09 09:42:39,817 WARN org.apache.hadoop.hdfs.server.namenode.NameNode: Inconsistent diskspace for directory C. Cached: 433872320 Computed: 438465355
      

      Note that the commands are run on the parent directory, but the warning is logged for the subdirectory that has the space quota.

      Background
      The bug was introduced by the HDFS-1377 patch, which is currently committed to at least branch-0.20, branch-0.20-security, branch-0.20-security-204, branch-0.20-security-205 and release-0.20.3-rc2. In the patch, src/hdfs/org/apache/hadoop/hdfs/server/namenode/INodeDirectory.java was updated to log the warning above whenever the cached and computed diskspace values of a directory with a quota differ.

      The warning is written by computeContentSummary(long[] summary) in INodeDirectory. The method walks an inode's children recursively, passing the summary array down and updating it along the way.

        /** {@inheritDoc} */
        long[] computeContentSummary(long[] summary) {
          if (children != null) {
            for (INode child : children) {
              child.computeContentSummary(summary);
            }
          }
      

      The condition that triggers the warning message compares the current node's cached diskspace (via node.diskspaceConsumed()) with the corresponding field in summary.

            if (-1 != node.getDsQuota() && space != summary[3]) {
              NameNode.LOG.warn("Inconsistent diskspace for directory "
                +getLocalName()+". Cached: "+space+" Computed: "+summary[3]);
      

      However, summary may already include diskspace information from other inodes at this point, i.e. from subtrees other than the one rooted at the node for which the warning is shown. In our example for the tree at /hdfs-1377, summary can already contain information from /hdfs-1377/A and /hdfs-1377/B when it is passed to inode /hdfs-1377/C. Hence the computed value for C can differ from its cached value even when the cached value is correct, and the warning is then a false positive.
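      To make the failure mode concrete, here is a minimal Java sketch of a simplified, hypothetical model of the pre-fix walk; it is not the real INode/INodeDirectory code. The names SharedSummaryDemo, Node and runRepro, the file names and the byte sizes are made up for illustration, warnings are collected in a list instead of being written to NameNode.LOG, and the index layout (1 = file count, 2 = directory count, 3 = diskspace) is an assumption; only index 3 is taken from the excerpt above. C's cached value is correct, yet walking from /hdfs-1377 reports it as inconsistent because A and B were already counted into the same array.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of the pre-fix walk (not the real INode classes). */
public class SharedSummaryDemo {

  static class Node {
    final String name;
    final boolean isDir;
    final long fileSpace;   // diskspace of a file; 0 for directories
    final long dsQuota;     // -1 means "no space quota"
    long cachedSpace;       // diskspace cached by a directory with a quota
    final List<Node> children = new ArrayList<>();

    Node(String name, boolean isDir, long fileSpace, long dsQuota) {
      this.name = name;
      this.isDir = isDir;
      this.fileSpace = fileSpace;
      this.dsQuota = dsQuota;
    }

    Node add(Node child) {
      children.add(child);
      return this;
    }

    // Pre-fix logic: one summary array is threaded through the whole walk,
    // so summary[3] may already hold diskspace from sibling subtrees.
    long[] computeContentSummary(long[] summary, List<String> warnings) {
      if (!isDir) {
        summary[1]++;               // file count
        summary[3] += fileSpace;    // diskspace
        return summary;
      }
      for (Node child : children) {
        child.computeContentSummary(summary, warnings);
      }
      summary[2]++;                 // directory count
      if (-1 != dsQuota && cachedSpace != summary[3]) {
        warnings.add("Inconsistent diskspace for directory " + name
            + ". Cached: " + cachedSpace + " Computed: " + summary[3]);
      }
      return summary;
    }
  }

  /** Builds /hdfs-1377 with A, B and a quota-limited C, then walks it. */
  static List<String> runRepro() {
    Node c = new Node("C", true, 0, 1L << 30);     // 1g space quota, as in the repro
    c.add(new Node("c.dat", false, 300, -1));
    c.cachedSpace = 300;                            // C's cache is correct
    Node root = new Node("hdfs-1377", true, 0, -1)
        .add(new Node("A", true, 0, -1).add(new Node("a.dat", false, 100, -1)))
        .add(new Node("B", true, 0, -1).add(new Node("b.dat", false, 200, -1)))
        .add(c);
    List<String> warnings = new ArrayList<>();
    root.computeContentSummary(new long[4], warnings);
    return warnings;
  }

  public static void main(String[] args) {
    runRepro().forEach(System.out::println);
    // prints: Inconsistent diskspace for directory C. Cached: 300 Computed: 600
  }
}
```

      The 600 bytes "computed" for C are the 100 + 200 bytes of A and B plus the 300 bytes actually under C, which mirrors how the real warning mixes sibling subtrees into C's total.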

      How to fix

      The supplied patch creates a fresh summary array for the subtree rooted at the current node. The walk through the children passes and updates this subtreeSummary array, and the condition is checked against subtreeSummary instead of the original summary. Before the method returns, the values in subtreeSummary are added to the original summary.
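      A minimal Java sketch of the fix as described above, using the same kind of hypothetical, simplified model; SubtreeSummaryFix, Node and buildTree are illustrative names, not the actual patch, and warnings again go to a list rather than NameNode.LOG. Each directory accumulates its own subtree into a fresh subtreeSummary array, checks its quota against that array, and then adds the subtree totals into the caller's summary.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of the fixed walk (not the actual patch). */
public class SubtreeSummaryFix {

  static class Node {
    final String name;
    final boolean isDir;
    final long fileSpace;   // diskspace of a file; 0 for directories
    final long dsQuota;     // -1 means "no space quota"
    long cachedSpace;       // diskspace cached by a directory with a quota
    final List<Node> children = new ArrayList<>();

    Node(String name, boolean isDir, long fileSpace, long dsQuota) {
      this.name = name;
      this.isDir = isDir;
      this.fileSpace = fileSpace;
      this.dsQuota = dsQuota;
    }

    Node add(Node child) {
      children.add(child);
      return this;
    }

    long[] computeContentSummary(long[] summary, List<String> warnings) {
      if (!isDir) {
        summary[1]++;               // file count
        summary[3] += fileSpace;    // diskspace
        return summary;
      }
      // Fresh array: accumulates only the subtree rooted at this directory.
      long[] subtreeSummary = new long[summary.length];
      for (Node child : children) {
        child.computeContentSummary(subtreeSummary, warnings);
      }
      subtreeSummary[2]++;          // count this directory
      // The quota check now sees this subtree only, not earlier siblings.
      if (-1 != dsQuota && cachedSpace != subtreeSummary[3]) {
        warnings.add("Inconsistent diskspace for directory " + name
            + ". Cached: " + cachedSpace + " Computed: " + subtreeSummary[3]);
      }
      // Fold the subtree totals into the caller's running summary.
      for (int i = 0; i < summary.length; i++) {
        summary[i] += subtreeSummary[i];
      }
      return summary;
    }
  }

  /** Builds /hdfs-1377 as in the repro; C caches {@code cachedForC} bytes. */
  static Node buildTree(long cachedForC) {
    Node c = new Node("C", true, 0, 1L << 30);     // 1g space quota, as in the repro
    c.add(new Node("c.dat", false, 300, -1));
    c.cachedSpace = cachedForC;
    return new Node("hdfs-1377", true, 0, -1)
        .add(new Node("A", true, 0, -1).add(new Node("a.dat", false, 100, -1)))
        .add(new Node("B", true, 0, -1).add(new Node("b.dat", false, 200, -1)))
        .add(c);
  }

  public static void main(String[] args) {
    List<String> warnings = new ArrayList<>();
    long[] summary = buildTree(300).computeContentSummary(new long[4], warnings);
    System.out.println("warnings=" + warnings + " diskspace=" + summary[3]);
    // prints: warnings=[] diskspace=600
  }
}
```

      With C's cache set to 300 (the true size of its subtree), no warning is produced and the totals for /hdfs-1377 are still 3 files, 4 directories and 600 bytes; with a deliberately stale cache of 299, the walk still reports "Cached: 299 Computed: 300". This sketch allocates one extra array per directory; restricting the allocation to directories with a quota would be a possible refinement that it does not attempt.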

      Unit Tests

      I have run "ant test" on my patched build without any errors*. However, the existing unit tests did not catch this issue with the original HDFS-1377 patch, so this may not mean much.

      That said, I am unsure what the most appropriate way to unit test this issue would be. A straightforward approach would be to automate the steps in the "How to reproduce" section above and check whether the NN logs an incorrect warning message, but I'm not sure how such a check could be implemented. Feel free to provide some pointers if you have ideas.

      Note about Fix Version/s

      The patch should apply to all branches to which the HDFS-1377 patch has been committed. In my environment, the build was the Hadoop 0.20.203.0 release with a (trivial) backport of HDFS-1377 (the 0.20.203.0 release does not ship with the HDFS-1377 fix). For instance, I successfully applied the patch to branch-0.20-security, branch-0.20-security-204 and release-0.20.3-rc2. Since I'm a bit confused about the upcoming 0.20.x release versions (0.20.x vs. 0.20.20x.y), I took the liberty of adding 0.20.203.0 to the list of affected versions, even though it is only affected when HDFS-1377 is applied to it.

      Best,
      Michael

      *Well, I get one error in TestRumenJobTraces, but it seems completely unrelated, and I get the same error when running the tests on the stock 0.20.203.0 release build.

      Attachments

        1. HDFS-2053_v1.txt (2 kB, Michael G. Noll)
        2. HDFS-2053_v2.txt (2 kB, Michael G. Noll)
        3. HDFS-2053_v3.txt (4 kB, Michael G. Noll)
        4. hdfs-2053_v3-b20.patch (4 kB, Eli Collins)

      People

        Assignee: Michael G. Noll (miguno)
        Reporter: Michael G. Noll (miguno)
        Votes: 0
        Watchers: 6
