Hadoop HDFS

HDFS-1377: Quota bug for partial blocks allows quotas to be violated

    Details

    • Hadoop Flags:
      Reviewed

      Description

      There's a bug in the quota code that causes quotas not to be respected when a file is not an exact multiple of the block size. Here's an example:

      $ hadoop fs -mkdir /test
      $ hadoop dfsadmin -setSpaceQuota 384M /test
      $ ls dir/ | wc -l   # dir contains 101 files
      101
      $ du -ms dir        # each is 3mb
      304	dir
      $ hadoop fs -put dir /test
      $ hadoop fs -count -q /test
              none             inf       402653184      -550502400            2          101          317718528 hdfs://haus01.sf.cloudera.com:10020/test
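      # count -q columns: name quota, remaining name quota, space quota,
      # remaining space quota, dir count, file count, content size, path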
      $ hadoop fs -stat "%o %r" /test/dir/f30
      134217728 3    # 128MB block size, replication 3 (one 3MB block conservatively charged as 3 * 128MB)
      

      INodeDirectoryWithQuota caches the number of bytes consumed by its children in its diskspace field. The quota adjustment code has a bug that causes diskspace to get updated incorrectly when a file is not an exact multiple of the block size (the value ends up being negative).

      This causes the quota checking code to think that the files in the directory consume less space than they actually do, so verifyQuota does not throw a QuotaExceededException even when the directory is over quota. However, the bug isn't visible to users because fs count -q reports the numbers generated by INode#getContentSummary, which adds up the sizes of the blocks rather than using the cached INodeDirectoryWithQuota#diskspace value.
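
      To make the two code paths concrete, here's a minimal sketch (simplified stand-in names and shapes, not the actual HDFS classes or signatures): enforcement reads the cached counter, so once the cache goes negative the quota check never trips, while reporting recomputes consumption from the actual blocks.

      // Simplified stand-ins for the two paths above; illustrative only,
      // not the real HDFS implementation.
      class QuotaPathsSketch {
        long spaceQuota;       // configured space quota for the directory
        long cachedDiskspace;  // analogous to INodeDirectoryWithQuota#diskspace
        long[] blockBytes;     // actual bytes stored per block, times replication

        // verifyQuota-style check: consults the cached value, so once the
        // cache goes negative a QuotaExceededException is never thrown.
        boolean wouldExceed(long delta) {
          return cachedDiskspace + delta > spaceQuota;
        }

        // getContentSummary-style computation: sums real block usage, which
        // is what fs count -q reports, hiding the corrupted cache from users.
        long computeDiskspace() {
          long sum = 0;
          for (long b : blockBytes) {
            sum += b;
          }
          return sum;
        }
      }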

      In FSDirectory#addBlock the disk space consumed is set conservatively to the full block size * the number of replicas:

      updateCount(inodes, inodes.length-1, 0,
          fileNode.getPreferredBlockSize()*fileNode.getReplication(), true);
      

      In FSNamesystem#addStoredBlock we adjust for this conservative estimate by subtracting out the difference between the conservative estimate and the number of bytes actually stored:

      //Updated space consumed if required.
      INodeFile file = (storedBlock != null) ? storedBlock.getINode() : null;
      long diff = (file == null) ? 0 :
          (file.getPreferredBlockSize() - storedBlock.getNumBytes());
      
      if (diff > 0 && file.isUnderConstruction() &&
          cursize < storedBlock.getNumBytes()) {
      ...
          dir.updateSpaceConsumed(path, 0, -diff*file.getReplication());
      

      We do the same in FSDirectory#replaceNode when completing the file, but at file granularity (I believe the intent here is to correct for cases where there's a failure replicating blocks followed by recovery). Since oldnode is under construction, INodeFile#diskspaceConsumed will use the preferred block size (vs. Block#getNumBytes, used by newnode), so we will again subtract out the difference between the full block size and the number of bytes actually stored:

      long dsOld = oldnode.diskspaceConsumed();
      ...
      //check if disk space needs to be updated.
      long dsNew = 0;
      if (updateDiskspace && (dsNew = newnode.diskspaceConsumed()) != dsOld) {
        try {
          updateSpaceConsumed(path, 0, dsNew-dsOld);
      ...
      

      So in the above example we started with diskspace at 384MB (one 128MB block * 3 replicas) and then we subtract 375MB (to reflect that only 9MB raw was actually used) twice, so for each file the directory's diskspace ends up at -366MB (384MB minus 2 * 375MB). This is why the quota goes negative and yet we can still write more files.
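
      The arithmetic can be checked standalone (values from the example above; an illustration of the accounting described here, not HDFS code). Note that the true usage of the 101 files also reconciles with the -550502400 remaining space quota printed by fs count -q:

      // Demo of the double subtraction; values match the example above.
      public class QuotaDoubleSubtraction {
        public static void main(String[] args) {
          final long MB = 1024 * 1024;
          long blockSize = 128 * MB;  // preferred block size
          long fileSize = 3 * MB;     // actual bytes stored per file
          int replication = 3;

          long charged = blockSize * replication;            // 384MB at addBlock
          long diff = (blockSize - fileSize) * replication;  // 375MB over-estimate

          // addStoredBlock subtracts diff once, replaceNode subtracts it again:
          long cachedPerFile = charged - diff - diff;
          System.out.println(cachedPerFile / MB);            // prints -366

          // Quota minus true usage of 101 files (101 * 3MB * 3 replicas):
          long quota = 384 * MB;
          long trueUsage = 101 * fileSize * replication;
          System.out.println(quota - trueUsage);             // prints -550502400
        }
      }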

      So a directory with lots of single-block files ends up with a diskspace count that's way off (for files with multiple blocks, only the final partial block triggers the extra subtraction).

      I think the fix is for FSDirectory#replaceNode not to have the diskspaceConsumed calculations differ when the old and new INodes have the same blocks. I'll work on a patch which also adds a quota test for file sizes that are not multiples of the block size, and warns in INodeDirectory#computeContentSummary if the computed size does not reflect the cached value.
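
      As a rough sketch of that direction (illustrative names and shapes only, not the attached patch): skip the file-granularity adjustment when the block lists match, since block-level accounting has already charged for them.

      import java.util.Arrays;

      // Hypothetical sketch, not the attached patch: only apply the
      // file-granularity adjustment in replaceNode when the block lists
      // actually differ; when oldnode and newnode share the same blocks,
      // block-level accounting has already covered the consumption.
      class ReplaceNodeSketch {
        static long diskspaceDelta(long[] oldBlockBytes, long[] newBlockBytes) {
          if (Arrays.equals(oldBlockBytes, newBlockBytes)) {
            return 0; // avoids the second subtraction that corrupts the cache
          }
          long dsOld = sum(oldBlockBytes);
          long dsNew = sum(newBlockBytes);
          return dsNew - dsOld;
        }

        private static long sum(long[] blockBytes) {
          long total = 0;
          for (long b : blockBytes) {
            total += b;
          }
          return total;
        }
      }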

      1. hdfs-1377-1.patch
        10 kB
        Eli Collins
      2. hdfs-1377-b20-3.patch
        8 kB
        Eli Collins
      3. hdfs-1377-b20-2.patch
        9 kB
        Eli Collins
      4. hdfs-1377-b20-1.patch
        10 kB
        Eli Collins
      5. HDFS-1377.patch
        2 kB
        Zhong Wang


          Activity

          Tsz Wo Nicholas Sze added a comment -

          I have merged this to branch-0.20-security-204.

          Eli Collins added a comment -

          Weird. You're right, this change is not checked into branch-0.20-security but according to the git mirror it is. The following is from before Nicholas' push:

          common1 (branch-0.20-security-204)$ git log origin/branch-0.20-security-204 |grep HDFS-1377
              HDFS-1377. Quota bug for partial blocks allows quotas to be violated. Contributed by Eli Collins
          common1 (branch-0.20-security-204)$  
          

          Is this because the change was checked into branch-0.20 and the mirror thinks branch-0.20-security is based on branch-0.20?

          John George added a comment -

          Thanks Nicholas and Eli. The fix is now in branch-20-security. I ran the related tests and they passed.

          Tsz Wo Nicholas Sze added a comment -

          I merged this to 0.20-security.

          Eli, it is not yet in branch-0.20-security-204 but thanks for taking a look.

          John, could you test the 0.20-security branch one more time?

          John George added a comment -

          I am sure I am missing something here, but in my "view" of branch-20-security, I do not see the patch. I am looking at:
          http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security

          Eli Collins added a comment -

          It is already checked into branch-0.20-security and branch-0.20-security-204. Updating the fix version to match. Would be great if this were done for the other jiras in the branch.

          John George added a comment -

          Can someone (Nicholas?) commit the 20 patch in this JIRA to branch-0.20-security?

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-22-branch #35 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-22-branch/35/)

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #643 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/643/)

          Tsz Wo Nicholas Sze added a comment -

          Also committed this to the Federation Branch. Thanks, Eli!

          Eli Collins added a comment -

          Thanks Raghu! I've committed this to trunk, and the 22, 21, and 20 branches.

          Raghu Angadi added a comment -

          +1.

          Thanks Eli. The warning in contentSummary() is also useful.

          Raghu Angadi added a comment -

          Thanks Eli. Will review the patch tonight (tomorrow night at the latest).

          Eli Collins added a comment -

          Patch for trunk attached. Release audit warnings are a separate issue. No change in test failures.

               [exec] 
               [exec] -1 overall.  
               [exec] 
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec] 
               [exec]     +1 tests included.  The patch appears to include 9 new or modified tests.
               [exec] 
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec] 
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec] 
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.
               [exec] 
               [exec]     -1 release audit.  The applied patch generated 103 release audit warnings (more than the trunk's current 98 warnings).
               [exec] 
               [exec]     +1 system test framework.  The patch passed system test framework compile.
               [exec] 
          
          Eli Collins added a comment -

          Patch for branch 20 attached. It addresses Raghu's feedback. All tests pass modulo TestDU, which is HADOOP-7045. I think the patch is ready to be checked in.

               [exec] 
               [exec] -1 overall.  
               [exec] 
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec] 
               [exec]     +1 tests included.  The patch appears to include 9 new or modified tests.
               [exec] 
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec] 
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec] 
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
               [exec] 
               [exec]     -1 Eclipse classpath. The patch causes the Eclipse classpath to differ from the contents of the lib directories.
               [exec] 
          

          Not sure what the Eclipse classpath warning is about; this patch just edits some source files.

          Eli Collins added a comment -

          Removing the disk space updating code in replaceNode also looks safe in trunk, though I noticed we're missing coverage for quotas and append. Filed HDFS-1515.

          Eli Collins added a comment -

          Thanks for taking a look Raghu!

          Not sure how it manifests in real life, but here a file is going from INodeFile to INodeFileUnderConstruction. So the actual space consumed should be rounded upwards to the block boundary. addStoredBlock is not called at this time.

          The sequence of events is:

          • A client starts appending to an existing file
          • dfsadmin -saveNamespace (persists the INodeFileUnderConstruction)
          • The NN restarts; when loading the image we replace the old INode with the persisted INodeFileUnderConstruction.

          Since disk usage is not persisted in the image, and is calculated as DNs report blocks on startup, I don't think this path should update disk usage.

          How about calling replaceNode(..., updateDiskspace = false) in finalizeFileUnderconstruction(), since we know that oldNode is not under construction anymore?

          I think we can remove the disk space updating code there entirely, even on trunk, and add tests to make sure the block-level accounting is working.

          btw, do we need a separate test file for this? TestQuota.java is already supposed to test violations (FWIW, it does test some violations).

          I had put them in a separate class so they could share setup and teardown (vs TestQuota, which has setup/teardown code for each test). I'll update the patches to add the tests to TestQuota so they're all in one place.

          Raghu Angadi added a comment -

          I am still refreshing my memory of the NN internals...

          I think the patch is correct. I too am not sure if this causes problems elsewhere. It looks like it fixes more than it might break.

          This patch fixes the case where replaceNode(oldINode, newINode) is called under a normal close(), where oldINode is an INodeUnderConstruction only by its Java type but not logically (it has already been removed from the active leases etc.).

          What about the other way around, where newINode is an INodeUnderConstruction, as described by Eli below:

          So the question becomes do oldnode and newnode ever have different blocks? On branch 20 I don't think that's the case. The only caller where it seems that it could potentially be the case is loadFilesUnderConstruction, in which case the under construction INode may have an extra block, but then wouldn't that have been accounted for via addStoredBlock already?

          Not sure how it manifests in real life, but here a file is going from INodeFile to INodeFileUnderConstruction. So the actual space consumed should be rounded upwards to the block boundary. addStoredBlock is not called at this time.

          How about calling replaceNode(..., updateDiskspace = false) in finalizeFileUnderconstruction(), since we know that oldNode is not under construction anymore?

          btw, do we need a separate test file for this? TestQuota.java is already supposed to test violations (FWIW, it does test some violations).

          Eli Collins added a comment -

          Right patch for branch 20 attached this time.

          Eli Collins added a comment -

          Patch for branch 20 attached. I think this bug is bad enough to fix on this branch.

          Apologies, I thought I had posted this earlier.

          • Adds TestQuotaViolations which covers these particular cases
          • Removes the accounting in replaceNode (which fixes these particular bugs), a similar fix to the patch Zhong created
          • Adds a warning in INodeDirectory#computeContentSummary to catch future cases. I've run this patch on a cluster and not seen it warn, but it will catch cases like HDFS-1487.

          Eli Collins added a comment -

          So what is the original idea of the code to be removed in your patch?

          If the oldnode and newnode arguments to replaceNode always have the same blocks, then the disk space updating code in replaceNode is redundant. This code goes back to the original space quota addition (HADOOP-3938), and it looks like it was redundant there as well (note that the comment near the code even says "Currently oldnode and newnode are assumed to contain the same blocks", so perhaps they were trying to be conservative).

          So the question becomes do oldnode and newnode ever have different blocks? On branch 20 I don't think that's the case. The only caller where it seems that it could potentially be the case is loadFilesUnderConstruction, in which case the under construction INode may have an extra block, but then wouldn't that have been accounted for via addStoredBlock already?

          I still need to look at the trunk code more thoroughly; the latest code, particularly append, makes things less obvious. I do think we should remove this particular file-level accounting code in replaceNode, and always make quota adjustments at block granularity.

          Weichao Ruan added a comment -

          Zhong,

          Thanks. In fact, I encountered the same problem in our cluster.
          I also tested your patch, and it looks good.
          +1 for this patch.

          Zhong Wang added a comment -

          Weichao,

          The reason is the same as in the issue description from Eli Collins. When finalizing an incomplete file, BlockManager reduces the pending block size in commitBlock() while FSDirectory.replaceNode() does the same thing. The diskspace count will be reduced twice and end up less than the actual value; worse, the value may be negative, which is obviously illegal.

          You can reproduce the problem using the steps from Eli Collins. Or make a regression test using my test case.

          Weichao Ruan added a comment -

          So what is the original idea of the code to be removed in your patch?

          Zhong Wang added a comment -

          Patch attached and submitted to run through Hudson.

          I removed the disk space check in FSDirectory.replaceNode(); it duplicates BlockManager.commitBlock(). All diskspace counting will be updated at the block level when writing (appending) a file, in addBlock(), commitBlock() and abandonBlock() (HDFS-1487).

          I also added a test case in TestQuota. The original tests hid this issue because all file lengths were multiples of 1024 while the block size was 512. I create a file whose length is not an exact multiple of the block size to test this case.

          Hadoop developers and committers, thanks for taking the time to review this patch and give me suggestions! HDFS quotas are very important for managing our clusters, which are shared by dozens of users and hold PBs of storage.

          Zhong Wang added a comment -

          I made a link from HDFS-1487. I found the same problem when testing the issue. I think I can fix this soon and attach a patch. Any comments?


            People

            • Assignee:
              Eli Collins
              Reporter:
              Eli Collins
            • Votes:
              2
              Watchers:
              9
