Hadoop HDFS / HDFS-1348

Improve NameNode responsiveness while it is checking if datanode decommissions are complete

    Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: namenode
    • Labels: None

      Description

      NameNode is normally busy all the time; its log is full of activity every second. But once in a while, NameNode seems to pause for more than 10 seconds without doing anything, leaving a gap in its log even though no garbage collection is happening. All other requests to NameNode are blocked while this is happening.

      One culprit is the DecommissionManager. Its monitor holds the fsnamesystem lock for the whole process of checking whether decommissioning DataNodes are finished, during which it checks every block of up to a default of 5 datanodes.
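      To make the cost concrete, here is a minimal Java sketch of the coarse-grained pattern described above. This is an editor's illustration with placeholder types and names (BlockStub, DatanodeStub, checkDecommissionProgress), not code from Hadoop; it only shows the shape of the problem: one global lock held while every block of up to 5 decommissioning datanodes is examined.

      import java.util.ArrayList;
      import java.util.List;

      // Editor's sketch only: placeholder types standing in for HDFS internals.
      class BlockStub {}

      class DatanodeStub {
        // A large datanode can easily hold 100K+ blocks.
        List<BlockStub> blocks = new ArrayList<>();
      }

      class CoarseGrainedCheckSketch {
        // Stands in for the single global FSNamesystem lock.
        private final Object fsnamesystem = new Object();

        // The entire pass runs under one global lock: while it scans every block
        // of up to 5 decommissioning datanodes, all other NameNode requests wait.
        void checkDecommissionProgress(List<DatanodeStub> decommissioningNodes) {
          synchronized (fsnamesystem) {
            int checked = 0;
            for (DatanodeStub node : decommissioningNodes) {
              if (checked++ >= 5) {
                break;
              }
              for (BlockStub b : node.blocks) {
                isSufficientlyReplicated(b);   // per-block check, still under the lock
              }
            }
          }
        }

        private boolean isSufficientlyReplicated(BlockStub b) {
          return true;                         // placeholder for the real replication check
        }
      }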

      Attachments

      1. decomissionImp2.patch
        11 kB
        Hairong Kuang
      2. decomissionImp1.patch
        11 kB
        Hairong Kuang
      3. decommission1.patch
        10 kB
        Hairong Kuang
      4. decommission.patch
        10 kB
        Hairong Kuang

        Activity

        Hairong Kuang added a comment -

        I read more code to understand how the check works. Currently the check iterates over up to a default of 5 datanodes to see whether they are done with decommissioning. It is easy to release the global fsnamesystem lock in between checking two nodes.

        When checking one datanode, it traverses all of the node's blocks to see if they are replicated, during which the global lock is held. If a datanode has, say, 100K blocks, this is going to be a non-trivial amount of time. Any ideas for breaking this into smaller chunks without sacrificing accuracy?

        Hairong Kuang added a comment -

        Here is the proposed change.

        Currently the check logic is as follows:

        synchronized (fsnamesystem) {
          for up to five decommission-in-progress datanodes
            check whether all of the datanode's blocks have been replicated;
        }
        

        I plan to change the structure to be:

        for up to five iterations {
          synchronized (fsnamesystem) {
            node = get next decommission-in-progress node;
          }

          do {
            synchronized (fsnamesystem) {
              fetch up to 2000 unchecked blocks from node;
            }
            for each block b {
              synchronized (fsnamesystem) {
                check whether block b has been replicated;
              }
            }
          } until all blocks of node have been checked;
        }
        

        This proposed restructuring makes the locking granularity much smaller and should improve NameNode's responsiveness when the decommissioning check occurs.
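        As an editor's illustration of what this restructured loop might look like in Java (reusing the placeholder types BlockStub and DatanodeStub from the sketch under the description; the helper names are made up, and only the 5-node and 2000-block figures come from the pseudocode above):

        import java.util.ArrayList;
        import java.util.List;

        // Editor's sketch of the proposed fine-grained locking; not the actual patch.
        class ChunkedCheckSketch {
          private static final int NODES_PER_PASS = 5;
          private static final int BLOCKS_PER_BATCH = 2000;

          private final Object fsnamesystem = new Object();

          void checkDecommissionProgress(List<DatanodeStub> decommissioningNodes) {
            for (int i = 0; i < NODES_PER_PASS; i++) {
              DatanodeStub node;
              synchronized (fsnamesystem) {        // short critical section: pick the next node
                node = i < decommissioningNodes.size() ? decommissioningNodes.get(i) : null;
              }
              if (node == null) {
                return;
              }

              int offset = 0;
              boolean allChecked = false;
              while (!allChecked) {
                List<BlockStub> batch;
                synchronized (fsnamesystem) {      // fetch a bounded batch of unchecked blocks
                  int end = Math.min(node.blocks.size(), offset + BLOCKS_PER_BATCH);
                  batch = new ArrayList<>(node.blocks.subList(offset, end));
                }
                for (BlockStub b : batch) {
                  synchronized (fsnamesystem) {    // lock is released between individual block checks
                    isSufficientlyReplicated(b);
                  }
                }
                offset += batch.size();
                allChecked = offset >= node.blocks.size();
              }
            }
          }

          private boolean isSufficientlyReplicated(BlockStub b) {
            return true;                           // placeholder
          }
        }

        Copying each batch out of the node's block list avoids iterating live NameNode state outside the lock, at the cost of some extra memory per batch.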

        dhruba borthakur added a comment -

        +1. Is it possible to do the check with the fsnamesystem read lock only?

        Hairong Kuang added a comment -

        With read/write locks, the first and second cases could use the read lock, while the third case has to use the write lock because it updates some stats of the decommissioning node.
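        A minimal sketch of how the three critical sections could map onto a java.util.concurrent ReentrantReadWriteLock, as suggested here. This is an editor's illustration only: the lock field, the method names, and the placeholder types (the same BlockStub and DatanodeStub as in the earlier sketches) are not the actual FSNamesystem API.

        import java.util.ArrayList;
        import java.util.List;
        import java.util.concurrent.locks.ReentrantReadWriteLock;

        // Editor's sketch only; illustrative mapping of the three cases onto a read/write lock.
        class ReadWriteLockCheckSketch {
          private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock();

          // Case 1: picking the next decommissioning node only reads namesystem state.
          DatanodeStub nextDecommissioningNode(List<DatanodeStub> nodes, int i) {
            fsLock.readLock().lock();
            try {
              return i < nodes.size() ? nodes.get(i) : null;
            } finally {
              fsLock.readLock().unlock();
            }
          }

          // Case 2: fetching a batch of unchecked blocks is also read-only.
          List<BlockStub> fetchUncheckedBlocks(DatanodeStub node, int offset, int max) {
            fsLock.readLock().lock();
            try {
              int from = Math.min(offset, node.blocks.size());
              int to = Math.min(node.blocks.size(), offset + max);
              return new ArrayList<>(node.blocks.subList(from, to));
            } finally {
              fsLock.readLock().unlock();
            }
          }

          // Case 3: the per-block check may update the node's decommissioning stats,
          // so it takes the write lock.
          void checkBlock(DatanodeStub node, BlockStub b) {
            fsLock.writeLock().lock();
            try {
              // ... check replication and update the node's under-replicated counters ...
            } finally {
              fsLock.writeLock().unlock();
            }
          }
        }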

        Hairong Kuang added a comment -

        Here is a patch for review.

        Dmytro Molkov added a comment -

        I like the patch overall. I have two questions, though: why did you choose 2000 blocks to check at a time, and do you think it might be worth making it configurable? Also, how much memory pressure would the checkedBlocks hash set add?

        Hairong Kuang added a comment -

        Yes, we should restrict the scanning overhead by putting an upper limit on the number of block fetches. So I would propose to set the number of blocks to be fetched to
        max(2000, total # of blocks / 5)
        so that one check scans the block list approximately up to 5 times.

        As for the checkedBlocks overhead, I plan to use the LightweightGSet introduced by HDFS-1114.
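        As a small illustration (editor's sketch, class and method names made up), the proposed fetch limit amounts to:

        class FetchLimitSketch {
          // At least 2000 blocks per batch; for large nodes, totalBlocks / 5, so that
          // one full check walks the node's block list in at most about 5 batches.
          static int blocksToFetchPerLock(int totalBlocks) {
            return Math.max(2000, totalBlocks / 5);
          }
        }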

        Hairong Kuang added a comment -

        This patch addresses Dmytro's review comments.

        Hairong Kuang added a comment -

        This new patch groups the block checks into batches of 1000, reducing the locking overhead compared to the previous patch.

        Hairong Kuang added a comment -

        Changed the patch to use a read/write lock.

        Dmytro Molkov added a comment -

        The patch looks good to me. +1

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12455885/decomissionImp2.patch
        against trunk revision 1051669.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these core unit tests:
        org.apache.hadoop.hdfs.server.namenode.TestStorageRestore
        org.apache.hadoop.hdfs.TestFileConcurrentReader
        org.apache.hadoop.hdfs.TestHDFSTrash

        -1 contrib tests. The patch failed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/22//testReport/
        Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/22//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/22//console

        This message is automatically generated.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12455885/decomissionImp2.patch
        against trunk revision 1072023.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/189//console

        This message is automatically generated.


          People

          • Assignee: Hairong Kuang
          • Reporter: Hairong Kuang
          • Votes: 0
          • Watchers: 4
