HBase
  1. HBase
  2. HBASE-5635

If getTaskList() returns null, splitlogWorker would go down and it won't serve any requests

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.92.1
    • Fix Version/s: 0.94.0, 0.95.0
    • Component/s: wal
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      During the hlog split operation if all the zookeepers are down ,then the paths will be returned as null and the splitworker thread wil be exited
      Now this regionserver wil not be able to acquire any other tasks since the splitworker thread is exited
      Please find the attached code for more details

      private List<String> getTaskList() {
          for (int i = 0; i < zkretries; i++) {
            try {
              return (ZKUtil.listChildrenAndWatchForNewChildren(this.watcher,
                  this.watcher.splitLogZNode));
            } catch (KeeperException e) {
              LOG.warn("Could not get children of znode " +
                  this.watcher.splitLogZNode, e);
              try {
                Thread.sleep(1000);
              } catch (InterruptedException e1) {
                LOG.warn("Interrupted while trying to get task list ...", e1);
                Thread.currentThread().interrupt();
                return null;
              }
            }
          }
      

      in the org.apache.hadoop.hbase.regionserver.SplitLogWorker

      1. HBASE-5635_0.94.patch
        2 kB
        ramkrishna.s.vasudevan
      2. HBASE-5635._trunk.patch
        2 kB
        ramkrishna.s.vasudevan
      3. HBASE-5635.2.patch
        2 kB
        Chinna Rao Lalam
      4. HBASE-5635.1.patch
        2 kB
        Chinna Rao Lalam
      5. HBASE-5635.patch
        2 kB
        Chinna Rao Lalam

        Activity

        Hide
        Uma Maheswara Rao G added a comment -

        Yes, I think, continuing without SplitLogWroker may not be a good behaviour.
        Because that particular regionServer may have more capacity to take up the new regions. With the current behaviour it may not compete for taking any new splilog work.

        I feel we can retry for some times and then we can shutdown regionServer?
        or other option is to retry forever on any ZK exception. And can exit only on interrupted exception.

        Also i am seeing this issue may be bit dangerous bacause, if ZK is not available for some time, all RegionServer may face this problem and no one will take up the splitlog work.

        listChildrenAndWatchForNewChildren will return null only if node does not exist. If it is not able to find any children then it will return empty list. So, zookeeper.znode.splitlog will be always set.

        On Other keeperExceptions like ZK unavalability and all, we have to handle.

        Show
        Uma Maheswara Rao G added a comment - Yes, I think, continuing without SplitLogWroker may not be a good behaviour. Because that particular regionServer may have more capacity to take up the new regions. With the current behaviour it may not compete for taking any new splilog work. I feel we can retry for some times and then we can shutdown regionServer? or other option is to retry forever on any ZK exception. And can exit only on interrupted exception. Also i am seeing this issue may be bit dangerous bacause, if ZK is not available for some time, all RegionServer may face this problem and no one will take up the splitlog work. listChildrenAndWatchForNewChildren will return null only if node does not exist. If it is not able to find any children then it will return empty list. So, zookeeper.znode.splitlog will be always set. On Other keeperExceptions like ZK unavalability and all, we have to handle.
        Hide
        Chinna Rao Lalam added a comment -

        In SplitLogWorker.taskLoop() if getTaskList() returns null it is comming out form taskLoop(). Here when ever getTaskList() return null this thread should wait and when ever it will be notified it will continue the work. So splitLogWorker thread wont be exited

        Attached the patch with this change.

        Show
        Chinna Rao Lalam added a comment - In SplitLogWorker.taskLoop() if getTaskList() returns null it is comming out form taskLoop(). Here when ever getTaskList() return null this thread should wait and when ever it will be notified it will continue the work. So splitLogWorker thread wont be exited Attached the patch with this change.
        Hide
        Ted Yu added a comment -

        Please change the wording in the following log:

        +        LOG.warn("Could not get tasks, did someone remove "
        +            + this.watcher.splitLogZNode + " ... worker thread exiting.");
        

        In this case the worker thread is not exiting.

        Show
        Ted Yu added a comment - Please change the wording in the following log: + LOG.warn( "Could not get tasks, did someone remove " + + this .watcher.splitLogZNode + " ... worker thread exiting." ); In this case the worker thread is not exiting.
        Hide
        ramkrishna.s.vasudevan added a comment -

        @Chinna

        private List<String> getTaskList() {
            for (int i = 0; i < zkretries; i++) {
              try {
                return (ZKUtil.listChildrenAndWatchForNewChildren(this.watcher,
                    this.watcher.splitLogZNode));
              } 
        

        Currently there is no retry mechanism working in getTaskList. Though there is a loop there is no retry.

        One more thing is we can differentiate if really there was a problem in setting a watch by returning null and if the watch was success but there is no child for the split node. In that case i think we can return an empty list.

        Also as the thread is just going for a wait state without watch getting set successfully the thread will keep waiting.

        Show
        ramkrishna.s.vasudevan added a comment - @Chinna private List< String > getTaskList() { for ( int i = 0; i < zkretries; i++) { try { return (ZKUtil.listChildrenAndWatchForNewChildren( this .watcher, this .watcher.splitLogZNode)); } Currently there is no retry mechanism working in getTaskList. Though there is a loop there is no retry. One more thing is we can differentiate if really there was a problem in setting a watch by returning null and if the watch was success but there is no child for the split node. In that case i think we can return an empty list. Also as the thread is just going for a wait state without watch getting set successfully the thread will keep waiting.
        Hide
        Chinna Rao Lalam added a comment -

        Updated the patch with retry logic in getTaskList(). So if getTaskList() after retry if it returns null the splitLogWorker will be exited.

        Show
        Chinna Rao Lalam added a comment - Updated the patch with retry logic in getTaskList(). So if getTaskList() after retry if it returns null the splitLogWorker will be exited.
        Hide
        Ted Yu added a comment -
        +    // It will be in loop till it get the list of children or
        +    // it will come out if worker thread existed.
        

        Typo. First line should read 'it gets the'
        Second line should read 'worker thread exited'.

        +        Thread.sleep(1000);
        

        Please reduce the above interval.

        Do we need a timeout for the newly added loop ?

        Show
        Ted Yu added a comment - + // It will be in loop till it get the list of children or + // it will come out if worker thread existed. Typo. First line should read 'it gets the' Second line should read 'worker thread exited'. + Thread .sleep(1000); Please reduce the above interval. Do we need a timeout for the newly added loop ?
        Hide
        ramkrishna.s.vasudevan added a comment -

        Do we need a timeout for the newly added loop ?

        If we have timeout, then if the zk connection takes a little longer than the timeout how to make the worker thread to be alive? That is the reason an infinite loop was tried.

        Show
        ramkrishna.s.vasudevan added a comment - Do we need a timeout for the newly added loop ? If we have timeout, then if the zk connection takes a little longer than the timeout how to make the worker thread to be alive? That is the reason an infinite loop was tried.
        Hide
        Ted Yu added a comment -

        Then we should log a message saying what we wait for every X minutes so that user doesn't have to use jstack.

        Show
        Ted Yu added a comment - Then we should log a message saying what we wait for every X minutes so that user doesn't have to use jstack.
        Hide
        Chinna Rao Lalam added a comment -

        Updated the patch with log message for retry and corrected the typo.

        Show
        Chinna Rao Lalam added a comment - Updated the patch with log message for retry and corrected the typo.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12521175/HBASE-5635.2.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in .

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1431//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1431//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1431//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12521175/HBASE-5635.2.patch against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1431//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1431//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1431//console This message is automatically generated.
        Hide
        ramkrishna.s.vasudevan added a comment -

        @Chinna
        I think logging after the sleep makes sense and also addresses Ted's concern.

        Show
        ramkrishna.s.vasudevan added a comment - @Chinna I think logging after the sleep makes sense and also addresses Ted's concern.
        Hide
        Anoop Sam John added a comment -

        As per the patch the below variable is of no use now

        this.zkretries = conf.getLong("hbase.splitlog.zk.retries", 3);
        
        Show
        Anoop Sam John added a comment - As per the patch the below variable is of no use now this .zkretries = conf.getLong( "hbase.splitlog.zk.retries" , 3);
        Hide
        Ted Yu added a comment -

        @Anoop:
        "hbase.splitlog.zk.retries" is used by SplitLogManager

        Does anyone have comment about latest patch ?

        Show
        Ted Yu added a comment - @Anoop: "hbase.splitlog.zk.retries" is used by SplitLogManager Does anyone have comment about latest patch ?
        Hide
        ramkrishna.s.vasudevan added a comment -

        @devs
        I plan to commit this to trunk tomorrow. Pls provide your comments.

        Show
        ramkrishna.s.vasudevan added a comment - @devs I plan to commit this to trunk tomorrow. Pls provide your comments.
        Hide
        Lars Hofhansl added a comment -

        +1 on patch. I wonder how many more problems like this we have lurking in the worker threads.
        Can you do a 0.94.1 patch as well?

        Show
        Lars Hofhansl added a comment - +1 on patch. I wonder how many more problems like this we have lurking in the worker threads. Can you do a 0.94.1 patch as well?
        Hide
        ramkrishna.s.vasudevan added a comment -

        Thanks Lars. Sure i will make a patch for 0.94 as well and commit both tomorrow.

        Show
        ramkrishna.s.vasudevan added a comment - Thanks Lars. Sure i will make a patch for 0.94 as well and commit both tomorrow.
        Hide
        ramkrishna.s.vasudevan added a comment -

        Reattached the patch and prepared one for 0.94 also.

        Show
        ramkrishna.s.vasudevan added a comment - Reattached the patch and prepared one for 0.94 also.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12523534/HBASE-5635_0.94.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 7 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.coprocessor.TestMasterObserver
        org.apache.hadoop.hbase.io.hfile.TestForceCacheImportantBlocks
        org.apache.hadoop.hbase.master.TestSplitLogManager

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1594//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1594//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1594//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12523534/HBASE-5635_0.94.patch against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 7 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.coprocessor.TestMasterObserver org.apache.hadoop.hbase.io.hfile.TestForceCacheImportantBlocks org.apache.hadoop.hbase.master.TestSplitLogManager Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1594//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1594//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1594//console This message is automatically generated.
        Hide
        ramkrishna.s.vasudevan added a comment -

        The above failure in TestSplitLogManager QA does not repeat. Prev run on QA has passed.
        Will look into it more before committing.

        Show
        ramkrishna.s.vasudevan added a comment - The above failure in TestSplitLogManager QA does not repeat. Prev run on QA has passed. Will look into it more before committing.
        Hide
        ramkrishna.s.vasudevan added a comment -

        Committed to trunk and 0.94. The testcase failure is not due to this patch. Ran TestSplitLogManager multiple times and did not fail.

        Thanks for the patch Chinna.
        Thanks for the review Ted, Lars and Anoop.

        Show
        ramkrishna.s.vasudevan added a comment - Committed to trunk and 0.94. The testcase failure is not due to this patch. Ran TestSplitLogManager multiple times and did not fail. Thanks for the patch Chinna. Thanks for the review Ted, Lars and Anoop.
        Hide
        Hudson added a comment -

        Integrated in HBase-TRUNK #2798 (See https://builds.apache.org/job/HBase-TRUNK/2798/)
        HBASE-5635 If getTaskList() returns null, splitlogWorker would go down and it won't serve any requests (Ram) (Revision 1329322)

        Result = FAILURE
        ramkrishna :
        Files :

        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/SplitLogWorker.java
        Show
        Hudson added a comment - Integrated in HBase-TRUNK #2798 (See https://builds.apache.org/job/HBase-TRUNK/2798/ ) HBASE-5635 If getTaskList() returns null, splitlogWorker would go down and it won't serve any requests (Ram) (Revision 1329322) Result = FAILURE ramkrishna : Files : /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/SplitLogWorker.java
        Hide
        Hudson added a comment -

        Integrated in HBase-0.94 #136 (See https://builds.apache.org/job/HBase-0.94/136/)
        HBASE-5635 If getTaskList() returns null, splitlogWorker would go down and it won't serve any requests (Chinna) (Revision 1329325)

        Result = SUCCESS
        ramkrishna :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/SplitLogWorker.java
        Show
        Hudson added a comment - Integrated in HBase-0.94 #136 (See https://builds.apache.org/job/HBase-0.94/136/ ) HBASE-5635 If getTaskList() returns null, splitlogWorker would go down and it won't serve any requests (Chinna) (Revision 1329325) Result = SUCCESS ramkrishna : Files : /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/SplitLogWorker.java
        Hide
        Anoop Sam John added a comment -

        As per the patch the below variable is of no use now

        this.zkretries = conf.getLong("hbase.splitlog.zk.retries", 3);

        @Ted
        I mean the variable zkretries not the config parameter. Yes this parameter is used in SplitLogManager.

        Show
        Anoop Sam John added a comment - As per the patch the below variable is of no use now this .zkretries = conf.getLong( "hbase.splitlog.zk.retries" , 3); @Ted I mean the variable zkretries not the config parameter. Yes this parameter is used in SplitLogManager.
        Hide
        Hudson added a comment -

        Integrated in HBase-TRUNK-security #182 (See https://builds.apache.org/job/HBase-TRUNK-security/182/)
        HBASE-5635 If getTaskList() returns null, splitlogWorker would go down and it won't serve any requests (Ram) (Revision 1329322)

        Result = SUCCESS
        ramkrishna :
        Files :

        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/SplitLogWorker.java
        Show
        Hudson added a comment - Integrated in HBase-TRUNK-security #182 (See https://builds.apache.org/job/HBase-TRUNK-security/182/ ) HBASE-5635 If getTaskList() returns null, splitlogWorker would go down and it won't serve any requests (Ram) (Revision 1329322) Result = SUCCESS ramkrishna : Files : /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/SplitLogWorker.java
        Hide
        Hudson added a comment -

        Integrated in HBase-0.94-security #20 (See https://builds.apache.org/job/HBase-0.94-security/20/)
        HBASE-5635 If getTaskList() returns null, splitlogWorker would go down and it won't serve any requests (Chinna) (Revision 1329325)

        Result = SUCCESS
        ramkrishna :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/SplitLogWorker.java
        Show
        Hudson added a comment - Integrated in HBase-0.94-security #20 (See https://builds.apache.org/job/HBase-0.94-security/20/ ) HBASE-5635 If getTaskList() returns null, splitlogWorker would go down and it won't serve any requests (Chinna) (Revision 1329325) Result = SUCCESS ramkrishna : Files : /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/SplitLogWorker.java

          People

          • Assignee:
            Chinna Rao Lalam
            Reporter:
            Kristam Subba Swathi
          • Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development