HBASE-5890

SplitLog Rescan BusyWaits upon Zk.CONNECTIONLOSS

    Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      We ran into a production issue yesterday where the SplitLogManager tried to create a Rescan node in ZK. The createAsync() call generated a KeeperException.CONNECTIONLOSS that was immediately sent to processResult(), and the Rescan node creation was called again with --retry_count. Because the retries ran back to back, this created a CPU busy-wait that also clogged up the logs. We should handle this better.

      1. HBASE-5890.patch
        2 kB
        Nicolas Spiegelberg

        Activity

        Nicolas Spiegelberg added a comment -

        The original idea is to have a timeout when we encounter this error. Since we have a recoverable ZK, it seems okay to retry after connection loss; but we should have some sort of dampening so that this isn't a CPU & log hog.
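One way to picture the dampening described above is a capped exponential backoff, where each retry is scheduled on a timer thread instead of being reissued immediately. This is only a minimal sketch and not the attached patch: the class `DampenedRetry`, the `Attempt` interface, and the delay values are all hypothetical, and the operation is simulated rather than talking to ZooKeeper.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of dampened retries; not SplitLogManager code. */
class DampenedRetry {
    /** Stand-in for an operation that can fail on connection loss; returns true on success. */
    interface Attempt {
        boolean run();
    }

    /** Delay before retry number `attempt` (0-based): baseMs doubled per attempt, capped at maxMs. */
    static long backoffMs(int attempt, long baseMs, long maxMs) {
        long delay = baseMs;
        for (int i = 0; i < attempt && delay < maxMs; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMs);
    }

    /** Runs op once; on failure, schedules the next try after a backoff instead of retrying at once. */
    static void retry(ScheduledExecutorService timer, Attempt op,
                      int attempt, int retriesLeft, CountDownLatch done) {
        if (op.run() || retriesLeft == 0) {
            done.countDown();
            return;
        }
        long delay = backoffMs(attempt, 10, 1000); // illustrative values only
        timer.schedule(() -> retry(timer, op, attempt + 1, retriesLeft - 1, done),
                delay, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger calls = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(1);
        // Simulated operation: fails twice (as if on CONNECTIONLOSS), then succeeds.
        retry(timer, () -> calls.incrementAndGet() > 2, 0, 5, done);
        done.await();
        timer.shutdown();
        System.out.println("attempts: " + calls.get());
    }
}
```

Scheduling the retry on a separate timer keeps the calling thread free, and the cap bounds how long a recovering connection has to wait for the next attempt.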

        Nicolas Spiegelberg added a comment -

        patch should work for 89fb, 94, and trunk

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12524903/HBASE-5890.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 hadoop23. The patch compiles against the hadoop 0.23.x profile.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in .

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1670//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1670//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1670//console

        This message is automatically generated.

        Ted Yu added a comment -

        Patch looks good.
        In the catch clause, the following statement should be added:

                  Thread.currentThread().interrupt();
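
The patch itself is not reproduced in this thread, so the following is a generic sketch that assumes the catch clause in question wraps a Thread.sleep() call. The class and method names (`InterruptAwareSleep`, `sleepQuietly`) are hypothetical. The point is that catching InterruptedException clears the thread's interrupt flag, so re-asserting it lets callers further up still see the interruption.

```java
/** Hypothetical sketch of restoring the interrupt flag; not the code from the patch. */
class InterruptAwareSleep {
    /**
     * Sleeps for up to millis. Returns true if the sleep completed, or false if it
     * was interrupted, in which case the interrupt flag is set again before returning.
     */
    static boolean sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
            return true;
        } catch (InterruptedException e) {
            // Thread.sleep() cleared the flag when it threw; restore it for callers.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        Thread.currentThread().interrupt(); // simulate a pending interrupt
        boolean completed = sleepQuietly(50);
        System.out.println("completed=" + completed
                + ", interrupted=" + Thread.currentThread().isInterrupted());
    }
}
```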
        
        Prakash Khemani added a comment -

        Most likely, it isn't a good idea to sleep in the zookeeper callback thread. (isn't the zk client single threaded?)

        Can these be queued in a DelayedQueue(socket-timeout) and retried from SplitLogManager.TimeoutMonitor.chore()?
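
The queueing idea above could look roughly like the sketch below, which uses the JDK's java.util.concurrent.DelayQueue. It is an illustration under assumptions, not HBase code: `DelayedRetryQueue`, `PendingRetry`, and the node paths are hypothetical, and `drainReady()` stands in for whatever a periodic chore such as TimeoutMonitor.chore() would call. The callback only enqueues and returns; the chore later picks up the entries whose delay has expired.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of deferring retries to a periodic chore; not HBase code. */
class DelayedRetryQueue {
    /** A retry that becomes eligible once its delay has elapsed. */
    static final class PendingRetry implements Delayed {
        final String path;
        final int retriesLeft;
        private final long readyAtNanos;

        PendingRetry(String path, int retriesLeft, long delayMs) {
            this.path = path;
            this.retriesLeft = retriesLeft;
            this.readyAtNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(delayMs);
        }

        @Override
        public long getDelay(TimeUnit unit) {
            return unit.convert(readyAtNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.NANOSECONDS),
                    other.getDelay(TimeUnit.NANOSECONDS));
        }
    }

    private final DelayQueue<PendingRetry> queue = new DelayQueue<>();

    /** Intended to be called from the callback thread: enqueues and returns without sleeping. */
    void enqueue(String path, int retriesLeft, long delayMs) {
        queue.add(new PendingRetry(path, retriesLeft, delayMs));
    }

    /** Intended to be called from a periodic chore: returns the retries whose delay has expired. */
    List<PendingRetry> drainReady() {
        List<PendingRetry> ready = new ArrayList<>();
        PendingRetry next;
        // DelayQueue.poll() returns null while the head element is still delayed.
        while ((next = queue.poll()) != null) {
            ready.add(next);
        }
        return ready;
    }

    /** Number of retries still waiting, expired or not. */
    int pending() {
        return queue.size();
    }

    public static void main(String[] args) {
        DelayedRetryQueue q = new DelayedRetryQueue();
        q.enqueue("/hypothetical/rescan-a", 3, 0);      // eligible right away
        q.enqueue("/hypothetical/rescan-b", 3, 60_000); // eligible in a minute
        System.out.println("ready now: " + q.drainReady().size()
                + ", still pending: " + q.pending());
    }
}
```

With this shape, the retry latency is bounded below by the chore period, which is the trade-off raised in the next comment.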

        Ted Yu added a comment -

        Prakash made a good point.

            this.timeoutMonitor = new TimeoutMonitor(
                conf.getInt("hbase.splitlog.manager.timeoutmonitor.period",
                    1000),
        

        TimeoutMonitor runs at an interval longer than the socket timeout. If the default 1-second interval (for TimeoutMonitor) is acceptable as the delay, this approach would work.

        Lars Hofhansl added a comment -

        Important for 0.94.0? Just say so, and I'll wait.

        Lars Hofhansl added a comment -

        Moving out for now.

        Lars Hofhansl added a comment -

        And on to 0.94.2

        Lars Hofhansl added a comment -

        No movement, unscheduling from 0.94.

        stack added a comment -

        Moving out of 0.95. Issue has merit but no assignee and good suggestions on how the patch could be improved.


          People

          • Assignee:
            Unassigned
          • Reporter:
            Nicolas Spiegelberg
          • Votes:
            0
          • Watchers:
            5
