HBase
  1. HBase
  2. HBASE-7937

Retry log rolling to support HA NN scenario

    Details

    • Type: Bug Bug
    • Status: Patch Available
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 0.94.5
    • Fix Version/s: None
    • Component/s: wal
    • Labels:
      None

      Description

      A failure in log rolling causes regionserver abort. In case of HA NN, it will be good if there is a retry mechanism to roll the logs.
      A corresponding jira for MemStore retries is HBASE-7507.

      1. HBASE-7937-v1.patch
        6 kB
        Himanshu Vashishtha
      2. HBASE-7937-trunk.patch
        6 kB
        Himanshu Vashishtha
      3. HBase-7937-0.94.txt
        4 kB
        Liang Xie
      4. HBase-7937-trunk.txt
        4 kB
        Liang Xie

        Activity

        Hide
        Himanshu Vashishtha added a comment -

        0.94 patch.
        mvn test -P localTests -Dtest=TestLogRolling,TestHLog,TestLogRollAbort pass
        Running a jenkins now.

        Show
        Himanshu Vashishtha added a comment - 0.94 patch. mvn test -P localTests -Dtest=TestLogRolling,TestHLog,TestLogRollAbort pass Running a jenkins now.
        Hide
        Himanshu Vashishtha added a comment -

        jenkins job passes.

        Show
        Himanshu Vashishtha added a comment - jenkins job passes.
        Hide
        Ted Yu added a comment -

        Can you attach patch for trunk ?

        +  public static boolean isFileSystemAvailable(final FileSystem fs) {
        

        Please add javadoc for parameter 'fs'

        Show
        Ted Yu added a comment - Can you attach patch for trunk ? + public static boolean isFileSystemAvailable( final FileSystem fs) { Please add javadoc for parameter 'fs'
        Hide
        Himanshu Vashishtha added a comment -

        trunk patch with Ted's review comments.

        Show
        Himanshu Vashishtha added a comment - trunk patch with Ted's review comments.
        Hide
        stack added a comment -

        Why extra spacing:

        + private int logRollRetryCount;

        Can these be final? Are they set in the Constructor?

        Why not just set it back to a default with a log that this has happened rather than fail the precondition?

        + Preconditions.checkArgument(this.logRollRetryCount > 0,
        + "Log Roll retry count should be a positive number");

        What is this about?

        + // there may be a case when fs has just become available; one can do one more retry
        + if (fsOk) throw ioe;

        So if fs is ok, we throw the ioe? Should we not go around again and retry? (In comment says 'there may be a case'.. what is the case?

        Are we incrementing in the for loop and below again when we do + i++; ? (i.e. incrementing in two places?)

        Do you think we should pause here:

        + Thread.sleep(pauseTime);

        How long is default?

        Are we holding up all writes when we are paused like this?

        Looks like duplicated code. Could it be avoided?

        You add a new method to FSUtils but its guts are from checkFileSystemAvailable. Why not make it so you do not duplicate code, have checkFileSystemAvailable call your new method or factor out a private method that both could use?

        Show
        stack added a comment - Why extra spacing: + private int logRollRetryCount; Can these be final? Are they set in the Constructor? Why not just set it back to a default with a log that this has happened rather than fail the precondition? + Preconditions.checkArgument(this.logRollRetryCount > 0, + "Log Roll retry count should be a positive number"); What is this about? + // there may be a case when fs has just become available; one can do one more retry + if (fsOk) throw ioe; So if fs is ok, we throw the ioe? Should we not go around again and retry? (In comment says 'there may be a case'.. what is the case? Are we incrementing in the for loop and below again when we do + i++; ? (i.e. incrementing in two places?) Do you think we should pause here: + Thread.sleep(pauseTime); How long is default? Are we holding up all writes when we are paused like this? Looks like duplicated code. Could it be avoided? You add a new method to FSUtils but its guts are from checkFileSystemAvailable. Why not make it so you do not duplicate code, have checkFileSystemAvailable call your new method or factor out a private method that both could use?
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12571297/HBASE-7937-trunk.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 lineLengths. The patch does not introduce lines longer than 100

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.regionserver.wal.TestHLogSplit

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/4588//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4588//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4588//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4588//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4588//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4588//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4588//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4588//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4588//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4588//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12571297/HBASE-7937-trunk.patch against trunk revision . +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 hadoop2.0 . The patch compiles against the hadoop 2.0 profile. +1 javadoc . The javadoc tool did not generate any warning messages. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 lineLengths . The patch does not introduce lines longer than 100 -1 core tests . The patch failed these unit tests: org.apache.hadoop.hbase.regionserver.wal.TestHLogSplit Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/4588//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4588//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4588//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4588//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4588//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4588//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4588//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4588//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4588//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4588//console This message is automatically generated.
        Hide
        Himanshu Vashishtha added a comment -

        Thanks for taking a look.

        + private int logRollRetryCount;

        Yes, they are set in ctr; I will make them final, and make them default in case the value is <=0.

        // there may be a case when fs has just become available; one can do one more retry

        I was considering the case when a NN HA recovers in b/w we failed while doing an op, and checking via FSUtils#checkFSAvailable call. If that happens, it will be in a state for eg: fs.rename() threw an exception, but fs is healthy... so rethrow the exception to the caller. In actual, it should have done one more retry.
        I tried to cover that case with the fsOk variable. If you think this is not needed, I will remove it.

        incrementing twice.

        Sorry about that. I will fix this.

        Default pause time:

        1 sec; as defined in HConstants#DEFAULT_HBASE_SERVER_PAUSE

        Are we holding up all writes when we are paused like this?

        I don't think we are. We are in the retrying loop at two places here:
        a) Creating a new log writer
        b) Archiving old logs
        As long as we haven't created a new writer, we don't change the old log writer. So, we are still pointing to the old hlog.
        Archiving old logs shouldn't be a blocking call. If it is, it is a bug.

        refactoring..

        Will do.

        TestHLogSplit passes on local. I didn't change the LogSplitter code. Tried to keep its scope minimum.

        Show
        Himanshu Vashishtha added a comment - Thanks for taking a look. + private int logRollRetryCount; Yes, they are set in ctr; I will make them final, and make them default in case the value is <=0. // there may be a case when fs has just become available; one can do one more retry I was considering the case when a NN HA recovers in b/w we failed while doing an op, and checking via FSUtils#checkFSAvailable call. If that happens, it will be in a state for eg: fs.rename() threw an exception, but fs is healthy... so rethrow the exception to the caller. In actual, it should have done one more retry. I tried to cover that case with the fsOk variable. If you think this is not needed, I will remove it. incrementing twice. Sorry about that. I will fix this. Default pause time: 1 sec; as defined in HConstants#DEFAULT_HBASE_SERVER_PAUSE Are we holding up all writes when we are paused like this? I don't think we are. We are in the retrying loop at two places here: a) Creating a new log writer b) Archiving old logs As long as we haven't created a new writer, we don't change the old log writer. So, we are still pointing to the old hlog. Archiving old logs shouldn't be a blocking call. If it is, it is a bug. refactoring.. Will do. TestHLogSplit passes on local. I didn't change the LogSplitter code. Tried to keep its scope minimum.
        Hide
        stack added a comment -

        Himanshu Vashishtha Any chance of a new patch here?

        Show
        stack added a comment - Himanshu Vashishtha Any chance of a new patch here?
        Hide
        Liang Xie added a comment -

        Hi Himanshu Vashishtha, i can make a new patch per comments if you are busy. I am working logrolling retry in our internal codebase, so it's handy for me.

        Show
        Liang Xie added a comment - Hi Himanshu Vashishtha , i can make a new patch per comments if you are busy. I am working logrolling retry in our internal codebase, so it's handy for me.
        Hide
        Liang Xie added a comment -

        Attached is a patch for 0.94 branch

        Show
        Liang Xie added a comment - Attached is a patch for 0.94 branch
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12577433/HBase-7937-0.94.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5169//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12577433/HBase-7937-0.94.txt against trunk revision . +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 patch . The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5169//console This message is automatically generated.
        Hide
        Liang Xie added a comment -

        Uploaded patch for trunk

        Show
        Liang Xie added a comment - Uploaded patch for trunk
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12577439/HBase-7937-trunk.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 lineLengths. The patch does not introduce lines longer than 100

        +1 site. The mvn site goal succeeds with this patch.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.security.access.TestAccessController

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/5172//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5172//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5172//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5172//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5172//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5172//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5172//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5172//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5172//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5172//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12577439/HBase-7937-trunk.txt against trunk revision . +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 hadoop2.0 . The patch compiles against the hadoop 2.0 profile. +1 javadoc . The javadoc tool did not generate any warning messages. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 lineLengths . The patch does not introduce lines longer than 100 +1 site . The mvn site goal succeeds with this patch. -1 core tests . The patch failed these unit tests: org.apache.hadoop.hbase.security.access.TestAccessController Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/5172//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5172//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5172//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5172//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5172//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5172//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5172//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5172//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5172//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5172//console This message is automatically generated.
        Hide
        Himanshu Vashishtha added a comment -

        Hey Liang, looking at the 0.94 patch, it looks it is redoing what was done by HBaseFileSystem and HFlogFileSystem. They already have the retrying logic. If we this route, then we will be defining properties for retry number/wait time, etc for all such processes.

        I am working on something similar to HBASE-8211 for 0.95, and fold the hlog rolling retry in that effort.

        Show
        Himanshu Vashishtha added a comment - Hey Liang, looking at the 0.94 patch, it looks it is redoing what was done by HBaseFileSystem and HFlogFileSystem. They already have the retrying logic. If we this route, then we will be defining properties for retry number/wait time, etc for all such processes. I am working on something similar to HBASE-8211 for 0.95, and fold the hlog rolling retry in that effort.
        Hide
        Liang Xie added a comment -

        O,i c, seems my attached code is only suitable for older release, e.g. 0.94.3 got it, thanks for explaining, Himanshu Vashishtha.

        Show
        Liang Xie added a comment - O,i c, seems my attached code is only suitable for older release, e.g. 0.94.3 got it, thanks for explaining, Himanshu Vashishtha .
        Hide
        stack added a comment -

        Himanshu Vashishtha We expecting a new patch for trunk on this issue? Thanks.

        Show
        stack added a comment - Himanshu Vashishtha We expecting a new patch for trunk on this issue? Thanks.
        Hide
        stack added a comment -

        This is for 0.94

        Show
        stack added a comment - This is for 0.94

          People

          • Assignee:
            Himanshu Vashishtha
            Reporter:
            Himanshu Vashishtha
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:

              Development