HBase
  1. HBase
  2. HBASE-7507

Make memstore flush be able to retry after exception

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 0.94.6, 0.95.0
    • Fix Version/s: 0.94.6, 0.95.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      We will abort regionserver if memstore flush throws exception.

      I thinks we could do retry to make regionserver more stable because file system may be not ok in a transient time. e.g. Switching namenode in the NamenodeHA environment

      HRegion#internalFlushcache(){
      
      ...
      try {
      ...
      }catch(Throwable t){
      DroppedSnapshotException dse = new DroppedSnapshotException("region: " +
                Bytes.toStringBinary(getRegionName()));
      dse.initCause(t);
      throw dse;
      }
      ...
      
      }
      
      MemStoreFlusher#flushRegion(){
      ...
      region.flushcache();
      ...
       try {
      }catch(DroppedSnapshotException ex){
      server.abort("Replay of HLog required. Forcing server shutdown", ex);
      }
      
      ...
      }
      
      1. 7507-trunk v1.patch
        2 kB
        chunhui shen
      2. 7507-trunk v2.patch
        5 kB
        chunhui shen
      3. 7507-trunkv3.patch
        4 kB
        chunhui shen
      4. 7507-94.patch
        4 kB
        chunhui shen

        Issue Links

          Activity

          Hide
          Ted Yu added a comment -
          +          lastException = new IOException(e);
          

          If e is instanceof IOException, we don't need to wrap it, right ?

          +          // Path name is null if there is no entries to flush
          

          'no entries' -> 'no entry'

          +            validateStoreFile(pathName);
          

          Why moving the location of validateStoreFile() call ?

          Show
          Ted Yu added a comment - + lastException = new IOException(e); If e is instanceof IOException, we don't need to wrap it, right ? + // Path name is null if there is no entries to flush 'no entries' -> 'no entry' + validateStoreFile(pathName); Why moving the location of validateStoreFile() call ?
          Hide
          ramkrishna.s.vasudevan added a comment -

          Seems good Chunhui. Why don't we expose it as a configuration with default 1 so that any one can change it if they have an env as stated by your usecase.

          Show
          ramkrishna.s.vasudevan added a comment - Seems good Chunhui. Why don't we expose it as a configuration with default 1 so that any one can change it if they have an env as stated by your usecase.
          Hide
          stack added a comment -

          I'd be fine making this configurable as long as retry enabled was the default.

          Flushing is not the only write we do. Should we open a new issue to retry all hdfs operations? Put the retries into our wrapper around our filesystem instance, HFileSystem?

          Show
          stack added a comment - I'd be fine making this configurable as long as retry enabled was the default. Flushing is not the only write we do. Should we open a new issue to retry all hdfs operations? Put the retries into our wrapper around our filesystem instance, HFileSystem?
          Hide
          Andrew Purtell added a comment -

          Should we open a new issue to retry all hdfs operations? Put the retries into our wrapper around our filesystem instance, HFileSystem?

          We could have methods that accept parameters for the op and then an optional retry count/flag? Some cases won't want to retry?

          Show
          Andrew Purtell added a comment - Should we open a new issue to retry all hdfs operations? Put the retries into our wrapper around our filesystem instance, HFileSystem? We could have methods that accept parameters for the op and then an optional retry count/flag? Some cases won't want to retry?
          Hide
          chunhui shen added a comment -

          Why moving the location of validateStoreFile() call ?

          We can't do the retry in HStore#commitFile, but we could do the retry if failed validateStoreFile(), so move its location

          stack

          Should we open a new issue to retry all hdfs operations?

          We will do the hdfs operations for HFile and HLog, and we could tolerate IO errors in HLog now.
          So I think retry for flush is enough since IO errors in compaction are nothing matter

          For other comments, I will address in new patch

          Thanks

          Show
          chunhui shen added a comment - Why moving the location of validateStoreFile() call ? We can't do the retry in HStore#commitFile, but we could do the retry if failed validateStoreFile(), so move its location stack Should we open a new issue to retry all hdfs operations? We will do the hdfs operations for HFile and HLog, and we could tolerate IO errors in HLog now. So I think retry for flush is enough since IO errors in compaction are nothing matter For other comments, I will address in new patch Thanks
          Hide
          Lars Hofhansl added a comment -

          Seems this would be important in 0.94 as well (unless it relies on some 0.96 only stuff)

          Show
          Lars Hofhansl added a comment - Seems this would be important in 0.94 as well (unless it relies on some 0.96 only stuff)
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12563715/7507-trunk%20v2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 lineLengths. The patch does not introduce lines longer than 100

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.replication.TestReplicationWithCompression
          org.apache.hadoop.hbase.TestLocalHBaseCluster

          -1 core zombie tests. There are 2 zombie test(s):

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3930//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3930//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3930//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3930//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3930//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3930//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3930//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3930//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3930//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12563715/7507-trunk%20v2.patch against trunk revision . +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 hadoop2.0 . The patch compiles against the hadoop 2.0 profile. +1 javadoc . The javadoc tool did not generate any warning messages. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 lineLengths . The patch does not introduce lines longer than 100 -1 core tests . The patch failed these unit tests: org.apache.hadoop.hbase.replication.TestReplicationWithCompression org.apache.hadoop.hbase.TestLocalHBaseCluster -1 core zombie tests . There are 2 zombie test(s): Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3930//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3930//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3930//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3930//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3930//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3930//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3930//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3930//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3930//console This message is automatically generated.
          Hide
          Ted Yu added a comment -

          Patch v2 introduced two more integer fields into HStore.
          Can we use static variable, an example being closeCheckInterval, so that FIXED_OVERHEAD doesn't increase ?

                HStore.closeCheckInterval = conf.getInt(
                    "hbase.hstore.close.check.interval", 10*1000*1000 /* 10 MB */);
          
          Show
          Ted Yu added a comment - Patch v2 introduced two more integer fields into HStore. Can we use static variable, an example being closeCheckInterval, so that FIXED_OVERHEAD doesn't increase ? HStore.closeCheckInterval = conf.getInt( "hbase.hstore.close.check.interval" , 10*1000*1000 /* 10 MB */);
          Hide
          chunhui shen added a comment -

          Attaching patch v3 as Ted's suggestion

          Show
          chunhui shen added a comment - Attaching patch v3 as Ted's suggestion
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12564325/7507-trunkv3.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 lineLengths. The patch does not introduce lines longer than 100

          +1 core tests. The patch passed unit tests in .

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3974//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3974//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3974//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3974//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3974//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3974//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3974//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3974//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3974//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12564325/7507-trunkv3.patch against trunk revision . +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 hadoop2.0 . The patch compiles against the hadoop 2.0 profile. +1 javadoc . The javadoc tool did not generate any warning messages. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 lineLengths . The patch does not introduce lines longer than 100 +1 core tests . The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3974//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3974//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3974//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3974//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3974//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3974//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3974//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3974//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3974//console This message is automatically generated.
          Hide
          Ted Yu added a comment -

          +1 on latest patch.
          Nice to see a green QA run.

          Show
          Ted Yu added a comment - +1 on latest patch. Nice to see a green QA run.
          Hide
          Lars Hofhansl added a comment - - edited

          What if I do not want to retry? With the current patch that is impossible (setting the retry number to 0 will force the code to reget the parameter on each call).

          Show
          Lars Hofhansl added a comment - - edited What if I do not want to retry? With the current patch that is impossible (setting the retry number to 0 will force the code to reget the parameter on each call).
          Hide
          chunhui shen added a comment -

          Lars Hofhansl

          What if I do not want to retry?

          Set HStore.flush_retries_number as 1, it means no retry. Maybe name "retries" is not suitable. But we use 'retries' in "hbase.client.retries.number", in order to keep the same, here I use the name 'retries'.
          Setting 'hbase.client.retries.number' as 1 means no retry, did we changed this semantic?

          setting the retry number to 0 will force the code to reget the parameter on each call

          +      if (HStore.flush_retries_number <= 0) {
          +        throw new IllegalArgumentException(
          +            "hbase.hstore.flush.retries.number must be > 0, not "
          +                + HStore.flush_retries_number);
          +      }
          

          Server will go down if set the retry number to 0

          Show
          chunhui shen added a comment - Lars Hofhansl What if I do not want to retry? Set HStore.flush_retries_number as 1, it means no retry. Maybe name "retries" is not suitable. But we use 'retries' in "hbase.client.retries.number", in order to keep the same, here I use the name 'retries'. Setting 'hbase.client.retries.number' as 1 means no retry, did we changed this semantic? setting the retry number to 0 will force the code to reget the parameter on each call + if (HStore.flush_retries_number <= 0) { + throw new IllegalArgumentException( + "hbase.hstore.flush.retries.number must be > 0, not " + + HStore.flush_retries_number); + } Server will go down if set the retry number to 0
          Hide
          Lars Hofhansl added a comment -

          You're right (looked at the patch again).
          +1 then

          Show
          Lars Hofhansl added a comment - You're right (looked at the patch again). +1 then
          Hide
          chunhui shen added a comment -

          Commit it tomorrow if no objections
          Thanks for the review, Ted, Ram, Lars

          Show
          chunhui shen added a comment - Commit it tomorrow if no objections Thanks for the review, Ted, Ram, Lars
          Hide
          Lars Hofhansl added a comment -

          Feel free to commit Chunhui

          Show
          Lars Hofhansl added a comment - Feel free to commit Chunhui
          Hide
          chunhui shen added a comment -

          Integrated to trunk and branch 0.94

          Show
          chunhui shen added a comment - Integrated to trunk and branch 0.94
          Hide
          Hudson added a comment -

          Integrated in HBase-TRUNK #3772 (See https://builds.apache.org/job/HBase-TRUNK/3772/)
          HBASE-7507 Make memstore flush be able to retry after exception (Chunhui) (Revision 1436111)

          Result = FAILURE
          zjushch :
          Files :

          • /hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java
          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HStore.java
          Show
          Hudson added a comment - Integrated in HBase-TRUNK #3772 (See https://builds.apache.org/job/HBase-TRUNK/3772/ ) HBASE-7507 Make memstore flush be able to retry after exception (Chunhui) (Revision 1436111) Result = FAILURE zjushch : Files : /hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HStore.java
          Hide
          Hudson added a comment -

          Integrated in HBase-0.94 #748 (See https://builds.apache.org/job/HBase-0.94/748/)
          HBASE-7507 Make memstore flush be able to retry after exception (Chunhui) (Revision 1436110)

          Result = FAILURE
          zjushch :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/HConstants.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
          Show
          Hudson added a comment - Integrated in HBase-0.94 #748 (See https://builds.apache.org/job/HBase-0.94/748/ ) HBASE-7507 Make memstore flush be able to retry after exception (Chunhui) (Revision 1436110) Result = FAILURE zjushch : Files : /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/HConstants.java /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
          Hide
          Hudson added a comment -

          Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #364 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/364/)
          HBASE-7507 Make memstore flush be able to retry after exception (Chunhui) (Revision 1436111)

          Result = FAILURE
          zjushch :
          Files :

          • /hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java
          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HStore.java
          Show
          Hudson added a comment - Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #364 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/364/ ) HBASE-7507 Make memstore flush be able to retry after exception (Chunhui) (Revision 1436111) Result = FAILURE zjushch : Files : /hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HStore.java
          Hide
          Hudson added a comment -

          Integrated in HBase-0.94-security #96 (See https://builds.apache.org/job/HBase-0.94-security/96/)
          HBASE-7507 Make memstore flush be able to retry after exception (Chunhui) (Revision 1436110)

          Result = FAILURE
          zjushch :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/HConstants.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
          Show
          Hudson added a comment - Integrated in HBase-0.94-security #96 (See https://builds.apache.org/job/HBase-0.94-security/96/ ) HBASE-7507 Make memstore flush be able to retry after exception (Chunhui) (Revision 1436110) Result = FAILURE zjushch : Files : /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/HConstants.java /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
          Hide
          Lars Hofhansl added a comment -

          I think this could be an candidate for revert in light of recent unspecific test failures. I'll double check before I do that.

          Show
          Lars Hofhansl added a comment - I think this could be an candidate for revert in light of recent unspecific test failures. I'll double check before I do that.
          Hide
          Lars Hofhansl added a comment -

          I would like to revert this from 0.94. Unless we wrap every write to HDFS inside a retry loop we have not gained anything anyway.
          Please speak up if you disagree.

          Show
          Lars Hofhansl added a comment - I would like to revert this from 0.94. Unless we wrap every write to HDFS inside a retry loop we have not gained anything anyway. Please speak up if you disagree.
          Hide
          chunhui shen added a comment -

          I have no objection for revert.

          Show
          chunhui shen added a comment - I have no objection for revert.
          Hide
          Lars Hofhansl added a comment -

          Reverted from 0.94. Sorry chunhui shen, there just have been test issues, and I would like to reduce the number of variables.

          Show
          Lars Hofhansl added a comment - Reverted from 0.94. Sorry chunhui shen , there just have been test issues, and I would like to reduce the number of variables.
          Hide
          Hudson added a comment -

          Integrated in HBase-0.94 #795 (See https://builds.apache.org/job/HBase-0.94/795/)
          HBASE-7507 Revert, due to 0.94 test stability issues (Revision 1439534)

          Result = SUCCESS
          larsh :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/HConstants.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
          Show
          Hudson added a comment - Integrated in HBase-0.94 #795 (See https://builds.apache.org/job/HBase-0.94/795/ ) HBASE-7507 Revert, due to 0.94 test stability issues (Revision 1439534) Result = SUCCESS larsh : Files : /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/HConstants.java /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
          Hide
          Hudson added a comment -

          Integrated in HBase-0.94-security #102 (See https://builds.apache.org/job/HBase-0.94-security/102/)
          HBASE-7507 Revert, due to 0.94 test stability issues (Revision 1439534)

          Result = SUCCESS
          larsh :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/HConstants.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
          Show
          Hudson added a comment - Integrated in HBase-0.94-security #102 (See https://builds.apache.org/job/HBase-0.94-security/102/ ) HBASE-7507 Revert, due to 0.94 test stability issues (Revision 1439534) Result = SUCCESS larsh : Files : /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/HConstants.java /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
          Hide
          Enis Soztutar added a comment -

          I just saw this. I've encountered the same thing and logged it HBASE-7385. My proposal there would be not to retry in the same flush request. But on exception, just retain the same snapshot, and on the next flush request comes, it will just reuse the same snapshot created earlier. What do you thing about this approach?

          Show
          Enis Soztutar added a comment - I just saw this. I've encountered the same thing and logged it HBASE-7385 . My proposal there would be not to retry in the same flush request. But on exception, just retain the same snapshot, and on the next flush request comes, it will just reuse the same snapshot created earlier. What do you thing about this approach?
          Hide
          Enis Soztutar added a comment -

          Reopened due to revert.

          Show
          Enis Soztutar added a comment - Reopened due to revert.
          Hide
          Lars Hofhansl added a comment -

          I had removed the 0.94 target and this is still in 0.96 so resolving is still valid.

          Show
          Lars Hofhansl added a comment - I had removed the 0.94 target and this is still in 0.96 so resolving is still valid.
          Hide
          Enis Soztutar added a comment -

          If the test failed on 0.94 due to this, shouldn't we revert fomr trunk as well?

          Show
          Enis Soztutar added a comment - If the test failed on 0.94 due to this, shouldn't we revert fomr trunk as well?
          Hide
          Lars Hofhansl added a comment -

          I still do not know if this effected any tests. Random tests were timing out or fail for other reasons that I hadn't seen before.
          In general we want that feature. Just that in 0.94 stability trumps a new feature.

          Show
          Lars Hofhansl added a comment - I still do not know if this effected any tests. Random tests were timing out or fail for other reasons that I hadn't seen before. In general we want that feature. Just that in 0.94 stability trumps a new feature.
          Hide
          Enis Soztutar added a comment -

          Sounds fair. Chunhui, what do you think about the test failures? Shall we keep the trunk patch? We probably want this in 0.94.6.

          Show
          Enis Soztutar added a comment - Sounds fair. Chunhui, what do you think about the test failures? Shall we keep the trunk patch? We probably want this in 0.94.6.
          Hide
          chunhui shen added a comment -

          I could take a deep look about the test failures for the root cause.

          IMO, many test cases are not very stable because of itself or potential bug

          Show
          chunhui shen added a comment - I could take a deep look about the test failures for the root cause. IMO, many test cases are not very stable because of itself or potential bug
          Hide
          Hudson added a comment -

          Integrated in HBase-0.94-security-on-Hadoop-23 #11 (See https://builds.apache.org/job/HBase-0.94-security-on-Hadoop-23/11/)
          HBASE-7507 Revert, due to 0.94 test stability issues (Revision 1439534)
          HBASE-7507 Make memstore flush be able to retry after exception (Chunhui) (Revision 1436110)

          Result = FAILURE
          larsh :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/HConstants.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java

          zjushch :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/HConstants.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
          Show
          Hudson added a comment - Integrated in HBase-0.94-security-on-Hadoop-23 #11 (See https://builds.apache.org/job/HBase-0.94-security-on-Hadoop-23/11/ ) HBASE-7507 Revert, due to 0.94 test stability issues (Revision 1439534) HBASE-7507 Make memstore flush be able to retry after exception (Chunhui) (Revision 1436110) Result = FAILURE larsh : Files : /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/HConstants.java /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java zjushch : Files : /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/HConstants.java /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
          Hide
          stack added a comment -

          Do we know what tests were failing when this patch was in 0.94? Do you remember Lars Hofhansl?

          Show
          stack added a comment - Do we know what tests were failing when this patch was in 0.94? Do you remember Lars Hofhansl ?
          Hide
          stack added a comment -

          This is an important one for riding over ha nn topology changes (as per Chunhui). Was seen on a cluster today.

          Show
          stack added a comment - This is an important one for riding over ha nn topology changes (as per Chunhui). Was seen on a cluster today.
          Hide
          Lars Hofhansl added a comment -

          I don't know whether this caused the test failures. At this point I'd say probably not.

          IMHO, this particular fix is only important if we have fixed all other write attempts for HDFS.

          Show
          Lars Hofhansl added a comment - I don't know whether this caused the test failures. At this point I'd say probably not. IMHO, this particular fix is only important if we have fixed all other write attempts for HDFS.
          Hide
          Himanshu Vashishtha added a comment -

          This patch looks safe to me (shouldn't introduce any flakiness as such). Ran it on jenkins on the current 0.94 and it was green. Rather, I think instead of re-trying the flush operation, why not just check whether the file system is available or not in a re-trying mode? That should be more efficient. Or, yo have considered that already?

          The other possible candidates in a running cluster I can see are Compaction and Log rolling. The former can be made to check the file system health in a retrying manner (if people agree, I can upload a patch for that).
          The log rolling looks a bit tricky because there are two idempotent operations involved: Creating a new HLog writer, and closing the existing one. Having a retrying loop for these (especially creating a new hlog file in the .logs directory) doesn't look to be a good idea. I would avoid doing that.
          Looking for more opinions?

          Show
          Himanshu Vashishtha added a comment - This patch looks safe to me (shouldn't introduce any flakiness as such). Ran it on jenkins on the current 0.94 and it was green. Rather, I think instead of re-trying the flush operation, why not just check whether the file system is available or not in a re-trying mode? That should be more efficient. Or, yo have considered that already? The other possible candidates in a running cluster I can see are Compaction and Log rolling. The former can be made to check the file system health in a retrying manner (if people agree, I can upload a patch for that). The log rolling looks a bit tricky because there are two idempotent operations involved: Creating a new HLog writer, and closing the existing one. Having a retrying loop for these (especially creating a new hlog file in the .logs directory) doesn't look to be a good idea. I would avoid doing that. Looking for more opinions?
          Hide
          Enis Soztutar added a comment -

          This is an important one for riding over ha nn topology changes (as per Chunhui). Was seen on a cluster today.

          As I reported in HBASE-7385, we've also seen this in NN HA tests.

          IMHO, this particular fix is only important if we have fixed all other write attempts for HDFS.

          We have seen some other edge case, where NN dies just before returning the RPC response for create file, next retry from the DFS client fails due to file already exists exception. I think I've logged it somewhere. Regardless, I think, fixing the memstore flush is important, since it causes RS to abort on fail.

          Should we commit it, and if tests start failing, fix them later?

          Show
          Enis Soztutar added a comment - This is an important one for riding over ha nn topology changes (as per Chunhui). Was seen on a cluster today. As I reported in HBASE-7385 , we've also seen this in NN HA tests. IMHO, this particular fix is only important if we have fixed all other write attempts for HDFS. We have seen some other edge case, where NN dies just before returning the RPC response for create file, next retry from the DFS client fails due to file already exists exception. I think I've logged it somewhere. Regardless, I think, fixing the memstore flush is important, since it causes RS to abort on fail. Should we commit it, and if tests start failing, fix them later?
          Hide
          Andrew Purtell added a comment -

          Should we commit it, and if tests start failing, fix them later?

          I would be curious to see if committing this again causes an uptick in test failure or not, if the previous observation was in fact correlated.

          Show
          Andrew Purtell added a comment - Should we commit it, and if tests start failing, fix them later? I would be curious to see if committing this again causes an uptick in test failure or not, if the previous observation was in fact correlated.
          Hide
          chunhui shen added a comment -

          The other possible candidates in a running cluster I can see are Compaction and Log rolling

          Failure hdfs operation of Compaction and Log rolling won't cause RS to abort, so I think these failures are acceptable.

          Show
          chunhui shen added a comment - The other possible candidates in a running cluster I can see are Compaction and Log rolling Failure hdfs operation of Compaction and Log rolling won't cause RS to abort, so I think these failures are acceptable.
          Hide
          stack added a comment -

          A review of all FS access to ferret out where we should retry and where we should not is a big job. This small patch improves the situation. It is in trunk already. Would you be up for Andy's suggestion Lars of trying this patch on 0.94 branch? As Chunhui says, this patch addresses at least the most aggrevating case, the one that causes a RS crash out. It is an ugly patch. We should work on something more comprehensive as Himanshu Vashishtha says in trunk/0.96 (thanks).

          Show
          stack added a comment - A review of all FS access to ferret out where we should retry and where we should not is a big job. This small patch improves the situation. It is in trunk already. Would you be up for Andy's suggestion Lars of trying this patch on 0.94 branch? As Chunhui says, this patch addresses at least the most aggrevating case, the one that causes a RS crash out. It is an ugly patch. We should work on something more comprehensive as Himanshu Vashishtha says in trunk/0.96 (thanks).
          Hide
          Lars Hofhansl added a comment -

          Yeah, let's recommit. +1

          Show
          Lars Hofhansl added a comment - Yeah, let's recommit. +1
          Hide
          Himanshu Vashishtha added a comment -

          +1 to the re-commit.

          chunhui shen Yes compaction will not abort the rs, but HLog rolling will abort it. It's in the LogRoller#run where it catches IOException and calls the abort. If we go with the re-trying logic, we need it at two places there. First, at the log creation (one can do so at the HLog#createWriter()), and other while archiving old logs.

          I agree with Stack that it is not a good fix, and I would try to avoid this for log rolling (which is not so frequent as flushes are).

          Show
          Himanshu Vashishtha added a comment - +1 to the re-commit. chunhui shen Yes compaction will not abort the rs, but HLog rolling will abort it. It's in the LogRoller#run where it catches IOException and calls the abort. If we go with the re-trying logic, we need it at two places there. First, at the log creation (one can do so at the HLog#createWriter()), and other while archiving old logs. I agree with Stack that it is not a good fix, and I would try to avoid this for log rolling (which is not so frequent as flushes are).
          Hide
          stack added a comment -

          Regards the 0.94 patch, why are these statics?

          + private static final int DEFAULT_FLUSH_RETRIES_NUMBER = 10;
          + private static int flush_retries_number;
          + private static int pauseTime;

          Patch looks fine. Pity as said already that we have to localize the retry and rather we can't put all the retries behind a filesystem facade; we can do this for 0.96....

          I wonder what happens retrying after an IOE. Is it ok retrying any IOE? Has the flush path been reviewed to make sure only IOE is failed NN op? Is it possible to get an IOE after the file has been successfully written?

          Just wondering. Would say commit – because helps us work with HA NN (HANN).

          Show
          stack added a comment - Regards the 0.94 patch, why are these statics? + private static final int DEFAULT_FLUSH_RETRIES_NUMBER = 10; + private static int flush_retries_number; + private static int pauseTime; Patch looks fine. Pity as said already that we have to localize the retry and rather we can't put all the retries behind a filesystem facade; we can do this for 0.96.... I wonder what happens retrying after an IOE. Is it ok retrying any IOE? Has the flush path been reviewed to make sure only IOE is failed NN op? Is it possible to get an IOE after the file has been successfully written? Just wondering. Would say commit – because helps us work with HA NN (HANN).
          Hide
          stack added a comment -

          Oh, I'll commit in a while unless I am beaten to it.

          Show
          stack added a comment - Oh, I'll commit in a while unless I am beaten to it.
          Hide
          stack added a comment -

          Here is issue used to reapply 0.94 version of the patch to 0.94 branch.

          Show
          stack added a comment - Here is issue used to reapply 0.94 version of the patch to 0.94 branch.
          Hide
          stack added a comment -

          Done. Hopefully it doesn't mess up the 0.94 tests. Flag me and I'll remove it again.

          Show
          stack added a comment - Done. Hopefully it doesn't mess up the 0.94 tests. Flag me and I'll remove it again.
          Hide
          stack added a comment -

          This was committed to trunk a while back. A new issue just committed it to 0.94. Resolving.

          Show
          stack added a comment - This was committed to trunk a while back. A new issue just committed it to 0.94. Resolving.
          Hide
          Hudson added a comment -

          Integrated in HBase-0.94 #858 (See https://builds.apache.org/job/HBase-0.94/858/)
          HBASE-7929 Reapply hbase-7507 "Make memstore flush be able to retry after exception" to 0.94 branch (Revision 1449779)

          Result = FAILURE

          Show
          Hudson added a comment - Integrated in HBase-0.94 #858 (See https://builds.apache.org/job/HBase-0.94/858/ ) HBASE-7929 Reapply hbase-7507 "Make memstore flush be able to retry after exception" to 0.94 branch (Revision 1449779) Result = FAILURE
          Hide
          chunhui shen added a comment -

          I wonder what happens retrying after an IOE. Is it ok retrying any IOE? Has the flush path been reviewed to make sure only IOE is failed NN op? Is it possible to get an IOE after the file has been successfully written?

          Before flushed file committed, we could always retry flushing.

          Flush path is always unique by StoreFile#getUniqueFile.

          Store#validateStoreFile may throw exception after successfully written when switching namenode in the NamenodeHA environment. We have encountered this case.

          Show
          chunhui shen added a comment - I wonder what happens retrying after an IOE. Is it ok retrying any IOE? Has the flush path been reviewed to make sure only IOE is failed NN op? Is it possible to get an IOE after the file has been successfully written? Before flushed file committed, we could always retry flushing. Flush path is always unique by StoreFile#getUniqueFile. Store#validateStoreFile may throw exception after successfully written when switching namenode in the NamenodeHA environment. We have encountered this case.
          Hide
          Hudson added a comment -

          Integrated in HBase-0.94-security #112 (See https://builds.apache.org/job/HBase-0.94-security/112/)
          HBASE-7929 Reapply hbase-7507 "Make memstore flush be able to retry after exception" to 0.94 branch (Revision 1449779)

          Result = SUCCESS

          Show
          Hudson added a comment - Integrated in HBase-0.94-security #112 (See https://builds.apache.org/job/HBase-0.94-security/112/ ) HBASE-7929 Reapply hbase-7507 "Make memstore flush be able to retry after exception" to 0.94 branch (Revision 1449779) Result = SUCCESS
          Hide
          Hudson added a comment -

          Integrated in HBase-0.94 #869 (See https://builds.apache.org/job/HBase-0.94/869/)
          HBASE-7929 Reapply hbase-7507 "Make memstore flush be able to retry after exception" to 0.94 branch. (Revision 1451087)

          Result = SUCCESS

          Show
          Hudson added a comment - Integrated in HBase-0.94 #869 (See https://builds.apache.org/job/HBase-0.94/869/ ) HBASE-7929 Reapply hbase-7507 "Make memstore flush be able to retry after exception" to 0.94 branch. (Revision 1451087) Result = SUCCESS
          Hide
          stack added a comment -

          Thanks chunhui shen

          Show
          stack added a comment - Thanks chunhui shen
          Hide
          Hudson added a comment -

          Integrated in HBase-0.94-security #113 (See https://builds.apache.org/job/HBase-0.94-security/113/)
          HBASE-7929 Reapply hbase-7507 "Make memstore flush be able to retry after exception" to 0.94 branch. (Revision 1451087)

          Result = FAILURE

          Show
          Hudson added a comment - Integrated in HBase-0.94-security #113 (See https://builds.apache.org/job/HBase-0.94-security/113/ ) HBASE-7929 Reapply hbase-7507 "Make memstore flush be able to retry after exception" to 0.94 branch. (Revision 1451087) Result = FAILURE
          Hide
          Hudson added a comment -

          Integrated in HBase-0.94-security-on-Hadoop-23 #12 (See https://builds.apache.org/job/HBase-0.94-security-on-Hadoop-23/12/)
          HBASE-7929 Reapply hbase-7507 "Make memstore flush be able to retry after exception" to 0.94 branch. (Revision 1451087)
          HBASE-7929 Reapply hbase-7507 "Make memstore flush be able to retry after exception" to 0.94 branch (Revision 1449779)

          Result = FAILURE

          Show
          Hudson added a comment - Integrated in HBase-0.94-security-on-Hadoop-23 #12 (See https://builds.apache.org/job/HBase-0.94-security-on-Hadoop-23/12/ ) HBASE-7929 Reapply hbase-7507 "Make memstore flush be able to retry after exception" to 0.94 branch. (Revision 1451087) HBASE-7929 Reapply hbase-7507 "Make memstore flush be able to retry after exception" to 0.94 branch (Revision 1449779) Result = FAILURE
          Hide
          Hudson added a comment -

          Integrated in HBase-0.94 #957 (See https://builds.apache.org/job/HBase-0.94/957/)
          HBASE-7929 Reapply hbase-7507 'Make memstore flush be able to retry after exception' to 0.94 branch. (Original patch by chunhui shen) (Revision 1467121)

          Result = SUCCESS

          Show
          Hudson added a comment - Integrated in HBase-0.94 #957 (See https://builds.apache.org/job/HBase-0.94/957/ ) HBASE-7929 Reapply hbase-7507 'Make memstore flush be able to retry after exception' to 0.94 branch. (Original patch by chunhui shen) (Revision 1467121) Result = SUCCESS
          Hide
          Hudson added a comment -

          Integrated in HBase-0.94-security #134 (See https://builds.apache.org/job/HBase-0.94-security/134/)
          HBASE-7929 Reapply hbase-7507 'Make memstore flush be able to retry after exception' to 0.94 branch. (Original patch by chunhui shen) (Revision 1467121)

          Result = FAILURE

          Show
          Hudson added a comment - Integrated in HBase-0.94-security #134 (See https://builds.apache.org/job/HBase-0.94-security/134/ ) HBASE-7929 Reapply hbase-7507 'Make memstore flush be able to retry after exception' to 0.94 branch. (Original patch by chunhui shen) (Revision 1467121) Result = FAILURE

            People

            • Assignee:
              chunhui shen
              Reporter:
              chunhui shen
            • Votes:
              0 Vote for this issue
              Watchers:
              14 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development