Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.99.0
    • Component/s: MTTR, wal
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Distributed Log Replay has been disabled again for 1.0.x releases. See HBASE-12577

      Description

      Enable 'distributed log replay' by default. Depends on hfilev3 being enabled.
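
      For reference, a minimal sketch of the two settings in play. This is illustrative only, not the patch itself; the property names are assumed from the constants (DISTRIBUTED_LOG_REPLAY_KEY and FORMAT_VERSION_KEY) referenced in the comments below.

          // Illustrative sketch only; property names are assumptions based on the
          // constants discussed in this issue, not quoted from the committed patch.
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.hbase.HBaseConfiguration;

          public class DlrConfigSketch {
            public static void main(String[] args) {
              Configuration conf = HBaseConfiguration.create();
              // Distributed log replay becomes the default with this change.
              conf.setBoolean("hbase.master.distributed.log.replay", true);
              // DLR depends on hfile v3 being enabled (per the description above).
              conf.setInt("hfile.format.version", 3);
              // An operator wanting the old log-splitting behavior back could set:
              // conf.setBoolean("hbase.master.distributed.log.replay", false);
            }
          }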

      1. 10888v3.txt
        8 kB
        stack
      2. 10888v2.txt
        4 kB
        stack
      3. 10888v2.txt
        4 kB
        stack
      4. 10888v2.txt
        4 kB
        stack
      5. 10888.txt
        4 kB
        stack

        Issue Links

          Activity

          stack added a comment -

          Enable distributed log replay as default. Checks that hfile is at least version 3 also.
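
          As a rough illustration of the check described here (a hypothetical sketch, not the committed code; the helper name, defaults, and message text are assumptions):

          // Hypothetical illustration of the described hfile-version guard.
          import org.apache.hadoop.conf.Configuration;

          public final class DlrPrereqCheckSketch {
            /** Fails fast if distributed log replay is on but the hfile format is below v3. */
            static void checkHFileVersionForDlr(Configuration conf) {
              boolean dlrEnabled = conf.getBoolean("hbase.master.distributed.log.replay", true);
              int hfileVersion = conf.getInt("hfile.format.version", 2);
              if (dlrEnabled && hfileVersion < 3) {
                throw new IllegalArgumentException(
                    "Distributed log replay requires hfile.format.version >= 3; found " + hfileVersion);
              }
            }
          }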

          stack added a comment -

          FYI Jeffrey Zhong You seen any issues w/ this sir?

          Jeffrey Zhong added a comment -

          No, let's go for it. Cheers!

          stack added a comment -

          Missing terminator in hbase-default.xml. This patch depends on the hfile v3 patch going in first.

          stack added a comment -

          Trying hadoopqa anyways.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12638104/10888v2.txt
          against trunk revision .
          ATTACHMENT ID: 12638104

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 lineLengths. The patch does not introduce lines longer than 100

          +1 site. The mvn site goal succeeds with this patch.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.master.TestMasterFileSystem
          org.apache.hadoop.hbase.master.TestDistributedLogSplitting

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/9162//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9162//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9162//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-thrift.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9162//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9162//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9162//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9162//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9162//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9162//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9162//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/9162//console

          This message is automatically generated.

          Jeffrey Zhong added a comment -

          Stack The test failures are because distributedLogReplay is now turned on by two configuration settings (FORMAT_VERSION_KEY & DISTRIBUTED_LOG_REPLAY_KEY) instead of one as before. If we add "conf.setInt("hfile.format.version", 3);" to those failed test cases, they should pass. We could also set "conf.setInt("hfile.format.version", 3);" for all tests in TestDistributedLogSplitting. Thanks.
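
          For illustration, a sketch of the suggested fix applied in a test's setup method (class and method names here are placeholders, not the actual TestDistributedLogSplitting code):

          // Sketch of the suggested test fix; names are placeholders.
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.hbase.HBaseTestingUtility;
          import org.junit.BeforeClass;

          public class TestWithDlrEnabledSketch {
            private static final HBaseTestingUtility TEST_UTIL = new HBaseTestingUtility();

            @BeforeClass
            public static void setUpBeforeClass() throws Exception {
              Configuration conf = TEST_UTIL.getConfiguration();
              // DLR is now gated on two settings, so tests must raise the hfile version too.
              conf.setInt("hfile.format.version", 3);
              // Shown for clarity; with this patch DLR is the default anyway.
              conf.setBoolean("hbase.master.distributed.log.replay", true);
              TEST_UTIL.startMiniCluster(1);
            }
          }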

          stack added a comment -

          Excellent Jeffrey Zhong Thank you for taking a look. It would have taken me ages to figure it out. I just committed the v3 patch, so let me rerun this patch... hopefully it will work now.

          stack added a comment -

          Retry now that hfilev3 is the default.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12638148/10888v2.txt
          against trunk revision .
          ATTACHMENT ID: 12638148

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 lineLengths. The patch does not introduce lines longer than 100

          +1 site. The mvn site goal succeeds with this patch.

          -1 core tests. The patch failed these unit tests:

          -1 core zombie tests. There are 1 zombie test(s): at org.apache.hadoop.hbase.master.TestMasterNoCluster.testNotPullingDeadRegionServerFromZK(TestMasterNoCluster.java:298)

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/9165//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9165//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9165//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9165//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9165//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9165//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9165//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9165//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9165//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-thrift.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9165//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/9165//console

          This message is automatically generated.

          stack added a comment -

          Playing with this on a cluster, it seems to basically work. Logs are split and stuff comes back again afterward. Would need to run the linked list and chaos monkey tests for a while to make sure all is really good, but it's good enough to commit I'd say.

          stack added a comment -

          I also ran a 0.98 cluster, crashed it, and then started a 0.99 cluster over it w/ this patch, and it split the logs and made progress.

          IntegrationTestMTTR seems to keep going too, after I fixed the conf dirs so it restarts the master w/ proper configs (it kills the master all the time, thinking it a regionserver).

          Let me see if I can get numbers to compare the recovery times.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12638187/10888v2.txt
          against trunk revision .
          ATTACHMENT ID: 12638187

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 lineLengths. The patch does not introduce lines longer than 100

          +1 site. The mvn site goal succeeds with this patch.

          -1 core tests. The patch failed these unit tests:

          -1 core zombie tests. There are 1 zombie test(s): at org.apache.hadoop.hbase.mapreduce.TestTableMapReduceBase.testMultiRegionTable(TestTableMapReduceBase.java:96)

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/9169//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9169//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9169//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9169//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9169//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9169//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9169//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9169//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9169//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-thrift.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9169//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/9169//console

          This message is automatically generated.

          stack added a comment -

          Here are some logging edits for DLR emissions.

          The failed test seems unrelated; it has failed the odd time over on the v3 patches.

          On getting numbers, it's tough to compare since the DLR savings come from not writing the intermediate edits files and then replaying them on open... would need an apples-to-apples comparison rather than this random IntegrationTestMTTR run.

          stack added a comment -

          Some rough numbers have it that they are about the same:

          0.98.1 took just over 6 seconds for ten logs (but one less splitter in the cluster since master now participates on trunk)

          2014-04-01 22:35:24,015 INFO  [main-EventThread] zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [c2023.halxg.cloudera.com,60020,1396416642448]
          2014-04-01 22:35:28,726 INFO  [MASTER_SERVER_OPERATIONS-c2020:60000-0] master.SplitLogManager: finished splitting (more than or equal to) 1172247640 bytes in 10 log files in [hdfs://c2020.halxg.cloudera.com:8020/hbase/WALs/c2023.halxg.cloudera.com,60020,1396416642448-splitting] in 4668ms
          2014-04-01 22:35:30,039 INFO  [MASTER_SERVER_OPERATIONS-c2020:60000-0] handler.ServerShutdownHandler: Finished processing of shutdown of c2023.halxg.cloudera.com,60020,1396416642448
          

          For trunk/0.99, it took 6.3 seconds, which is a little longer.

          2014-04-01 22:25:33,011 INFO  [main-EventThread] zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [c2021.halxg.cloudera.com,16020,1396415234624]
          2014-04-01 22:25:39,388 INFO  [M_LOG_REPLAY_OPS-c2020:16020-1] master.SplitLogManager: finished splitting (more than or equal to) 1303360768 bytes in 11 log files in [hdfs://c2020.halxg.cloudera.com:8020/hbase/WALs/c2021.halxg.cloudera.com,16020,1396415234624-splitting] in 5746ms
          

          Let me try and do a bigger test, more like what Jeffrey Zhong had over in HBASE-7006.

          stack added a comment -

          Another test with more log files, where I evened up the count of regionservers, has the two log-splitting systems at about the same, with DLR coming in just slightly faster:

          2014-04-02 09:42:57,017 INFO  [main-EventThread] zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [c2024.halxg.cloudera.com,60020,1396454105817]
          
          2014-04-02 09:43:08,559 INFO  [MASTER_SERVER_OPERATIONS-c2020:60000-4] master.SplitLogManager: finished splitting (more than or equal to) 4358519947 bytes in 34 log files in [hdfs://c2020.halxg.cloudera.com:8020/hbase/WALs/c2024.halxg.cloudera.com,60020,1396454105817-splitting] in 11513ms
          2014-04-02 09:43:10,900 INFO  [AM.ZK.Worker-pool2-t88] master.RegionStates: Onlined a6b2b9160737269b9a745cd58e9c5112 on c2023.halxg.cloudera.com,60020,1396454098240
          

          End-to-end DLR

          2014-04-02 21:02:24,015 INFO  [main-EventThread] zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [c2023.halxg.cloudera.com,16020,1396482188465]
          2014-04-02 21:02:37,499 INFO  [M_LOG_REPLAY_OPS-c2020:16020-1] master.SplitLogManager: finished splitting (more than or equal to) 4180462510 bytes in 33 log files in [hdfs://c2020.halxg.cloudera.com:8020/hbase/WALs/c2023.halxg.cloudera.com,16020,1396482188465-splitting] in 12645ms
          2014-04-02 21:02:37,499 INFO  [M_LOG_REPLAY_OPS-c2020:16020-1] master.DeadServer: Finished processing c2023.halxg.cloudera.com,16020,1396482188465
          

          13.5 vs 13.9 seconds.

          It's only 7 regions. Is that why we don't see much difference in the timings, Jeffrey Zhong? DLR does less work and facilitates further improvement in MTTR, so it should go in.

          Need a +1.

          Jeffrey Zhong added a comment -

          +1. Looks good to me!

          Yeah, 7 regions isn't a normal application. In a normal situation we should have around 70 regions per region server; our reference guide recommends 100 regions per region server (https://hbase.apache.org/book/regions.arch.html). Therefore, in a more realistic situation about 10 times more recovered-edits files would be created, which results in better performance for DLR because the number of files created/written during recovery in DLR won't increase much. The old way (recovered edits) would create/write 70 * 33 small recovered-edits files, which are random writes.

          In the current DLR we haven't implemented SKIP_WAL recovery; that's the reason we don't see a performance gain with a small number of regions/log files.

          The recovering for writes should be a clear win. Thanks.
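
          (A rough illustration of the arithmetic in this comment, nothing more: ~70 regions per region server times 33 log files is on the order of 70 * 33 = 2,310 small recovered-edits files written more or less randomly under the old scheme, whereas DLR replays the same edits into the ~70 already-open regions and skips the intermediate files.)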

          stack added a comment -

          Makes sense Jeffrey Zhong. In my little tests above, DLR actually assigned twice the number of regions and was still a little faster, so we are headed in the right direction. Let me commit.

          What else needs to be done here Jeffrey Zhong? We should do SKIP_WAL. And then what about taking writes immediately? You have the issues handy? Thanks boss.

          stack added a comment -

          If you have the issues handy we should peg them against 1.0 I'd say.

          Jeffrey Zhong added a comment -

          So far there are no pending issues. Let me try to do the SKIP_WAL thing. We talked about SKIP_WAL before, and the reason we didn't do it is that it would complicate the chained-failure recovery situation. Thanks.

          stack added a comment -

          Jeffrey Zhong Smile. Let's KIS if we can. What about taking writes while recovering?

          Jeffrey Zhong added a comment -

          Writes while recovering is on by default.

          stack added a comment -

          Jeffrey Zhong Coolio

          stack added a comment -

          Committed. Thanks Jeffrey Zhong

          Enis Soztutar added a comment -

          Closing this issue after 0.99.0 release.


            People

            • Assignee:
              stack
              Reporter:
              stack
            • Votes:
              0
              Watchers:
              6
