HBase

HBASE-4222: Make HLog more resilient to write pipeline failures

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.90.5
    • Component/s: wal
    • Labels: None

      Description

      The current implementation of HLog rolling to recover from transient errors in the write pipeline seems to have two problems:

      1. When HLog.LogSyncer hits an IOException during a time-based sync, it requests a log roll in the corresponding catch block, but only after escaping from the internal while loop. As a result, the LogSyncer thread exits and, as far as I can tell, is never restarted, even if the log roll succeeds.
      2. Log rolling requests triggered by an IOException in sync() or append() never happen if no entries have yet been written to the log. This means that write errors are not immediately recovered, which extends the exposure to more errors occurring in the pipeline.

      In addition, it seems like we should be able to better handle transient problems, such as a rolling restart of DataNodes while the HBase RegionServers are running. Currently this reliably causes RegionServer aborts during log rolling: either an append or a time-based sync triggers an initial IOException, initiating a log roll request. However, the log roll then fails while closing the current writer ("All datanodes are bad"), causing a RegionServer abort. In this case, it seems like we should at least offer an option to continue with the new writer and only abort on subsequent errors.
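      To make problem 1 above concrete, here is a minimal, hypothetical sketch (assumed class and method names, not the actual HBase patch) of a sync loop that survives an IOException by requesting a log roll inside the loop instead of letting the thread die:

      import java.io.IOException;

      // Hypothetical, simplified sketch of the fix for problem 1 above -- assumed
      // names, not the committed patch. The key point: the IOException is caught
      // inside the loop, so the syncer thread keeps running after a transient
      // pipeline failure and simply asks for a log roll.
      class LogSyncerSketch extends Thread {
        private final long syncIntervalMillis;
        private volatile boolean closing = false;

        LogSyncerSketch(long syncIntervalMillis) {
          this.syncIntervalMillis = syncIntervalMillis;
        }

        @Override
        public void run() {
          while (!closing) {
            try {
              Thread.sleep(syncIntervalMillis);
              sync();                      // time-based sync of pending WAL edits
            } catch (InterruptedException ie) {
              Thread.currentThread().interrupt();
              break;
            } catch (IOException ioe) {
              requestLogRoll();            // recover via a roll instead of exiting
            }
          }
        }

        void shutdown() { closing = true; this.interrupt(); }

        // Placeholders standing in for the real HLog sync/roll plumbing.
        void sync() throws IOException { /* flush outstanding edits to the writer */ }
        void requestLogRoll() { /* signal the log roller to replace the writer */ }
      }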

      Attachments

      1. HBASE-4222_0.90.patch
        15 kB
        Gary Helmling
      2. HBASE-4222_trunk_final.patch
        17 kB
        Gary Helmling


          Activity

          Hudson added a comment -

          Integrated in HBase-TRUNK #2150 (See https://builds.apache.org/job/HBase-TRUNK/2150/)
          Amend HBASE-4222 Fix intermittent test failure due to region balancing

          garyh :
          Files :

          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java
          Hudson added a comment -

          Integrated in HBase-TRUNK #2144 (See https://builds.apache.org/job/HBase-TRUNK/2144/)
          Amend HBASE-4222 Fix release version, now for 0.90.5, and fix for intermittent test failure

          garyh :
          Files :

          • /hbase/trunk/CHANGES.txt
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java
          Gary Helmling added a comment -

          Committed to 0.90 branch and trunk.

          Gary Helmling added a comment -

          Patch committed to 0.90 branch

          Hudson added a comment -

          Integrated in HBase-TRUNK #2142 (See https://builds.apache.org/job/HBase-TRUNK/2142/)
          HBASE-4222 Allow HLog to retry log roll on transient write pipeline errors

          garyh :
          Files :

          • /hbase/trunk/src/main/resources/hbase-default.xml
          • /hbase/trunk/CHANGES.txt
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/LogRoller.java
          Gary Helmling added a comment -

          Adding to 0.90 branch as well

          Gary Helmling added a comment -

          Committed patch to trunk following review.

          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1590/#review1586
          -----------------------------------------------------------

          Ship it!

          TestHLog and TestLogRolling passed.

          • Ted

          On 2011-08-20 05:39:30, Gary Helmling wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/1590/

          -----------------------------------------------------------

          (Updated 2011-08-20 05:39:30)

          Review request for hbase.

          Summary

          -------

          This patch corrects a few problems, as I see it, with the current log rolling process:

          1) HLog.LogSyncer.run() now handles an IOException in the inner while loop. Previously any IOException would cause the LogSyncer thread to exit, even if the subsequent log roll succeeded. This would mean the region server kept running without a LogSyncer thread

          2) Log rolls triggered by IOExceptions were being skipped in the event that there were no entries in the log. This would prevent the log from being recovered in a timely manner.

          3) minor - FailedLogCloseException was never actually being thrown out of HLog.cleanupCurrentWriter(), resulting in inaccurate logging on RS abort

          The bigger change is the addition of a configuration property – hbase.regionserver.logroll.errors.tolerated – that is checked against a counter of consecutive close errors to see whether or not an abort should be triggered.
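          As a rough illustration of that mechanism, the following sketch shows how a consecutive-close-error counter could be checked against hbase.regionserver.logroll.errors.tolerated; the class and method names here are assumed for illustration and are not taken from the committed patch:

          import java.io.IOException;

          // Illustrative sketch of the error-tolerance check described above; the
          // names are assumed, not copied from the actual HBase code.
          class LogRollErrorPolicy {
            // Value of hbase.regionserver.logroll.errors.tolerated.
            private final int errorsTolerated;
            private int consecutiveCloseErrors = 0;

            LogRollErrorPolicy(int errorsTolerated) {
              this.errorsTolerated = errorsTolerated;
            }

            // Called after each attempt to close the old writer during a log roll.
            void onCloseAttempt(boolean closeSucceeded) throws IOException {
              if (closeSucceeded) {
                consecutiveCloseErrors = 0;    // any successful close resets the count
                return;
              }
              consecutiveCloseErrors++;
              if (consecutiveCloseErrors > errorsTolerated) {
                // Too many consecutive failures: keep the old behavior and abort.
                throw new IOException("aborting after " + consecutiveCloseErrors
                    + " consecutive log close failures");
              }
              // Otherwise carry on with the newly opened writer and retry on the next roll.
            }
          }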

          Prior to this patch, we could readily trigger region server aborts by rolling all the data nodes in a cluster while region servers were running. This was equally true whether write activity was happening or not. (In fact I think having concurrent write activity actually gave a better chance for the log to be rolled prior to all DNs in the write pipeline going down and thus the region server not aborting).

          With this change and hbase.regionserver.logroll.errors.tolerated=2, I can roll DNs at will without causing any loss of service.

          I'd appreciate some scrutiny on any log rolling subtleties or interactions I may be missing here. If there are alternate/better ways to handle this in the DFSClient layer, I'd also appreciate any pointers to that.

          This addresses bug HBASE-4222.

          https://issues.apache.org/jira/browse/HBASE-4222

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/regionserver/LogRoller.java 8e87c83

          src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java c301d1b

          src/main/resources/hbase-default.xml 66548ca

          src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java 5063896

          Diff: https://reviews.apache.org/r/1590/diff

          Testing

          -------

          Added a new test for rolling data nodes under a running cluster: TestLogRolling.testLogRollOnPipelineRestart().

          Tested patch on a running cluster with 3 slaves, rolling data nodes with and without concurrent write activity.

          Thanks,

          Gary

          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1590/
          -----------------------------------------------------------

          (Updated 2011-08-20 05:39:30.112513)

          Review request for hbase.

          Changes
          -------

          Rebased patch against latest trunk, including HBASE-4095 changes. Changes are:

          • shift TestLogRolling mini-cluster startup from pre-class to pre-test. Following the HBASE-4095 changes, the new test method, testLogRollOnPipelineRestart(), was hanging from the previous tests' cluster manipulations.
          • add a default setting of hbase.regionserver.logroll.errors.tolerated=2 to hbase-default.xml (see the snippet after this list)
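
          For illustration only, the new property could be declared in hbase-default.xml roughly as follows; the description text is paraphrased from this discussion rather than copied from the committed patch:

          <!-- Illustrative hbase-default.xml entry; the description text below is
               paraphrased from this discussion, not taken from the patch. -->
          <property>
            <name>hbase.regionserver.logroll.errors.tolerated</name>
            <value>2</value>
            <description>
              Number of consecutive WAL close errors tolerated during log rolling
              before the region server aborts.
            </description>
          </property>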

          Summary
          -------

          This patch corrects a few problems, as I see it, with the current log rolling process:

          1) HLog.LogSyncer.run() now handles an IOException in the inner while loop. Previously any IOException would cause the LogSyncer thread to exit, even if the subsequent log roll succeeded. This would mean the region server kept running without a LogSyncer thread
          2) Log rolls triggered by IOExceptions were being skipped in the event that there were no entries in the log. This would prevent the log from being recovered in a timely manner.
          3) minor - FailedLogCloseException was never actually being thrown out of HLog.cleanupCurrentWriter(), resulting in inaccurate logging on RS abort

          The bigger change is the addition of a configuration property – hbase.regionserver.logroll.errors.tolerated – that is checked against a counter of consecutive close errors to see whether or not an abort should be triggered.

          Prior to this patch, we could readily trigger region server aborts by rolling all the data nodes in a cluster while region servers were running. This was equally true whether write activity was happening or not. (In fact I think having concurrent write activity actually gave a better chance for the log to be rolled prior to all DNs in the write pipeline going down and thus the region server not aborting).

          With this change and hbase.regionserver.logroll.errors.tolerated=2, I can roll DNs at will without causing any loss of service.

          I'd appreciate some scrutiny on any log rolling subtleties or interactions I may be missing here. If there are alternate/better ways to handle this in the DFSClient layer, I'd also appreciate any pointers to that.

          This addresses bug HBASE-4222.
          https://issues.apache.org/jira/browse/HBASE-4222

          Diffs (updated)


          src/main/java/org/apache/hadoop/hbase/regionserver/LogRoller.java 8e87c83
          src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java c301d1b
          src/main/resources/hbase-default.xml 66548ca
          src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java 5063896

          Diff: https://reviews.apache.org/r/1590/diff

          Testing
          -------

          Added a new test for rolling data nodes under a running cluster: TestLogRolling.testLogRollOnPipelineRestart().

          Tested patch on a running cluster with 3 slaves, rolling data nodes with and without concurrent write activity.

          Thanks,

          Gary

          jiraposter@reviews.apache.org added a comment -

          On 2011-08-19 18:58:21, Michael Stack wrote:

          >

          Will post an update with a default setting of 2 in hbase-default.xml and some fixes to TestLogRolling – my additional test is not playing nicely with the HBASE-4095 changes there at the moment.

          On 2011-08-19 18:58:21, Michael Stack wrote:

          > src/main/java/org/apache/hadoop/hbase/regionserver/LogRoller.java, line 87

          > <https://reviews.apache.org/r/1590/diff/1/?file=33750#file33750line87>

          >

          > How do you manually roll a log? I want that.

          Probably wouldn't be too hard to add a RPC call and shell command to manually trigger a roll. That would be nice to have, but I'll leave it for a separate issue.

          (Log message just means triggered by HLog.requestLogRoll(), meaning from an IOException, or current log size, or replica count below threshold).

          • Gary

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1590/#review1557
          -----------------------------------------------------------

          On 2011-08-19 18:33:11, Gary Helmling wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/1590/

          -----------------------------------------------------------

          (Updated 2011-08-19 18:33:11)

          Review request for hbase.

          Summary

          -------

          This patch corrects a few problems, as I see it, with the current log rolling process:

          1) HLog.LogSyncer.run() now handles an IOException in the inner while loop. Previously any IOException would cause the LogSyncer thread to exit, even if the subsequent log roll succeeded. This would mean the region server kept running without a LogSyncer thread

          2) Log rolls triggered by IOExceptions were being skipped in the event that there were no entries in the log. This would prevent the log from being recovered in a timely manner.

          3) minor - FailedLogCloseException was never actually being thrown out of HLog.cleanupCurrentWriter(), resulting in inaccurate logging on RS abort

          The bigger change is the addition of a configuration property – hbase.regionserver.logroll.errors.tolerated – that is checked against a counter of consecutive close errors to see whether or not an abort should be triggered.

          Prior to this patch, we could readily trigger region server aborts by rolling all the data nodes in a cluster while region servers were running. This was equally true whether write activity was happening or not. (In fact I think having concurrent write activity actually gave a better chance for the log to be rolled prior to all DNs in the write pipeline going down and thus the region server not aborting).

          With this change and hbase.regionserver.logroll.errors.tolerated=2, I can roll DNs at will without causing any loss of service.

          I'd appreciate some scrutiny on any log rolling subtleties or interactions I may be missing here. If there are alternate/better ways to handle this in the DFSClient layer, I'd also appreciate any pointers to that.

          This addresses bug HBASE-4222.

          https://issues.apache.org/jira/browse/HBASE-4222

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/regionserver/LogRoller.java 8e87c83

          src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java 887f736

          src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java 287f1fb

          Diff: https://reviews.apache.org/r/1590/diff

          Testing

          -------

          Added a new test for rolling data nodes under a running cluster: TestLogRolling.testLogRollOnPipelineRestart().

          Tested patch on a running cluster with 3 slaves, rolling data nodes with and without concurrent write activity.

          Thanks,

          Gary

          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1590/#review1567
          -----------------------------------------------------------

          Ship it!

          src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java
          <https://reviews.apache.org/r/1590/#comment3542>

          Good to keep the default here as the current behavior, but we should maybe set this to 2 or 3 in hbase-default.xml?

          • Andrew

          On 2011-08-19 18:33:11, Gary Helmling wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/1590/

          -----------------------------------------------------------

          (Updated 2011-08-19 18:33:11)

          Review request for hbase.

          Summary

          -------

          This patch corrects a few problems, as I see it, with the current log rolling process:

          1) HLog.LogSyncer.run() now handles an IOException in the inner while loop. Previously any IOException would cause the LogSyncer thread to exit, even if the subsequent log roll succeeded. This would mean the region server kept running without a LogSyncer thread

          2) Log rolls triggered by IOExceptions were being skipped in the event that there were no entries in the log. This would prevent the log from being recovered in a timely manner.

          3) minor - FailedLogCloseException was never actually being thrown out of HLog.cleanupCurrentWriter(), resulting in inaccurate logging on RS abort

          The bigger change is the addition of a configuration property – hbase.regionserver.logroll.errors.tolerated – that is checked against a counter of consecutive close errors to see whether or not an abort should be triggered.

          Prior to this patch, we could readily trigger region server aborts by rolling all the data nodes in a cluster while region servers were running. This was equally true whether write activity was happening or not. (In fact I think having concurrent write activity actually gave a better chance for the log to be rolled prior to all DNs in the write pipeline going down and thus the region server not aborting).

          With this change and hbase.regionserver.logroll.errors.tolerated=2, I can roll DNs at will without causing any loss of service.

          I'd appreciate some scrutiny on any log rolling subtleties or interactions I may be missing here. If there are alternate/better ways to handle this in the DFSClient layer, I'd also appreciate any pointers to that.

          This addresses bug HBASE-4222.

          https://issues.apache.org/jira/browse/HBASE-4222

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/regionserver/LogRoller.java 8e87c83

          src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java 887f736

          src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java 287f1fb

          Diff: https://reviews.apache.org/r/1590/diff

          Testing

          -------

          Added a new test for rolling data nodes under a running cluster: TestLogRolling.testLogRollOnPipelineRestart().

          Tested patch on a running cluster with 3 slaves, rolling data nodes with and without concurrent write activity.

          Thanks,

          Gary

          Ted Yu added a comment -

          @Gary:
          Can you rebase the patch now that HBASE-4095 got integrated?

          Hunk #7 succeeded at 1055 (offset 21 lines).
          1 out of 7 hunks FAILED -- saving rejects to file src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java.rej
          patching file src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java
          Hunk #1 FAILED at 19.
          Hunk #2 FAILED at 67.
          Hunk #3 succeeded at 122 (offset -2 lines).
          Hunk #4 succeeded at 378 with fuzz 2 (offset 42 lines).
          2 out of 4 hunks FAILED -- saving rejects to file src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java.rej
          

          Thanks

          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1590/#review1557
          -----------------------------------------------------------

          src/main/java/org/apache/hadoop/hbase/regionserver/LogRoller.java
          <https://reviews.apache.org/r/1590/#comment3531>

          How do you manually roll a log? I want that.

          • Michael

          On 2011-08-19 18:33:11, Gary Helmling wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/1590/

          -----------------------------------------------------------

          (Updated 2011-08-19 18:33:11)

          Review request for hbase.

          Summary

          -------

          This patch corrects a few problems, as I see it, with the current log rolling process:

          1) HLog.LogSyncer.run() now handles an IOException in the inner while loop. Previously any IOException would cause the LogSyncer thread to exit, even if the subsequent log roll succeeded. This would mean the region server kept running without a LogSyncer thread

          2) Log rolls triggered by IOExceptions were being skipped in the event that there were no entries in the log. This would prevent the log from being recovered in a timely manner.

          3) minor - FailedLogCloseException was never actually being thrown out of HLog.cleanupCurrentWriter(), resulting in inaccurate logging on RS abort

          The bigger change is the addition of a configuration property – hbase.regionserver.logroll.errors.tolerated – that is checked against a counter of consecutive close errors to see whether or not an abort should be triggered.

          Prior to this patch, we could readily trigger region server aborts by rolling all the data nodes in a cluster while region servers were running. This was equally true whether write activity was happening or not. (In fact I think having concurrent write activity actually gave a better chance for the log to be rolled prior to all DNs in the write pipeline going down and thus the region server not aborting).

          With this change and hbase.regionserver.logroll.errors.tolerated=2, I can roll DNs at will without causing any loss of service.

          I'd appreciate some scrutiny on any log rolling subtleties or interactions I may be missing here. If there are alternate/better ways to handle this in the DFSClient layer, I'd also appreciate any pointers to that.

          This addresses bug HBASE-4222.

          https://issues.apache.org/jira/browse/HBASE-4222

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/regionserver/LogRoller.java 8e87c83

          src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java 887f736

          src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java 287f1fb

          Diff: https://reviews.apache.org/r/1590/diff

          Testing

          -------

          Added a new test for rolling data nodes under a running cluster: TestLogRolling.testLogRollOnPipelineRestart().

          Tested patch on a running cluster with 3 slaves, rolling data nodes with and without concurrent write activity.

          Thanks,

          Gary

          jiraposter@reviews.apache.org added a comment -

          On 2011-08-19 18:55:23, Michael Stack wrote:

          >

          Should we default to two errors?

          • Michael

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1590/#review1555
          -----------------------------------------------------------

          On 2011-08-19 18:33:11, Gary Helmling wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/1590/

          -----------------------------------------------------------

          (Updated 2011-08-19 18:33:11)

          Review request for hbase.

          Summary

          -------

          This patch corrects a few problems, as I see it, with the current log rolling process:

          1) HLog.LogSyncer.run() now handles an IOException in the inner while loop. Previously any IOException would cause the LogSyncer thread to exit, even if the subsequent log roll succeeded. This would mean the region server kept running without a LogSyncer thread

          2) Log rolls triggered by IOExceptions were being skipped in the event that there were no entries in the log. This would prevent the log from being recovered in a timely manner.

          3) minor - FailedLogCloseException was never actually being thrown out of HLog.cleanupCurrentWriter(), resulting in inaccurate logging on RS abort

          The bigger change is the addition of a configuration property – hbase.regionserver.logroll.errors.tolerated – that is checked against a counter of consecutive close errors to see whether or not an abort should be triggered.

          Prior to this patch, we could readily trigger region server aborts by rolling all the data nodes in a cluster while region servers were running. This was equally true whether write activity was happening or not. (In fact I think having concurrent write activity actually gave a better chance for the log to be rolled prior to all DNs in the write pipeline going down and thus the region server not aborting).

          With this change and hbase.regionserver.logroll.errors.tolerated=2, I can roll DNs at will without causing any loss of service.

          I'd appreciate some scrutiny on any log rolling subtleties or interactions I may be missing here. If there are alternate/better ways to handle this in the DFSClient layer, I'd also appreciate any pointers to that.

          This addresses bug HBASE-4222.

          https://issues.apache.org/jira/browse/HBASE-4222

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/regionserver/LogRoller.java 8e87c83

          src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java 887f736

          src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java 287f1fb

          Diff: https://reviews.apache.org/r/1590/diff

          Testing

          -------

          Added a new test for rolling data nodes under a running cluster: TestLogRolling.testLogRollOnPipelineRestart().

          Tested patch on a running cluster with 3 slaves, rolling data nodes with and without concurrent write activity.

          Thanks,

          Gary

          Ted Yu added a comment -

          In point 2 of description on review board:

          This would prevent the log from being recovered in a timely manner.

          Is the above description accurate?

          Nice work Gary.

          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1590/#review1555
          -----------------------------------------------------------

          Ship it!

          src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java
          <https://reviews.apache.org/r/1590/#comment3530>

          oops

          • Michael

          On 2011-08-19 18:33:11, Gary Helmling wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/1590/

          -----------------------------------------------------------

          (Updated 2011-08-19 18:33:11)

          Review request for hbase.

          Summary

          -------

          This patch corrects a few problems, as I see it, with the current log rolling process:

          1) HLog.LogSyncer.run() now handles an IOException in the inner while loop. Previously any IOException would cause the LogSyncer thread to exit, even if the subsequent log roll succeeded. This would mean the region server kept running without a LogSyncer thread

          2) Log rolls triggered by IOExceptions were being skipped in the event that there were no entries in the log. This would prevent the log from being recovered in a timely manner.

          3) minor - FailedLogCloseException was never actually being thrown out of HLog.cleanupCurrentWriter(), resulting in inaccurate logging on RS abort

          The bigger change is the addition of a configuration property – hbase.regionserver.logroll.errors.tolerated – that is checked against a counter of consecutive close errors to see whether or not an abort should be triggered.

          Prior to this patch, we could readily trigger region server aborts by rolling all the data nodes in a cluster while region servers were running. This was equally true whether write activity was happening or not. (In fact I think having concurrent write activity actually gave a better chance for the log to be rolled prior to all DNs in the write pipeline going down and thus the region server not aborting).

          With this change and hbase.regionserver.logroll.errors.tolerated=2, I can roll DNs at will without causing any loss of service.

          I'd appreciate some scrutiny on any log rolling subtleties or interactions I may be missing here. If there are alternate/better ways to handle this in the DFSClient layer, I'd also appreciate any pointers to that.

          This addresses bug HBASE-4222.

          https://issues.apache.org/jira/browse/HBASE-4222

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/regionserver/LogRoller.java 8e87c83

          src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java 887f736

          src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java 287f1fb

          Diff: https://reviews.apache.org/r/1590/diff

          Testing

          -------

          Added a new test for rolling data nodes under a running cluster: TestLogRolling.testLogRollOnPipelineRestart().

          Tested patch on a running cluster with 3 slaves, rolling data nodes with and without concurrent write activity.

          Thanks,

          Gary

          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1590/
          -----------------------------------------------------------

          Review request for hbase.

          Summary
          -------

          This patch corrects a few problems, as I see it, with the current log rolling process:

          1) HLog.LogSyncer.run() now handles an IOException in the inner while loop. Previously any IOException would cause the LogSyncer thread to exit, even if the subsequent log roll succeeded. This would mean the region server kept running without a LogSyncer thread
          2) Log rolls triggered by IOExceptions were being skipped in the event that there were no entries in the log. This would prevent the log from being recovered in a timely manner.
          3) minor - FailedLogCloseException was never actually being thrown out of HLog.cleanupCurrentWriter(), resulting in inaccurate logging on RS abort

          The bigger change is the addition of a configuration property – hbase.regionserver.logroll.errors.tolerated – that is checked against a counter of consecutive close errors to see whether or not an abort should be triggered.

          Prior to this patch, we could readily trigger region server aborts by rolling all the data nodes in a cluster while region servers were running. This was equally true whether write activity was happening or not. (In fact I think having concurrent write activity actually gave a better chance for the log to be rolled prior to all DNs in the write pipeline going down and thus the region server not aborting).

          With this change and hbase.regionserver.logroll.errors.tolerated=2, I can roll DNs at will without causing any loss of service.

          I'd appreciate some scrutiny on any log rolling subtleties or interactions I may be missing here. If there are alternate/better ways to handle this in the DFSClient layer, I'd also appreciate any pointers to that.

          This addresses bug HBASE-4222.
          https://issues.apache.org/jira/browse/HBASE-4222

          Diffs


          src/main/java/org/apache/hadoop/hbase/regionserver/LogRoller.java 8e87c83
          src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java 887f736
          src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java 287f1fb

          Diff: https://reviews.apache.org/r/1590/diff

          Testing
          -------

          Added a new test for rolling data nodes under a running cluster: TestLogRolling.testLogRollOnPipelineRestart().

          Tested patch on a running cluster with 3 slaves, rolling data nodes with and without concurrent write activity.

          Thanks,

          Gary

          Ted Yu added a comment -

          @Gary:
          Can you publish your solution?
          Every HBase user is experiencing the RegionServer aborts described in this JIRA.

          Thanks

          Andrew Purtell added a comment -

          I presume a patch or RB post is coming soon.

          Andrew Purtell added a comment -

          +1 We've tested this on EC2 clusters and it works.


  People

  • Assignee: Gary Helmling
  • Reporter: Gary Helmling
  • Votes: 0
  • Watchers: 6
