HBase / HBASE-5843

Improve HBase MTTR - Mean Time To Recover

    Details

    • Type: Umbrella
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.95.2
    • Fix Version/s: 0.96.0
    • Component/s: None
    • Labels:
      None

      Description

      A part of the approach is described here: https://docs.google.com/document/d/1z03xRoZrIJmg7jsWuyKYl6zNournF_7ZHzdi0qz_B4c/edit

      The ideal target is:

      • failures impact client applications only by an added delay to execute a query, whatever the failure.
      • this delay is always less than 1 second.

      We're not going to achieve that immediately...

      Priority will be given to the most frequent issues.
      Short term:

      • software crash
      • standard administrative tasks such as stop/start of a cluster.

        Issue Links

          Activity

          Nicolas Liochon added a comment -

          Small status as of June.

          • Improvements identified
            Failure detection time: performed by ZK, with a timeout. With the default value, we needed 90 seconds before starting to act on a software or hardware issue.
            Recovery time - server side: split in two parts: reassigning the regions of a dead RS to a new RS, and replaying the WAL. Must be as fast as possible.
            Recovery time - client side: errors should be transparent for the user code. On the client side, we must also keep the time lost on errors to a minimum.
            Planned rolling restart: just make this as fast and as non-disruptive as possible.
            Other possible changes: detailed below.

          Status

          Failure detection time: software crash - done
          Done in HBASE-5844, HBASE-5926

          Failure detection time: hardware issue - not started
          1) as much as possible, it should be handled by ZooKeeper and not HBase; see open JIRAs such as ZOOKEEPER-702, ZOOKEEPER-922, ...
          2) we need to make it easy for a monitoring tool to tag a RS or Master as dead. This way, specialized HW tools could point out dead RSs. JIRA to open. A sketch of the idea follows.
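
          For 2), a minimal sketch (not from this ticket) of what an external monitoring tool could already do: delete the region server's ephemeral znode so the master reacts immediately instead of waiting for the ZooKeeper session timeout. The quorum address and the znode path/server name below are assumptions for illustration only.

          import org.apache.zookeeper.ZooKeeper;

          // Sketch: flag a region server as dead by removing its znode.
          // Paths and names are hypothetical; adapt to the actual cluster layout.
          public class MarkRegionServerDead {
            public static void main(String[] args) throws Exception {
              String quorum = "zkhost:2181";                                     // hypothetical quorum
              String rsZnode = "/hbase/rs/rs1.example.com,60020,1348000000000";  // hypothetical RS znode
              ZooKeeper zk = new ZooKeeper(quorum, 30000, event -> { });
              try {
                zk.delete(rsZnode, -1);  // -1 = any version; the master sees the node disappear
              } finally {
                zk.close();
              }
            }
          }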

          Recovery time - Server: in progress
          1) bulk assignment: To be retested; there are many just-closed JIRAs on this (HBASE-5998, HBASE-6109, HBASE-5970, ...). A lot of work by many people. There are still possible improvements (HBASE-6058, ...)
          2) Log replay: To be retested; there are many just-closed JIRAs on this (HBASE-6134, ...).

          Recovery time - Client - done
          1) The RS now returns the new RS to the client after a region move (HBASE-5992, HBASE-5877)
          2) The client retries sooner on errors (HBASE-5924).
          3) In the future, it could be interesting to share the region location in ZK with the client. It's not reasonable today as it could lead to too many connections to ZK. ZOOKEEPER-1147 is an open JIRA on this.
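
          As a side note on 2), a minimal sketch of the standard client-side knobs that control the retry pace (this is not what the JIRA changed; the values are illustrative assumptions, not recommendations):

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.hbase.HBaseConfiguration;

          // Sketch only: make the client retry sooner and give up earlier.
          public class ClientRetryTuning {
            public static void main(String[] args) {
              Configuration conf = HBaseConfiguration.create();
              conf.setInt("hbase.client.retries.number", 10); // retries before the client gives up
              conf.setLong("hbase.client.pause", 100);        // base pause (ms) between retries
              // Any HTable / connection built from this conf inherits these settings.
            }
          }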

          Planned rolling restart performance - in progress
          Benefits from the client modifications mentioned above.
          To do: analyze move performance to make it faster if possible.

          Other possible changes
          Restart the server immediately on software crash: done in HBASE-5939
          Reuse the same assignment on software crash: not planned
          Use spare hardware to reuse the same assignment on hardware failure: not planned
          Multiple RS for the same region (excluded in the initial document: hbase architecture change previously discussed by Gary H./Andy P.): not planned

          ramkrishna.s.vasudevan added a comment -

          @N
          Just as an update, I feel HBASE-6060 (in progress) may also be related to MTTR.

          Andrew Purtell added a comment -

          HBASE-5844 and HBASE-5926 are not in 0.94; can we port these back? They seem like pretty self-contained changes.

          Nicolas Liochon added a comment -

          @ram Thanks for pointing this one out. I will wait for the fix before redoing a perf check.
          @andrew Yes, there should be no issue. HBASE-5926 modifies the content of HBASE-5844, so the merged patch will be smaller. I will have a look.

          Nicolas Liochon added a comment -

          Some test results:

          I tested the following scenarios on a local machine: a pseudo-distributed
          cluster with ZooKeeper and HBase writing to a RAM drive, no datanode or
          namenode, with 2 region servers and one empty table with 10000 regions,
          5K on each RS. Versions taken Monday the 2nd.

          1) Clean stop of one RS; wait for all regions to become online again:
          0.92: ~800 seconds
          0.96: ~13 seconds

          => Huge improvement, hopefully from stuff like HBASE-5970 and HBASE-6109.

          1.1) As above with 2Mb memory per server
          Results as 1)

          => Results don't depend on any GC stuff (memory reported is around 200 Mb)

          2) Kill -9 of a RS; wait for all regions to become online again:
          0.92: 980s
          0.96: ~13s

          => The 180s gap comes from HBASE-5844. For master, HBASE-5926 is not tested but should bring similar results.

          3) Start of the cluster after a clean stop; wait for all regions to
          become online.
          0.92: ~1020s
          0.94: ~1023s (tested once only)
          0.96: ~31s

          => The benefit is visible at startup
          => This does not come from something implemented for 0.94

          4) As 3) But with HBase on a local HD
          0.92: ~1044s (tested once only)
          0.96: ~28s (tested once only)

          => Similar results. It seems that HBase I/O was not, and is not becoming, the bottleneck.

          5) As 1) With 4RS instead of 2
          0.92) 406s
          0.96) 6s

          => Twice as fast in both cases. Scales with the number of RS with both versions on this minimalistic test.

          6) As 3) But with ZK on a local HD
          Impossible to get something consistent here; it is machine and test dependent.
          The most credible result was similar to 2).
          From the ZK mailing list and ZOOKEEPER-866, it seems that's what we should expect.

          7) With 2 RS, Insert 20M simple puts; then kill -9 the second one. See how long it takes to have all the regions available.
          0.92) 180s detection time, then hangs twice out of 2 tests.
          0.96) 14s (hangs once out of 3)

          => There's a bug
          => Test to be changed to get a real difference when we need to replay the wal.
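
          For reference, a rough sketch of how the "wait for all regions to become online" timing could be measured (the expected count matches the 10000-region table above; the rest is an assumption):

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.hbase.HBaseConfiguration;
          import org.apache.hadoop.hbase.client.HBaseAdmin;

          // Sketch: poll the cluster status after killing a RS and report how long
          // it takes for the expected number of regions to be online again.
          public class WaitForRegionsOnline {
            public static void main(String[] args) throws Exception {
              Configuration conf = HBaseConfiguration.create();
              HBaseAdmin admin = new HBaseAdmin(conf);
              int expected = 10000;  // matches the 10000-region table used in the test
              long start = System.currentTimeMillis();
              while (admin.getClusterStatus().getRegionsCount() < expected) {
                Thread.sleep(500);
              }
              System.out.println("All regions online after "
                  + (System.currentTimeMillis() - start) / 1000 + "s");
              admin.close();
            }
          }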

          Nicolas Liochon added a comment -

          @andrew I had a look at HBASE-5844 and HBASE-5926; they have a small dependency on protobuf stuff that I had forgotten (they read the server name from ZK), so it's not a pure git port.

          Gregory Chanan added a comment -

          Looks great so far, nkeywal.

          Some questions:

          2) Kill -9 of a RS; wait for all regions to become online again:
          0.92: 980s
          0.96: ~13s
          => The 180s gap comes from HBASE-5844. For master, HBASE-5926 is not tested but should bring similar results.

          I'm confused as to what the 180s gap refers to. I see 980 (test 2) - 800 (test 1) = 180, but that is against 0.92, which doesn't have HBASE-5970, right? Could you clarify?

          3) Start of the cluster after a clean stop; wait for all regions to
          become online.
          0.92: ~1020s
          0.94: ~1023s (tested once only)
          0.96: ~31s
          => The benefit is visible at startup
          => This does not come from something implemented for 0.94

          Awesome. We think this is also due to HBASE-5970 and HBASE-6109? (since I assume HBASE-5844 and HBASE-5926 do not apply in this case).

          7) With 2 RS, Insert 20M simple puts; then kill -9 the second one. See how long it takes to have all the regions available.
          0.92) 180s detection time+ then hangs twice out of 2 tests.
          0.96) 14s (hangs once out of 3)
          => There's a bug

          Has a JIRA been filed?

          Test to be changed to get a real difference when we need to replay the wal.

          Could you clarify what you mean here?

          Nicolas Liochon added a comment -

          I'm confused as to what the 180s gap refers to. I see 980 (test 2) - 800 (test 1) = 180, but that is against 0.92, which doesn't have HBASE-5970, right? Could you clarify?

          Yes, it's because with a clean stop, the RS unregisters itself in ZK, so the recovery starts immediately. With a kill -9, the RS remains registered in ZK. So if you don't have HBASE-5844 or HBASE-5926, you wait for the ZK timeout.
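
          For illustration, a rough sketch of the ZooKeeper behaviour behind this (quorum address and znode path are hypothetical): an ephemeral znode disappears immediately on a clean session close, but only after the session timeout when the process is killed.

          import org.apache.zookeeper.CreateMode;
          import org.apache.zookeeper.ZooDefs;
          import org.apache.zookeeper.ZooKeeper;

          // Sketch of the ZooKeeper behaviour the answer above relies on.
          public class EphemeralDemo {
            public static void main(String[] args) throws Exception {
              ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, event -> { });
              zk.create("/rs-heartbeat-demo", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
              // Clean stop: close() ends the session, the znode vanishes immediately,
              // and recovery can start at once.
              zk.close();
              // Kill -9: close() never runs, the session stays open on the server side,
              // and the znode (hence the "live" RS) only disappears once the session
              // timeout (30s here) expires.
            }
          }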

          Awesome.. We think this is also due to HBASE-5970 and HBASE-6109?

          Yes.

          Has a JIRA been filed?

          Not yet. I'm writing specific unit tests for this; I found issues that I have not yet fully analyzed, and I need to create the JIRAs. Also, maybe my test was not good for this part: as I was doing the test without a datanode, it could be that the recovery was not working for that reason (I wonder if sync works with the local file system, for example).

          Test to be changed to get a real difference when we need to replay the wal.

          Could you clarify what you mean here?

          It does not last long enough, so I won't be able to see much difference even if there is one. So I need to redo the work with a real datanode, check that it recovers, then check that I measure something meaningful.
          I will also redo the first tests with a DN to see if there is still a gap.

          Nicolas Liochon added a comment -

          Same tests as before, with the datanodes.

          1) Clean stop of one RS; wait for all regions to become online again:
          Pseudo distributed without datanode:
          0.92: ~800 seconds
          0.96: ~13 seconds

          With two datanodes, on hadoop 1.0
          0.92: ~460 seconds
          0.96: ~12 seconds

          3) Start of the cluster after a clean stop; wait for all regions to
          become online.
          Pseudo distributed without datanode:
          0.92: ~1020s
          0.94: ~1023s (tested once only)
          0.96: ~31s

          With two datanodes, on hadoop 1.0
          0.92: ~640 seconds
          0.96: ~35 seconds

          So it seems 0.92 is faster with the DN, but we still see a major improvement.

          Nicolas Liochon added a comment -

          Some tests and analysis around Distributed Split

          Scenario with a single machine, killing only a region server

          • dfs.replication = 2
          • local HD. The test failed with the ramDrive.
          • Start with 2 DN and 2 RS. Create a table with 100 regions in the second one. The first holds meta & root.
          • Insert 10M rows, distributed on all regions. That creates 8 log files of 60Mb each.
          • start 3 more DN & RS.
          • kill -9 the second RS (NOT the datanode).
          • Wait for the regions to be available again.

          0.94
          ~180s detection time (sometimes less, which is strange, but always more than 120s).
          ~25s split (two tasks of ~10s each per regionserver)
          ~30s assignment

          0.96
          ~0s detection time (because it's a kill -9, so we have HBASE-5844. A HW failure would give the same result as 0.94).
          ~25s split (two tasks of ~10s each)
          ~30s assignment. Other tests seem to show that the actual time is spent replaying the edits.

          => No difference in time here, except the detection time, which will not play a role in a HW failure.
          => A split task is done in 10s. If you have a reasonable cluster, you can expect this task to take 10s in production.
          => Same should apply to replaying edits, if the regions are well redistributed among the other machines.
          => Still, the assignment/replaying could be faster. Analysis to do, and JIRA to create.

          As of today, if the regionserver crashes, we should have, in production:

          • 0.94: ~50s (30s detection + 10s split + 10s assignment)
          • 0.96: ~20s (0s detection + 10s split + 10s assignment)

          If it's a HW failure, we need to take into account that we've lost a datanode as well.

          Nicolas Liochon added a comment -

          Some tests and analysis around Distributed Split / Datanodes failures

          On a real cluster, 3 nodes.

          • dfs.replication = 2
          • local HD. The test failed with the ramDrive.
          • Start with 2 DN and 2 RS. Create a table with 100 regions in the second one. The first holds meta & root.
          • Insert 1M or 10M rows, distributed on all regions. That creates 8 log files of 60Mb each, on a single server.
          • Start another box with a DN and a RS. This box is empty (no regions, no blocks).
          • Unplug (physically) the box with the 100 regions and the 1 (for 1M puts) or 8 (for 10M puts) log files.

          Durations are in seconds, with HDFS 1.0.3 if not stated otherwise.

          1M puts on 0.94:
          ~180s detection time, sometimes around 150s
          ~130s split time (there is a single file to split. This is to be compared to the 10s per split above)
          ~180s assignment, including replaying edits. There could be some locks, as we're reassigning/replaying 50 regions per server.

          1M puts on 0.96: 3 tests, one failure.
          ~180s detection time, sometimes around 150s
          ~180s split time. Once again a single file to split. It's unclear why it takes longer than 0.94.
          ~180s assignment, as 0.94.

          Out of 3 tests, it failed once on 0.96. It didn't fail on 0.94.

          10M puts on 0.96 + HDFS branch 2 as of today
          ~180s detection time, sometimes around 150s
          ~11 minutes split. Basically it fails until the HDFS namenode marks the datanode as dead. That takes 7:30 minutes, so the split finishes after this.
          ~60s assignment? Tested only once.

          0M (zero) puts on 0.96 + HDFS branch 2 as of today
          ~180s detection time, sometimes around 150s
          ~0s split.
          ~3s assignment (This seems to say that the assignment time is spent in the edit replay.)

          10M puts on 0.96 + HDFS branch 2 + HDFS-3703 full (read + write paths)
          ~180s detection time, sometimes around 150s
          ~150s split. This is for a bad reason: all tasks except one succeed. The last one seems to connect to the dead server and finishes after ~100s. Tested twice.
          ~50s assignment. Measured once.

          So:

          • The measures on assignments are fishy. But they seem to say that we are now spending our time replaying edits. We could have issues linked to HDFS as well here: in the last two scenarios we're not going to the dead nodes when we replay/flush edits, so that could be the reason.
          • The split in 10s per 60Gb, on a single and slow HD. With a reasonable cluster, this should scale pretty well. We could improve things by using locality.
          • There will be datanode errors if you don't have HDFS-3703, and in this case it becomes complicated. See HBASE-6738.
          • With HDFS-3703, we're 500s faster. That's interesting.
          • Even with HDFS-3703 there is still something to look at in how HDFS connects to the dead node. It seems the block is empty, so it is retried multiple times. There are multiple possible paths here.
          • We can expect, in production, server side point of view
          • 30s detection time for hw failure, 0s for simpler case (kill -9, OOM, machine nicely rebooted, ...)
          • 10s split (i.e: distributed along multiple region servers)
          • 10s assignment (i.e. distributed as well).
          • Without HDFS effects here. See above.
          • This scenario is extreme, as we're losing 50% of our data. Still, if you're losing a regionserver with 300 regions, the split may not go so well if you're not lucky.
          • It means as well that the detection time dominates the other parameters when everything goes well.

          Conclusion:

          • The HDFS / HBase link plays the critical role in this scenario. HDFS-3703 is one of the keys.
          • The Distributed Split seems to work well in terms of performance.
          • Assignment itself seems ok. Replaying should be looked at (more in terms of lock than raw performances).
          • Detection time will become more and more important.
          • An improvement would be to reassign the region in parallel of the split, with:
          • continue to serve writes before the end of the split as well: the fact that we're splitting the logs does not mean we cannot write. There are real applications that could use this (maybe OpenTSDB, for example: any application that logs data just needs to know where to write).
          • continue to serve reads if they are time-ranged with the max timestamp before the failure: there are many applications that don't need data fresher than 1 minute (a client-side sketch follows after this list).
          • With this, the downtime will be totally dominated by the detection time.
          • There are JIRAs around the detection time already (basically: improve ZK and open HBase to external monitoring systems).
          • There will be some work around the client part.
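
          For illustration, a rough client-side sketch of the time-ranged read idea above, assuming the standard Scan API (the table name and timestamp are hypothetical, and the server-side ability to answer such a read before the edits are replayed does not exist yet):

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.hbase.HBaseConfiguration;
          import org.apache.hadoop.hbase.client.HTable;
          import org.apache.hadoop.hbase.client.ResultScanner;
          import org.apache.hadoop.hbase.client.Scan;

          // Sketch of the kind of request that could in principle be answered from
          // flushed files before the WAL replay: it only asks for cells strictly
          // older than the failure time. Client side only; no new server behaviour.
          public class TimeRangedRead {
            public static void main(String[] args) throws Exception {
              Configuration conf = HBaseConfiguration.create();
              HTable table = new HTable(conf, "usertable");          // hypothetical table
              long failureTs = System.currentTimeMillis() - 60_000L; // "1 minute old" data is fine
              Scan scan = new Scan();
              scan.setTimeRange(0L, failureTs);  // max timestamp strictly before the failure
              ResultScanner scanner = table.getScanner(scan);
              scanner.close();
              table.close();
            }
          }
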
          stack added a comment -

          Nicolas, the above should be put on the dev list. It's great stuff. Too good to be buried out here as a comment on an issue with such an unsexy subject.

          Nicolas Liochon added a comment -

          Improve HBase MTTR ? Unsexy? Really?
          More seriously: agreed. Will do.

          Nicolas Liochon added a comment -

          The issue when reading, even with HDFS-3703, comes from something new in HDFS 2.0. It seems that the blocks being written are no longer in the 'main' block list, so the initial fix was not working for this block. It shows that if you have the error with a single log file out of 8, you're still paying a huge price in terms of added delay...

          Nicolas Liochon added a comment -

          Test with meta:

          On a real cluster, 3 nodes.
          dfs.replication = 2
          local HD.
          Start with 2 DN and 2 RS. Create a table with 100 regions in the second one. The first holds meta & root.
          Start another box with a DN and a RS. This box is empty (no regions, no blocks).
          Unplug the box with meta & root.
          try to create a table

          => The time taken is the recovery time of the box holding meta. No bad surprise. It means as well that with the default ZooKeeper timeout you're losing the cluster for 3 minutes if your meta regionserver dies. HBASE-6772, HBASE-6773 and HBASE-6774 would help to increase meta failure resiliency.

          Gregory Chanan added a comment -

          Again, great work Nicolas. Some questions/comments:

          It means as well that the detection time dominates the other parameters when everything goes well.

          Detection time will become more and more important.

          Interesting. The sad part is we often find ourselves having to increase the ZK timeout in order to deal with Juliet GC pauses. Given that detection time dominates, perhaps we should put some effort into correcting that (multiple RS on a single box?)

          Replaying should be looked at (more in terms of lock than raw performances).

          Why do you say this with respect to locking? Is the performance not as good as you would expect? Or just haven't looked at it yet?

          An improvement would be to reassign the region in parallel of the split

          I've wondered why we don't do this. Do you see any implementation challenges with doing this? Maybe I'll look into it.

          Ted Yu added a comment -

          multiple RS on a single box?

          In case that box goes down, recovery would be more costly (compared to a single region server running on the box), right?

          Gregory Chanan added a comment -

          I don't know. You'd have to split the logs and replay them, which should be similar performance whether there are 1 or 2 WALs?

          stack added a comment -

          ...multiple RS on a single box?

          Was reading today that VoltDB does offheap so can have small java heap and just run w/ defaults other than setting -Xmx and -Xms. Could try a few RS per box.. w/ small heap each (RS needs to be put on a CPU diet but could look at that afterward).

          ....An improvement would be to reassign the region in parallel of the split

          Elsewhere nkeywal puts up a suggested prescription where on crash we assign and then split logs (rather than other way round as we do now). Nicolas suggests that the assigned regions could immediately start taking writes; we'd throw an exception if a read attempt was made until after the split completed and the regions edits had been replayed: HBASE-6752

          Gregory Chanan added a comment -

          Was reading today that VoltDB does offheap so can have small java heap and just run w/ defaults other than setting -Xmx and -Xms. Could try a few RS per box.. w/ small heap each (RS needs to be put on a CPU diet but could look at that afterward).

          Yes, CPU definitely need a diet. Probably start with eliminating a bunch of threads.

          Elsewhere nkeywal puts up a suggested prescription where on crash we assign and then split logs (rather than other way round as we do now). Nicolas suggests that the assigned regions could immediately start taking writes; we'd throw an exception if a read attempt was made until after the split completed and the regions edits had been replayed: HBASE-6752

          Right, I think HBASE-6752 is a great idea, but it doesn't address serving reads more quickly. I'm wondering if there is more we can do to address that.

          Nicolas Liochon added a comment -

          Interesting. The sad part is we often find ourselves having to increase the ZK timeout in order to deal with Juliet GC pauses. Given that detection time dominates, perhaps we should put some effort into correcting that (multiple RS on a single box?)

          Imho, multiple RS on the same box would put us in a dead end: it increases the number of TCP connections, adds workload on ZooKeeper, makes the balancer more complicated, ... We can also have operational issues (rolling upgrade, fixed TCP ports, ...).
          The possible options I know are:

          • improving ZooKeeper to have an algorithm that takes variance into account: it's a common solution to get good failure detection while minimizing false positives. Twenty years ago, it saved TCP from dying of congestion. There is ZOOKEEPER-702 about this. That's medium term (hopefully...), but it would be useful for HDFS HA as well.
          • Using the new GC options available in JDK 1.7 (see http://www.oracle.com/technetwork/java/javase/tech/g1-intro-jsp-135488.html). That's short term and simple. The only issue: it was tried a few months ago (by Andrew Purtell IIRC) and crashed the JVM. Still, it's something to look at, and maybe we should report the bugs to Oracle if we find some.
          • The offheap mentioned by Stack.

          I don't think it's one or the other; we're likely to need all of them. Still, knowing where we stand with regard to JDK 1.7 is important imho.

          Yes, CPU definitely need a diet. Probably start with eliminating a bunch of threads.

          It's not directly MTTR, but I agree: we have far too many threads and far too many thread pools. Not only is it bad for performance, it also makes analysing performance complicated.

          Right, I think HBASE-6752 is a great idea, but it doesn't address serving reads more quickly. I'm wondering if there is more we can do to address that.

          There is HBASE-6774 for the special case of "empty hlog" regions. It would be interesting to see how many regions are in this situation on different production clusters. There are so many ways to be in this situation... I would love to have a stat on "at a given point in time, what's the proportion of regions with a non-empty memstore". And improving the memstore flush policy would lead to improvements here as well, I think.
          With HBASE-6752 we also serve time-ranged reads (if they're lucky on the range).
          But yep, we don't cover all cases. Ideas welcome.

          Why do you say this with respect to locking? Is the performance not as good as you would expect? Or just haven't looked at it yet?

          I was expecting much better performance, but I haven't looked at it enough.

          I've wondered why we don't do this. Do you see any implementation challenges with doing this? Maybe I'll look into it.

          Well, it's close to the assignment part, so... But it would be great if you could have a look at this, because with all the discussions around assignment, it's important to take these new use cases into account as well.

          Gregory Chanan added a comment -

          One other question, could you explain these numbers in more detail?

          The split in 10s per 60Gb, on a single and slow HD. With a reasonable cluster, this should scale pretty well. We could improve things by using locality.

          Should this be 10s per 60 Mb, which is the size of a log ("creates 8 log files of 60 Mb each").

          Insert 1M or 10M rows, distributed on all regions. That creates 8 log files of 60Mb each, on a single server.

          The number and size of log files is independent of the number of rows? Is the size of a row in the 10M setup 1/10 the size of the row in the 1M setup?

          We can expect, in production, server side point of view
          30s detection time for hw failure, 0s for simpler case (kill -9, OOM, machine nicely rebooted, ...)
          10s split (i.e: distributed along multiple region servers)
          10s assignment (i.e. distributed as well).

          For detection: In all the tests where we are not deleting the znode, either because on 0.94 or hw failure, the detection time is 180s. Why is this listed as 30 sec?
          For split: I think I understand this. 8 log files, 60 Mb each means 8 split tasks on 4 regionservers = 2 split tasks per RS. This took ~25 seconds, so if we had enough RS for each RS to only do 1 split task, we'd be done in about 10s.
          For assignment: Not sure where this number is coming from. I see ~30s assignment and we know this would go faster with more RS, but how fast?

          Nicolas Liochon added a comment -

          Should this be 10s per 60 Mb, which is the size of a log ("creates 8 log files of 60 Mb each").
          Yes.

          The number and size of log files is independent of the number of rows? Is the size of a row in the 10M setup 1/10 the size of the row in the 1M setup?

          No / ~Yes. There's a maximum number of log files before a flush however.

          on 0.94 or hw failure, the detection time is 180s. Why is this listed as 30 sec?

          It's discussed very briefly in HBASE-5844: the timeout was bumped recently from 60s to 180s to help newcomers. The ZK default is 30s, which is imho the best choice for someone requiring a reasonable MTTR without too much risk.
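
          For reference, a rough sketch of where that timeout is set on the HBase side, assuming the standard "zookeeper.session.timeout" property (the value shown is the ZK-style 30s discussed above, not a recommendation):

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.hbase.HBaseConfiguration;

          // Sketch only: the session timeout discussed above, in milliseconds.
          public class SessionTimeoutExample {
            public static void main(String[] args) {
              Configuration conf = HBaseConfiguration.create();
              conf.setInt("zookeeper.session.timeout", 30000); // ~30s detection instead of 180s
            }
          }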

          For split: I think I understand this. 8 log files, 60 Mb each means 8 split tasks on 4 regionservers = 2 split tasks per RS. This took ~25 seconds, so if we had enough RS for each RS to only do 1 split task, we'd be done in about 10s.

          Yes, exactly. It's a reasonable expectation imho, even if there will be other cases.

          For assignment: Not sure where this number is coming from. I see ~30s assignment and we know this would go faster with more RS, but how fast?

          The calculations seem to say that the time is spent in the replay (that's the initial results from 20/Jun/12 16:52 and the test with zero rows from 07/Sep/12 17:27). It would be useful to redo the tests, as many things changed recently on assignment. But if the results are ok, and if the cluster is big enough and the regions are distributed, we should expect only one replay. Again, it's the average, not the worst case.

          Gregory Chanan added a comment -

          No / ~Yes. There's a maximum number of log files before a flush however.

          Not sure what you mean here. Are the rows the same or not? Are there just more flushes in the 10M case?

          Thanks for clarifying.

          Nicolas Liochon added a comment -

          Not sure what you mean here. Are the rows the same or not? Are there just more flushes in the 10M case?

          Yes, it's exactly the same rows for 1M puts and 10M puts.

          Nicolas Liochon added a comment -

          New scenario: datanode issue during a WAL write.

          Scenario: with replication factor 2, start 2 DN & 1 RS and do a first put. Start a new DN, unplug the second one. Do another put and measure the time of this second put.

          HBase trunk / HDFS 1.1: ~5 minutes
          HBase trunk / HDFS 2 branch: ~40 seconds
          HBase trunk / HDFS 2.0.2-alpha-rc3: ~40 seconds

          The time in HDFS 1.1 is spent in:
          ~66 seconds: wait for connection timeout (SocketTimeoutException: 66000 millis while waiting for the channel to be ready for read).
          then we have two nested retry loops:

          • 6 retries: Failed recovery attemp #0 from primary datanode x.y.z.w:11011 -> NoRouteToHostException
          • 10 sub-retries: Retrying connect to server: deadbox/x.y.z.w:11021. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

          There are more or less 4 seconds between two sub-retries, so the total time is around:
          66 + 6 * (~4 * 10) = ~300 seconds. That's our 5 minutes.

          If we change HDFS code to have "RetryPolicies.TRY_ONCE_THEN_FAIL" vs. the default "RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)", the put succeeds in ~80 seconds.
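
          For reference, a rough sketch of the two retry policies being compared, using the Hadoop retry API (this only illustrates the policies and the arithmetic above, not where HDFS actually installs them):

          import java.util.concurrent.TimeUnit;
          import org.apache.hadoop.io.retry.RetryPolicies;
          import org.apache.hadoop.io.retry.RetryPolicy;

          // Sketch of the two policies compared above. The nested default behaviour
          // (6 recovery attempts x 10 connect retries x ~4s, plus the 66s socket
          // timeout) is what adds up to roughly 300s on HDFS 1.x.
          public class RetryPolicyComparison {
            public static void main(String[] args) {
              RetryPolicy hdfsDefault =
                  RetryPolicies.retryUpToMaximumCountWithFixedSleep(10, 1, TimeUnit.SECONDS);
              RetryPolicy failFast = RetryPolicies.TRY_ONCE_THEN_FAIL;
              System.out.println("default: " + hdfsDefault + ", fail-fast: " + failFast);
              System.out.println("worst case ~ " + (66 + 6 * (4 * 10)) + "s"); // ~306s, i.e. ~5 minutes
            }
          }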

          Conclusion:

          • time with HDFS 2.x is in line with what we have for other scenarios (~40s), so it's acceptable today.
          • time with HDFS 1.x is much less satisfying (~5 minutes!), but could easily be decreased to 80s with an HDFS modification.

          Some points to think about:

          • Maybe we could decrease the timeout for the WAL: we're usually writing much less data than for a memstore flush, so having more aggressive settings for the WAL makes sense. There is a (bad) side effect: we may have more false positives, and this could decrease performance, and it will increase the workload when the cluster is globally unstable. So in the long term it makes sense, but maybe today is too early.
          • While the Namenode will consider the datanode as stale after 30s, we still continue trying. Again, it makes sense to lower the global workload, but it's a little bit annoying... There could be optimizations if the datanode state was shared with DFSClients.
          • There are some cases that could be handled faster: ConnectionRefused means the box is there but the port is not open, so there is no need to retry. And NoRouteToHostException could also be considered critical enough to stop retrying. Again, this is trading global workload vs. reactivity.
          Nicolas Liochon added a comment -

          Some more tests:

          Scenario: 2 RS. Create a table with 3000 regions.
          Start a third RS. Kill -15 the second one: there are 1500 regions to reassign.
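
          For context, a pre-split table of that size can be created with the 0.94-era client API along these lines. The table name, family, and key range are made up; the comment does not say how the test table was actually created.

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.hbase.HBaseConfiguration;
          import org.apache.hadoop.hbase.HColumnDescriptor;
          import org.apache.hadoop.hbase.HTableDescriptor;
          import org.apache.hadoop.hbase.client.HBaseAdmin;
          import org.apache.hadoop.hbase.util.Bytes;

          public class CreatePreSplitTable {
            public static void main(String[] args) throws Exception {
              Configuration conf = HBaseConfiguration.create();
              HBaseAdmin admin = new HBaseAdmin(conf);
              try {
                HTableDescriptor desc = new HTableDescriptor("mttr_test");
                desc.addFamily(new HColumnDescriptor("f"));
                // Pre-split into 3000 regions over an arbitrary row-key range.
                admin.createTable(desc, Bytes.toBytes("00000000"), Bytes.toBytes("zzzzzzzz"), 3000);
              } finally {
                admin.close();
              }
            }
          }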

          I've made multiple measurements; they are all listed below.
          All times are in seconds.
          0.96 trunk is patched with HBASE-7220 (no metrics at all).

          Creating a table with 3000 regions:
          0.92: 261s; 260s
          0.94.0: 260s; 260s
          0.94.1: 261s; 260s
          0.94 trunk: 292s; 281s; 282s;
          0.96 trunk: 173s; 178s

          Reassign after the kill:
          0.92: 107s; 110s
          0.94.0: 105s; 105s
          0.94.1: 107s; 107s
          0.94 trunk: 122s; 105s; 116s
          0.96 trunk: 50s; 50s;

          Same test, but putting ZK on a separate node, 1Gb network.
          Reassign after the kill:
          0.96 trunk: 101s.

          I need to do more tests, but it seems to match the profiling I've done; the time is now spent in ZK communications.

          Tianying Chang added a comment -

          @nkeywal What is the application bug (AB) mentioned in your design doc? Do you mean an HBase bug, or an HBase client application code bug?

          If it is an HBase client application code bug, does fixing the issue require a stop/start of the region server?

          If it is an HBase code bug, do you refer to an HBase bug that causes the region server to enter some bad state, like a deadlock, and so on? I think that could benefit from restarting the region server to fix the problem.

          Nicolas Liochon added a comment -

          What is the application bug (AB) mentioned in your design doc? Do you mean an HBase bug, or an HBase client application code bug?

          Mainly HBase, but it could also be a coprocessor issue. HBase can be configured to stop the region server if a coprocessor throws unexpected exceptions, but it's quite easy to write buggy stuff, like a coprocessor that takes resources without freeing them. In that case you may need to stop the region server.
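
          The setting meant here is, I believe, hbase.coprocessor.abortonerror. It would normally go in hbase-site.xml; the programmatic form below is just to show the property and value.

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.hbase.HBaseConfiguration;

          public class CoprocessorErrorPolicy {
            public static void main(String[] args) {
              Configuration conf = HBaseConfiguration.create();
              // Abort the region server (instead of only unloading the coprocessor)
              // when a coprocessor throws an unexpected exception.
              conf.setBoolean("hbase.coprocessor.abortonerror", true);
              System.out.println(conf.getBoolean("hbase.coprocessor.abortonerror", false));
            }
          }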

          If it is an HBase client application code bug, does fixing the issue require a stop/start of the region server?

          For a pure client (i.e. a user of the hbase.client package), it would be an HBase bug imho: HBase/a region server should be resistant to any client behavior.
          For a coprocessor, it's client code executed within the region server process. Thanks to Java, many coprocessor bugs will have a limited effect, but as said above there are some cases that cannot be handled simply.

          If it is an HBase code bug, do you refer to an HBase bug that causes the region server to enter some bad state, like a deadlock, and so on? I think that could benefit from restarting the region server to fix the problem.

          Yes.

          Tianying Chang added a comment -

          @nkeywal That is what I thought. Thanks for the clarification!!!

          Another follow-up question: how can you identify an AB problem ASAP? For example, do you conclude that there is an AB when a running HBase application's read/write performance slows down dramatically? But sometimes it could just be a temporary issue that recovers after a while, and a stop/start of the RS would only hurt performance due to region movement, even with the MTTR improvements here. Maybe simply test the performance for a longer time before drawing a conclusion? Would that work? I am trying to see if there are better ways to identify an AB problem and use graceful_stop to help improve HBase cluster performance.

          Thanks.

          Nicolas Liochon added a comment -

          Agreed, you need to be something of an expert to tell the difference between 'slow for the moment' and 'broken because we have a bad internal state'. That said, if the issue comes from a coprocessor bug, you just need to be an expert on that coprocessor.

          Likely, the number of threads, the memory consumed, the number of file descriptors, etc. could be considered indicators that something is going wrong, but by then it could already be too late for a clean stop as well...
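
          To make this concrete, a watchdog could sample a few JVM-level indicators through the standard management beans. The thresholds below are purely illustrative assumptions, and com.sun.management.UnixOperatingSystemMXBean is only available on Unix-like JVMs.

          import java.lang.management.ManagementFactory;
          import java.lang.management.OperatingSystemMXBean;
          import java.lang.management.ThreadMXBean;

          public class RegionServerHealthProbe {
            // Illustrative thresholds only; real values depend on the deployment.
            private static final int MAX_THREADS = 2000;
            private static final long MAX_OPEN_FDS = 30000;

            public static boolean looksUnhealthy() {
              ThreadMXBean threads = ManagementFactory.getThreadMXBean();
              boolean tooManyThreads = threads.getThreadCount() > MAX_THREADS;

              boolean tooManyFds = false;
              OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
              if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
                long fds = ((com.sun.management.UnixOperatingSystemMXBean) os).getOpenFileDescriptorCount();
                tooManyFds = fds > MAX_OPEN_FDS;
              }
              return tooManyThreads || tooManyFds;
            }

            public static void main(String[] args) {
              System.out.println("unhealthy: " + looksUnhealthy());
            }
          }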

          Tianying Chang added a comment -

          @nkeywal Thanks!

          I have two more questions regarding your "Improving failure detection" section.

          1. For hardware failure detection, it depends on the ZK session timeout value. You mentioned it is 30 sec, so the mean time to detect is 15 sec, which is reasonably short. But I think the default value nowadays is 180 sec (I am referring to zookeeper.session.timeout). I can see in our cluster that the master does not detect the crashed machine until 3 minutes later, when the ephemeral znode times out. So it takes 3 minutes to detect a hardware failure in the default case. That seems pretty long and should be improved.

          2. For a software bug leading to a dirty stop, you mentioned launching the region server with a script. I think we can use daemontool to start the region server, and in daemontool's callback module we can clean up the znode based on the exit error code. daemontool can also be configured to restart the RS immediately if needed. That seems like it could simplify the process. What do you think?
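
          A supervising process could indeed call something like the sketch below when the RS exits with an error code, to remove the ephemeral znode of the dead RS. The znode path layout and the way the arguments are passed in are assumptions for illustration, not the mechanism actually implemented in HBase.

          import org.apache.zookeeper.WatchedEvent;
          import org.apache.zookeeper.Watcher;
          import org.apache.zookeeper.ZooKeeper;

          public class ClearRsZnode {
            // args[0] = ZK quorum, e.g. "zkhost:2181"
            // args[1] = znode of the dead RS, e.g. /hbase/rs/host,60020,1351234567890
            public static void main(String[] args) throws Exception {
              ZooKeeper zk = new ZooKeeper(args[0], 30000, new Watcher() {
                @Override
                public void process(WatchedEvent event) {
                  // no-op: we only need a short-lived session to delete the znode
                }
              });
              try {
                zk.delete(args[1], -1); // -1 matches any znode version
              } finally {
                zk.close();
              }
            }
          }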

          Nicolas Liochon added a comment -

          That seems pretty long and should be improved.

          Yep, the default was set for new users who won't understand the impact. You should change it if you care about MTTR.
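
          For reference, it's a one-line configuration change; it normally belongs in hbase-site.xml, and the 30000 ms value is just the example discussed above (the ZooKeeper ensemble also bounds the negotiated timeout by its tickTime-derived min/max).

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.hbase.HBaseConfiguration;

          public class SessionTimeoutConfig {
            public static void main(String[] args) {
              Configuration conf = HBaseConfiguration.create();
              // A 30s session timeout gives a ~15s mean detection time for a hardware failure.
              conf.setInt("zookeeper.session.timeout", 30000);
              System.out.println(conf.getInt("zookeeper.session.timeout", -1));
            }
          }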

          That seems like it could simplify the process. What do you think?

          It's important to have the feature in the core; if not, the feature is not tested, hence it does not work, or does not work for long. There is a JIRA to support a standard administration tool in the core (I can't find it, but it exists for sure).

          Nicolas Liochon added a comment -

          Marking solved in 0.96.

          • MTTR has decreased from 10 minutes to less than one minute, ~30 seconds in many cases.
          • when a machine fails, the other machines in the cluster are still available.
          • detection time is zero when there is a crash.
          • log replay scales well, allowing a minimal replay time.
          • thanks to the new distributed WAL replay, "puts" are not impacted by the recovery. Client applications can continue to stream their writes when there is a machine failure.
          • Some of the improvements were backported to 0.94 / HDFS 1.x; they are now used in production and work as expected.

          The remaining ideas mentioned in this umbrella ticket will be tracked independently. There is still room for improvement.

          stack added a comment -

          I went over your old SU MTTR doc, Nicolas Liochon. I like the consideration given to the general topic. Do you think we should start up another effort to go to the next level w/ MTTR? For example, you mention cold standbys in that old doc. There are also other efforts that would help, such as the off-heap zk session pinging, getting distributed log replay to where we can enable it as default, work to make reads/writes during WAL replay faster (more slots), bringing regions online for writes immediately, more aggressive cleanup of outstanding WALs so there is less to split on a crash, etc.

          Good on you Nicolas Liochon


            People

            • Assignee:
              Nicolas Liochon
              Reporter:
              Nicolas Liochon
            • Votes:
              0
              Watchers:
              38
