HBASE-5699: Run with > 1 WAL in HRegionServer

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Performance
    • Labels: None
    • Attachment/s: PerfHbase.txt (40 kB, ramkrishna.s.vasudevan)

        Activity

        stack added a comment - - edited

        Please provide more detail on what this issue is about and correct the subject so it's properly spelled. Thanks.

        binlijin added a comment -

        @stack,
        There is only one HLog and one Writer per HRegionServer for the write-ahead log; at any time only one writer holds the HLog lock and the others have to wait. If there were multiple HLogs or Writers, the write-ahead log could run in parallel and write performance should improve.
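
        A minimal sketch of the contention being described, with made-up class names (not from any patch attached to this issue): N logs, each with its own lock, and regions hashed across them so that appends to different logs no longer serialize on a single monitor.

          import java.util.ArrayList;
          import java.util.Arrays;
          import java.util.List;

          class ParallelWalSketch {
            private final List<Object> logLocks = new ArrayList<>();

            ParallelWalSketch(int numLogs) {
              for (int i = 0; i < numLogs; i++) {
                logLocks.add(new Object());           // one lock (and one writer) per log
              }
            }

            void append(byte[] encodedRegionName, byte[] edit) {
              // Route by region so a region's edits stay ordered within a single log.
              int idx = (Arrays.hashCode(encodedRegionName) & Integer.MAX_VALUE) % logLocks.size();
              synchronized (logLocks.get(idx)) {
                // write 'edit' to log #idx; only appends routed to the same log contend here
              }
            }
          }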

        stack added a comment -

        Yes. This topic comes up from time to time. Would be nice to try it out. It is possible to stand up the WAL subsystem on its own, so you could experiment with having HLog output to > 1 WAL. A bunch of us would be interested in what you learn.

        Juhani Connolly added a comment -

        Since we have had some similar experience, posting it here:
        We are finding most of the IPC threads in our region servers locked in HLog.append (42 out of 50; of those, 20 are in sync, and one is actually working... as is to be expected).
        We presumed the problem was the WAL synchronisation mechanism holding things up and decided to try running multiple RS per node, since we had a significant amount of free CPU and memory as well as many barely active hard disks.
        By running 3 RS per node, we saw our application-specific throughput go from 7k events to 18k. Each event is made up of roughly 2 writes and 2 increments, plus some reads/scans which shouldn't be touching the WAL.
        This is partially also due to a very high spec per node. I don't think it would be necessary on more "commodity" type servers, but the option to use multiple WALs on each region server may well give significant throughput gains for some hardware setups.

        binlijin added a comment -

        @stack,
        I ran a test with the 0.90 version using 10 writers and 3 nodes; sometimes it showed double the write performance, though maybe not consistently.

        chunhui shen added a comment -

        I think the number of datanodes in the test is a little low. Using double the hlog writers in the RS, write performance should nearly double unless limited by HDFS.

        stack added a comment -

        @binlijin What Chunhui says. I'd think that if it were a bigger cluster you'd see a more marked improvement. What about recovery? How does log splitting work with all the extra WALs?

        binlijin added a comment -

        I just ran a test and didn't test recovery or the other aspects.

        Li Pi added a comment -

        This seems interesting. I'll take a look at doing this.

        stack added a comment -

        @Ted Why delete a comment, especially someone else's?

        Ted Yu added a comment -

        It was a duplicate message.

        stack added a comment -

        @Ted Would suggest you just leave it. When you delete, we all get a message in our mailbox about the delete transaction. Then we start to wonder...

        Ted Yu added a comment -

        Playing with a prototype of this feature using ycsb (half insert, half update) on a 5-node cluster where usertable has 13 regions on each region server.
        Without this feature:

         10 sec: 99965 operations; 9996.5 current ops/sec; [UPDATE AverageLatency(us)=258.68] [INSERT AverageLatency(us)=610.28]
         20 sec: 99965 operations; 0 current ops/sec;
         25 sec: 99990 operations; 4.3 current ops/sec; [UPDATE AverageLatency(us)=2594303.62] [INSERT AverageLatency(us)=1240495.41]
        [OVERALL], RunTime(ms), 25844.0
        [OVERALL], Throughput(ops/sec), 3868.9831295465096
        [UPDATE], Operations, 49935
        [UPDATE], AverageLatency(us), 674.2635626314209
        

        with this feature:

         10 sec: 99952 operations; 9994.2 current ops/sec; [UPDATE AverageLatency(us)=178.7] [INSERT AverageLatency(us)=584.76]
         20 sec: 99990 operations; 3.8 current ops/sec; [UPDATE AverageLatency(us)=10.88] [INSERT AverageLatency(us)=679174.27]
         20 sec: 99990 operations; 0 current ops/sec;
        [OVERALL], RunTime(ms), 20867.0
        [OVERALL], Throughput(ops/sec), 4791.776489193463
        [UPDATE], Operations, 49992
        [UPDATE], AverageLatency(us), 178.6439030244839
        
        Elliott Clark added a comment -

        Intuitively it seems like the number of WALs used should be related to the number of spindles available to HBase. So maybe this should be either a configurable number or something derived from the number of mount points HDFS is hosted on?

        Ted Yu added a comment -

        Currently I use the following knob for the maximum number of WALs on an individual region server:

        +    int totalInstances = conf.getInt("hbase.regionserver.hlog.total", DEFAULT_MAX_HLOG_INSTANCES);
        
        Jean-Daniel Cryans added a comment -

        Intuitively it seems like the number of WALs used should be related to the number of spindles available to HBase.

        I disagree; considering that most deployments have rep=3, you're using three spindles, not one. The multiplying effect could generate a lot of disk seeks since the WALs would be competing like that (plus flushing, compacting, etc).

        Todd Lipcon added a comment -

        I disagree, considering that most of the deployments have rep=3 you're using three spindles not one

        That said, most of our customers are deploying 6 disks if not 12.

        IMO the other big gain we can get from multiple WALs is to automatically switch between WALs when one gets "slow". IMO we should maintain a count of outstanding requests (probably by size) for each WAL, and submit writes to whichever has fewer outstanding requests. That way, if one is faster, it will take more of the load. Then simultaneously measure trailing latency stats on each WAL, and if one is significantly slower than the others for some period of time, have it roll (to try to get a new set of disks/nodes).
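
        A rough sketch of that dispatch policy, with made-up type names (OutstandingWal is not an HBase class): submit each write to the WAL with the fewest outstanding bytes, and roll any WAL whose trailing sync latency is much worse than the best one.

          import java.util.List;
          import java.util.concurrent.atomic.AtomicLong;

          class OutstandingWal {
            final AtomicLong outstandingBytes = new AtomicLong();   // bumped on append, decremented on sync
            volatile double trailingSyncLatencyMs;                  // rolling average of recent sync times
            void roll() { /* close the current file and open a new one, hopefully on new disks/nodes */ }
          }

          class LeastLoadedWalPicker {
            OutstandingWal pick(List<OutstandingWal> wals) {
              OutstandingWal best = wals.get(0);
              for (OutstandingWal w : wals) {
                if (w.outstandingBytes.get() < best.outstandingBytes.get()) best = w;
              }
              return best;                                          // a faster WAL naturally takes more of the load
            }

            void maybeRollSlowWal(List<OutstandingWal> wals, double slowFactor) {
              double fastest = Double.MAX_VALUE;
              for (OutstandingWal w : wals) fastest = Math.min(fastest, w.trailingSyncLatencyMs);
              for (OutstandingWal w : wals) {
                if (w.trailingSyncLatencyMs > slowFactor * fastest) w.roll();
              }
            }
          }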

        Li Pi added a comment -

        Agree with Todd on the implementation details. The switching of logs should also serve to help balance our log writes.

        Ted Yu added a comment -

        Trying to understand the implications of Todd's suggestion above.
        Currently each HRegion has a reference to the HLog it uses. If requests can be freely redirected to whichever HLog instance has fewer outstanding requests, that reference effectively moves up to the region server.
        This means additional logic on the region server for dispatching write requests.

        Jonathan Hsieh added a comment -

        Part of the motivation for multiple WALs can be found in this tech talk (most relevant to HBase is backup requests, starting at slide 39):

        http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/Berkeley-Latency-Mar2012.pdf

        Jonathan Hsieh added a comment -

        The argument here is mostly aimed at read latency, but a similar idea could be used for write latency as well.

        Todd Lipcon added a comment -

        Currently each HRegion has reference to the HLog it uses. If requests can be freely redirected to the HLog instance having fewer outstanding requests, the reference would be to that of the region server.

        Sorry, I should be less free-wheeling with my terminology. My thought was that there is still a single "HLog" class, but underneath it would be multiple "SequenceFileLogWriters", most likely. Though maybe the correct implementation is to make HLog an interface, and then have a MultiHLog which wraps N other HLogs or something. Either way, any region would only have a reference to one "HLog" object, which might have more than one underlying stream.
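
        One way that shape could look, purely as a sketch (HLogSketch/MultiHLogSketch are illustrative names, not the interfaces that actually came out of this work):

          interface HLogSketch {
            long append(byte[] encodedRegionName, byte[] walEdit) throws java.io.IOException;
            void sync() throws java.io.IOException;
          }

          // Wraps N underlying logs; a region still holds a single HLog-like reference.
          class MultiHLogSketch implements HLogSketch {
            private final HLogSketch[] delegates;

            MultiHLogSketch(HLogSketch[] delegates) { this.delegates = delegates; }

            public long append(byte[] encodedRegionName, byte[] walEdit) throws java.io.IOException {
              // route the edit to one underlying log (here by region hash; could be by load instead)
              return pick(encodedRegionName).append(encodedRegionName, walEdit);
            }

            public void sync() throws java.io.IOException {
              for (HLogSketch d : delegates) d.sync();  // conservative: sync every underlying stream
            }

            private HLogSketch pick(byte[] region) {
              return delegates[(java.util.Arrays.hashCode(region) & Integer.MAX_VALUE) % delegates.length];
            }
          }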

        Ted Yu added a comment -

        to one "HLog" object, which might have more than one underlying stream.

        The above can be a (sub-)task by itself.

        Ted Yu added a comment -

        Currently we maintain one sequence number per region per HLog. From append():

              this.lastSeqWritten.putIfAbsent(regionInfo.getEncodedNameAsBytes(),
                Long.valueOf(seqNum));
        

        If WALEdits from a particular region can be spread across multiple streams, the accounting would be more complex.
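
        One possible shape of that extra accounting, illustrative only: keep a lastSeqWritten-style map per stream and treat a region's oldest unflushed sequence id as the minimum across all streams.

          import java.util.List;
          import java.util.concurrent.ConcurrentMap;

          class PerStreamSeqTracker {
            // one map per underlying WAL stream: encoded region name -> first unflushed sequence id
            private final List<ConcurrentMap<String, Long>> lastSeqWrittenPerStream;

            PerStreamSeqTracker(List<ConcurrentMap<String, Long>> maps) {
              this.lastSeqWrittenPerStream = maps;
            }

            // The region cannot be considered flushed past the oldest id still pending on any stream.
            long oldestUnflushedSeqId(String encodedRegionName) {
              long oldest = Long.MAX_VALUE;
              for (ConcurrentMap<String, Long> perStream : lastSeqWrittenPerStream) {
                Long seq = perStream.get(encodedRegionName);
                if (seq != null) oldest = Math.min(oldest, seq);
              }
              return oldest;
            }
          }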

        ramkrishna.s.vasudevan added a comment -

        Do we need to guarantee HLog edit sequencing even with multiple WALs? Just referring to Stack's comment in
        https://issues.apache.org/jira/browse/HBASE-5782?focusedCommentId=13255344&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13255344

        Li Pi added a comment -

        I'm assuming we don't need to guarantee HLog edit sequencing. If we do, this becomes a bit harder.

        Ted Yu added a comment -

        Are replication-related unit tests passing?

        Since the review process would take at least a month, I think developing against a branch would be good practice.

        Li Pi added a comment -

        Replication is the failure point. I haven't really worked on that yet.

        Talked to Jon about the dev process. I'll create a separate JIRA for refactoring HLog into an interface. I'll probably continue to work within trunk.

        A separate JIRA should make things easier though.

        Ted Yu added a comment -

        Using trunk has the drawback that performance numbers (without this feature) gathered on day N may be obsolete by day N + 5, considering the amount of changes going into trunk.

        I would suggest tackling replication as first priority. Dictionary WAL compression brought unexpected complexities w.r.t. replication. We shouldn't make replication any harder.

        w.r.t. refactoring HLog into an interface, I tend to think that the interface should make different implementations possible.
        If we only have one implementation, it is not easy to evaluate the effectiveness of the refactoring.

        Ted Yu added a comment -

        Here're the key unit tests that must pass:

        TestDistributedLogSplitting, TestReplication, TestMasterReplication, TestMultiSlaveReplication, TestHLog, TestHLogSplit, TestLogRollAbort, TestLogRolling

        Li Pi added a comment -

        While performance numbers will change, you can simply test with multiple HLogs on and multiple HLogs off. I don't think we're going to move everyone over to multiple HLogs immediately.

        Will take a look at those tests.

        Ted Yu added a comment -

        This feature requires validation on a real cluster.

        @Jonathan:
        Are you able to help Li in this regard?

        From my experience over the past three weeks, development involves coding -> running the test suite -> discovering defects through failed unit tests -> bug fixing -> validation through ycsb -> ...

        stack added a comment -

        @Li You should be able to work wherever you like if you build a harness for running hlog implementations apart from hbase. This should be the first order of business (unless you are a masochist). Should be easy enough, if it's not possible already, especially after you make it pluggable.

        Regarding "I'm assuming we don't need to guarantee HLog edit sequencing. If we do, this becomes a bit harder." – well, that's the way it is currently, so the onus will be on you to come up with a reason why it could be otherwise. In-order makes it easier to reason about whether or not all edits up to a particular sequence id have been sync'd.

        And don't forget the other side of the moon, the (distributed) log splitting story. That needs to work too after you are done.

        Jonathan Hsieh added a comment -

        Li, if you want to undertake this I'll help. Let's chat, then write a one-to-two page summary of our goals, our assumptions, our intended mechanisms, and how we are going to test this, and then loop back here with a design/plan to get feedback.

        Another "feature" that may come into play is HLog compression.

        Li Pi added a comment -

        I thought about the compression bit already. I was going to compress each separate log individually.

        Yeah, I probably should have written up what I was going to do before hacking stuff up. Will switch gears and work on that a bit instead.

        Ted Yu added a comment -

        there is still a single "HLog" class, but underneath it would be multiple "SequenceFileLogWriters"

        My approach is different from the above.
        The new interface should be general enough that multi-HLog can be implemented without requiring HLog to have multiple writers.

        Jonathan Hsieh added a comment -

        Zhihong, I'm curious to learn about the approach you have taken in the prototype that you have. Is it on github somewhere perhaps?

        If you have multiple hlogs do you use a different hlog in different regions?
        Do you have a shim that looks like an hlog but has two hlogs inside it (as opposed to hdfs file handles)?

        Ted Yu added a comment -

        If you have multiple hlogs do you use a different hlog in different regions?

        Correct.

        I have to go through legal procedure at my employer before disclosing my patch.

        ramkrishna.s.vasudevan added a comment -

        We are also interested in this.
        Worked on a prototype with one HLog instance but multiple underlying writer instances. Each region is allocated to one of the writer instances and writes to the hlog using the instance associated with it.

        Even on log rolling, the instance associated with each region is updated and the region continues to use its mapping.
        Without patch
        ~53K puts/sec.

        With patch
        ~78-80k puts/sec

        It is a 3-node cluster and the size of each record was 1k. Number of regions: 2800.
        By default I used 3 writer instances. I was able to pass the test cases related to TestHLog and TestDistributedLogSplitting, but TestMasterReplication was not passing.
        Replication needs some changes based on this, which I have not worked on much.

        The pendingWrites list that we use is now converted into a map from each writer to its list of pending writes.
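
        A sketch of that structure with placeholder types (not the actual patch): each region is pinned to one writer, and pending WAL entries are queued per writer rather than in a single list.

          import java.util.ArrayList;
          import java.util.HashMap;
          import java.util.List;
          import java.util.Map;

          class MultiWriterQueuesSketch {
            private final List<Object> writers;                               // stand-ins for the log writer instances
            private final Map<String, Object> regionToWriter = new HashMap<>();
            private final Map<Object, List<byte[]>> pendingWrites = new HashMap<>();

            MultiWriterQueuesSketch(List<Object> writers) {
              this.writers = writers;
              for (Object w : writers) pendingWrites.put(w, new ArrayList<>());
            }

            synchronized void append(String encodedRegionName, byte[] entry) {
              // A region keeps its writer across log rolls; new regions are spread by hash.
              Object w = regionToWriter.computeIfAbsent(encodedRegionName,
                  r -> writers.get((r.hashCode() & Integer.MAX_VALUE) % writers.size()));
              pendingWrites.get(w).add(entry);                                // each writer's queue is flushed/synced on its own
            }
          }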

        Please provide your suggestions on this.
        BTW, Li Pi, any progress on this? I would love to help you with it.
        Maybe I can prepare a more formal patch and upload it here.

        Ted Yu added a comment -

        @Ramkrishna:
        Your numbers look better than mine, though the mix in my case was 50% updates and 50% puts.

        Can you publish latency numbers as well?

        Lars Hofhansl added a comment -

        Should we explore a WAL per region? It would be a lot of open files, but if it worked, we wouldn't need log splitting anymore.

        Ted Yu added a comment -

        There would be many regions in a cluster. They may not receive an even write load.

        We should add a configuration parameter that governs the maximum number of concurrent WALs on each region server.

        Li Pi added a comment -

        My design is a bit different. I'll upload a patch soon. I'm doing any region to any HLog. Currently distributed log splitting and replication do not work yet.

        Li Pi added a comment -

        BTW, I have finals and other stuff coming up, so it might be a while before I finish my implementation. If anyone else wants to take a go at it, that would be cool.

        Lars Hofhansl added a comment - - edited

        I suspect this will become more important when people eventually turn on HBASE-5954 (durable sync, if they don't run in data centers with backup power supplies).

        There would be many regions in a cluster. They may not receive even write load.

        Is that necessarily a problem? Just saying that while we are exploring this, we might as well explore this option as well. I for one would be happy if a region's edits were tied to that region and log splitting could just go away (well, almost; we would still need to split when the region is split).

        Todd Lipcon added a comment -

        I think with durable sync, having a WAL-per-region would be even less feasible than it is today – we currently depend on batching in order to get good throughput. If a server has 50 regions, then you'd get 50x less batching opportunity and write throughput would grind to a halt. Imagine a fan-out write to all of the regions – it would generate 50 disk seeks instead of just 1.

        Lars Hofhansl added a comment -

        Good point.

        Was referring to the general feature, not necessarily WAL/Region.
        It's a trade-off: batching vs. parallel writes (just to state the obvious).

        Do we batch beyond a region normally, though? Maybe during cache flush.

        Yeah, WAL/Region with sync is probably not a good idea, there just won't be enough spindles in the HDFS cluster to absorb that.

        So what's a good heuristic for the number of WALs? Maybe (assuming good block distribution and that HBase is the only user of the cluster) it should be around #spindles/#replicas...?
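
        As a purely illustrative reading of that heuristic (numbers not from this thread): a node with 12 data disks and replication factor 3 would come out to roughly 12 / 3 = 4 WALs per region server.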

        ramkrishna.s.vasudevan added a comment -

        Perf results.
        @Ted
        The attached file also has the latency results. Run using LoadTestTool. Sorry for being a little late. Will upload the patch later.

        Ted Yu added a comment -

        Can you run ycsb with a 50% insert and 50% update load?
        The performance numbers in the attachment match what I got with my implementation.

        Thanks

        Lars Hofhansl added a comment -

        This is related to HBASE-6116.
        HBASE-6116 would improve latency, whereas this issue would mostly improve throughput.

        Lars Hofhansl added a comment -

        @Ted or @Ram: If you have any chance to test HBASE-6116 as well, that'd be really cool (although it would be more effort, as it only works against Hadoop trunk - and soon Hadoop 2.0-alpha).
        Andy said he might test against EC2.

        stack added a comment -

        What's the high level on the perf numbers? Do more WALs help? How much? Thanks.

        ramkrishna.s.vasudevan added a comment -

        @Ted
        We will get the ycsb report tomorrow; the environment is busy today.
        @Lars
        We will try to check HBASE-6116 as well, though I'm not sure it will be within the next couple of days. Anyway, we will try.

        Lars Hofhansl added a comment -

        I think we should wait for the test results with HBASE-6116 before we invest more time in this.
        My gut feeling tells me that this is something better handled at the HDFS level.

        Ted Yu added a comment -

        As I mentioned in HBASE-6055 @ 04/Jun/12 17:47, one of the benefits of this feature is for each HLog file to receive edits for one single table.

        Todd Lipcon added a comment -

        I think we should wait for test result with HBASE-6116 before we invest more time in this.

        HBASE-6116 seems like it would improve latency but hurt throughput – on a typical gbit link, the parallel writes would limit us to 50M/sec for 3 replicas, whereas pipelined writes could give us 100M+.

        The other main advantage of this JIRA is that the speed of the WAL is currently limited to the minimum speed of the 3 disks chosen in the pipeline. Given that disks can be heavily loaded, the probability of getting even a full disk's worth of throughput is low – the likelihood is that at least one of those disks is also being written to or read from at least another client. So typically any single HDFS stream is limited to 35-40MB/sec in my experience.

        Given that gbit is much faster than this, we can get better throughput by adding parallel WALs, so as to stripe across disks and dynamically push writes to less-loaded disks.
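
        Back-of-envelope check of those figures (editorial, assuming roughly 125 MB/s usable on a gbit link): with fan-out writes the client NIC carries all three replica copies, so useful WAL throughput tops out on the order of 125 / 3 ≈ 40-50 MB/s, while a pipelined write sends only one copy from the client and can approach 100 MB/s or more.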

        Lars Hofhansl added a comment -

        Assuming DataNodes and RegionServers are colocated, no more bits will have to cross the (aggregate) "wires". Further assuming good load balancing within HBase, the net bandwidth is still spread over the cluster (but with lower latency at each RegionServer).
        So I do not believe that HBASE-6116 will actually hurt performance.

        The key question is whether WAL writing is mostly bound by latency or bandwidth (and I do not know).
        Do we get 35-40MB/sec throughput from writing the WAL? If not, it is likely bound by latency.

        Hudson added a comment -

        Integrated in HBase-TRUNK #3408 (See https://builds.apache.org/job/HBase-TRUNK/3408/)
        HBASE-5699 Refactor HLog into an interface (Revision 1393126)

        Result = FAILURE
        stack :
        Files :

        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/backup/example/LongTermArchivingHFileCleaner.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/fs/HFileSystem.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/HLogInputFormat.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/WALPlayer.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/cleaner/CleanerChore.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/cleaner/HFileCleaner.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/cleaner/LogCleaner.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/protobuf/ProtobufUtil.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/metrics/RegionServerMetrics.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/Compressor.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogFactory.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogMetrics.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogPrettyPrinter.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogUtil.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/SequenceFileLogReader.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/SequenceFileLogWriter.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/WALActionsListener.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/WALCoprocessorHost.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/util/HBaseFsck.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/util/HFileArchiveUtil.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/util/HMerge.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/util/MetaUtils.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/backup/example/TestZooKeeperTableArchiveClient.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/TestRowProcessorEndpoint.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/TestWALObserver.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/fs/TestBlockReorder.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestHLogRecordReader.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestDistributedLogSplitting.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestCacheOnWriteInSchema.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestCompactSelection.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransaction.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestStore.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/FaultySequenceFileLogReader.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/HLogPerformanceEvaluation.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/HLogUtilsForTests.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestHLog.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestHLogMethods.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestHLogSplit.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRollAbort.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRollingNoCluster.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestWALActionsListener.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestWALReplay.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSource.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestReplicationSourceManager.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/util/TestMergeTool.java
        Nicolas Liochon added a comment -

        If we implement this, we should test the impact on MTTR as well. My fear is that we will have many more leases to recover, and the way it's written today (one after the other), it could make failure recovery much slower on a small cluster.

        Anoop Sam John added a comment -

        Maybe we need to combine efforts here with HBASE-7835.
        Jeffrey Zhong is working on HBASE-7835, where he tries to do WAL replay with HTable#put().

        I had added the comment below to HBASE-7835:

        I was thinking about this area. We have different JIRAs now related to HLog and its split and replay:
        this one + HBASE-6772 + multi WAL...
        Can we think about them all together?
        For multi WAL, if we have a fixed set of regions per WAL, then when one RS goes down the Master can assign those regions (as far as possible) to one other RS [region groups in RS]. If the corresponding HLog file is also assigned to that RS, then for replay it can do puts directly on the region rather than going over IPC.

        If we can do all of this, I think MTTR can also be improved.
        I will start working on this JIRA (along with Ram) from next week.

        Ted Yu added a comment -

        we will have many more leases to recover

        At the beginning of recovery, the master can send lease recovery requests for the outstanding WAL files using a thread pool.
        Each split worker would first check whether the WAL file it processes is closed.
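
        A minimal sketch of that idea, assuming a hypothetical recoverLease() helper (the real code would go through the HDFS DistributedFileSystem API):

          import java.util.List;
          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.Executors;

          class WalLeaseRecoverySketch {
            void recoverAll(List<String> walPaths) {
              ExecutorService pool = Executors.newFixedThreadPool(8);    // pool size would be a config knob
              for (String path : walPaths) {
                pool.submit(() -> recoverLease(path));                   // kick off recovery for every outstanding WAL
              }
              pool.shutdown();                                           // split workers later verify each file is closed
            }

            private void recoverLease(String path) {
              // hypothetical placeholder for the actual lease recovery call
            }
          }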

        stack added a comment -

        This issue adds switching between WALs


          People

          • Assignee: Li Pi
          • Reporter: binlijin
          • Votes: 0
          • Watchers: 50
