Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Fix Version: 0.20.4
- Component: None
- Labels: None
Description
Testing 0.20_pre_durability@934643, I ended up in a state where one region server got stuck compacting a single region over and over again, forever. This was with a special config with a very low flush threshold, in order to stress-test the flush/compact code.
Attachments
- ASF.LICENSE.NOT.GRANTED--log.gz (92 kB, Todd Lipcon)
- ASF.LICENSE.NOT.GRANTED--stack (34 kB, Todd Lipcon)
- ASF.LICENSE.NOT.GRANTED--bandaid-v2.patch (2 kB, Michael Stack)
Activity
Audit logs are just showing the "stuck" RS doing this:
2010-04-15 22:08:00,574 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=todd,todd,adm,git ip=/192.168.42.43 cmd=listStatus src=/hbase/test1/1736416594/actions dst=null perm=null
2010-04-15 22:08:00,576 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=todd,todd,adm,git ip=/192.168.42.43 cmd=mkdirs src=/hbase/test1/compaction.dir/1736416594 dst=null perm=todd:supergroup:rwxr-xr-x
2010-04-15 22:08:00,577 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=todd,todd,adm,git ip=/192.168.42.43 cmd=delete src=/hbase/test1/compaction.dir/1736416594 dst=null perm=null
2010-04-15 22:08:01,575 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=todd,todd,adm,git ip=/192.168.42.43 cmd=listStatus src=/hbase/test1/1736416594/actions dst=null perm=null
(i.e., a no-op compaction)
I managed to use the log-level servlet to get debug logs of the no-op compactions:
2010-04-15 22:24:05,594 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region test1,4993900000,1271390277054 has too many store files, putting it back at the end of the flush queue.
2010-04-15 22:24:05,594 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for region test1,4993900000,1271390277054/1736416594 because: regionserver/192.168.42.43:60020.cacheFlusher
2010-04-15 22:24:05,594 INFO org.apache.hadoop.hbase.regionserver.HRegion: Starting compaction on region test1,4993900000,1271390277054
2010-04-15 22:24:05,597 DEBUG org.apache.hadoop.hbase.regionserver.Store: Skipped compaction of 1 file; compaction size of actions: 231.5m; Skipped 7 files, size: 242232921
2010-04-15 22:24:05,598 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on region test1,4993900000,1271390277054 in 0sec
Looks like the compaction code is basically deciding that there's nothing worth compacting here. Here are the files in the order the compaction code looks at them (oldest to newest), with their sizes:
7876304170849844721 2010-04-16 03:58:04 -rw-r--r-- 3 todd supergroup 134676288 2010-04-15 20:58 /hbase/test1/1736416594/actions/7876304170849844721
3289606064411356106 2010-04-16 03:59:43 -rw-r--r-- 3 todd supergroup 61309324 2010-04-15 20:59 /hbase/test1/1736416594/actions/3289606064411356106
4995952634622872563 2010-04-16 04:00:42 -rw-r--r-- 3 todd supergroup 24381446 2010-04-15 21:00 /hbase/test1/1736416594/actions/4995952634622872563
6876448782185582993 2010-04-16 04:01:03 -rw-r--r-- 3 todd supergroup 12103493 2010-04-15 21:01 /hbase/test1/1736416594/actions/6876448782185582993
7381870009831588255 2010-04-16 04:01:18 -rw-r--r-- 3 todd supergroup 5855317 2010-04-15 21:01 /hbase/test1/1736416594/actions/7381870009831588255
5274234463019618354 2010-04-16 04:01:23 -rw-r--r-- 3 todd supergroup 2712677 2010-04-15 21:01 /hbase/test1/1736416594/actions/5274234463019618354
3688488928995595533 2010-04-16 04:01:32 -rw-r--r-- 3 todd supergroup 1194376 2010-04-15 21:01 /hbase/test1/1736416594/actions/3688488928995595533
5321716733066884905 2010-04-16 04:01:33 -rw-r--r-- 3 todd supergroup 532824 2010-04-15 21:01 /hbase/test1/1736416594/actions/5321716733066884905
Here is the code:
if (!majorcompaction && !references) {
  // Here we select files for incremental compaction.
  // The rule is: if the largest (oldest) one is more than twice the
  // size of the second, skip the largest, and continue to next...,
  // until we meet the compactionThreshold limit.
  for (point = 0; point < countOfFiles - 1; point++) {
    if ((fileSizes[point] < fileSizes[point + 1] * 2) &&
        (countOfFiles - point) <= maxFilesToCompact) {
      break;
    }
    skipped += fileSizes[point];
  }
  filesToCompact = new ArrayList<StoreFile>(
      filesToCompact.subList(point, countOfFiles));
  ...
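To make the failure mode concrete, here is a small standalone harness (mine, not HBase code; maxFilesToCompact = 10 is an assumed config value, it isn't shown in the logs) that replays the loop above on the exact file sizes from this store:

public class CompactionSelectionDemo {
  public static void main(String[] args) {
    // Store file sizes from the listing above, oldest to newest.
    long[] fileSizes = { 134676288L, 61309324L, 24381446L, 12103493L,
                         5855317L, 2712677L, 1194376L, 532824L };
    int countOfFiles = fileSizes.length;
    int maxFilesToCompact = 10;  // assumption, not from the logs
    long skipped = 0;
    int point;
    for (point = 0; point < countOfFiles - 1; point++) {
      if ((fileSizes[point] < fileSizes[point + 1] * 2) &&
          (countOfFiles - point) <= maxFilesToCompact) {
        break;
      }
      skipped += fileSizes[point];
    }
    // Every file is more than twice the size of the next, so the first
    // clause never fires and the loop skips all the way to the newest file.
    System.out.println("files left: " + (countOfFiles - point) +
        ", skipped: " + point + " files, size: " + skipped);
  }
}

This prints "files left: 1, skipped: 7 files, size: 242232921", matching the DEBUG line above: a single file is selected, the Store treats that as nothing to do, and the flusher immediately re-queues the region.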
Todd did a listing of the Store:
-rw-r--r-- 3 todd supergroup 134676288 2010-04-15 20:58 /hbase/test1/1736416594/actions/7876304170849844721
-rw-r--r-- 3 todd supergroup 61309324 2010-04-15 20:59 /hbase/test1/1736416594/actions/3289606064411356106
-rw-r--r-- 3 todd supergroup 24381446 2010-04-15 21:00 /hbase/test1/1736416594/actions/4995952634622872563
-rw-r--r-- 3 todd supergroup 1194376 2010-04-15 21:01 /hbase/test1/1736416594/actions/3688488928995595533
-rw-r--r-- 3 todd supergroup 2712677 2010-04-15 21:01 /hbase/test1/1736416594/actions/5274234463019618354
-rw-r--r-- 3 todd supergroup 532824 2010-04-15 21:01 /hbase/test1/1736416594/actions/5321716733066884905
-rw-r--r-- 3 todd supergroup 12103493 2010-04-15 21:01 /hbase/test1/1736416594/actions/6876448782185582993
-rw-r--r-- 3 todd supergroup 5855317 2010-04-15 21:01 /hbase/test1/1736416594/actions/7381870009831588255
If you trace through it, the oldest is > 2 * the next oldest, and so on. Some of the timestamps are the same, so I'm not sure how that plays out... (the listing above is not strictly ordered; files with the same timestamp may not be in proper chronological order).
This issue got me thinking about why the heuristic is the way it is, and I don't quite follow.
I stepped away for a minute and tried to come up with a cost model for why it is we do minor compactions. Let me run this by everyone:
- The cost of doing a compaction is the cost of reading all of the files plus the cost of writing the newly compacted file. The new file might be a bit smaller than the sum of the originals, so the total cost = sum(compacted file sizes) + write cost factor * reduction factor * sum(compacted file sizes). Let's call this (1 + WF) * sum(file sizes).
- The cost of not doing a compaction is just that reads are slower, because we have to seek more. Let's define R as some unknown factor describing how important read performance is to us, and S as the seek time for each additional store. So the cost of not doing the compaction is R * S * num stores.
So if we combine these, we basically want to minimize (1 + WF) * size(files to compact) - R * S * count(files to compact).
So here's a simple algorithm to minimize that:
sort files by increasing size
for each file:
  if (1+WF)*size(file) < R*S:
    add file to compaction set
  else:
    break
If you rearrange that inequality, you can make it: if size(file) < (R*S)/(1+WF). Basically, what this says is that we should just make file size the heuristic, and tune the minimum file size based on the ratio between how much we care about having a small number of stores versus how much we care about saving sequential I/O on compactions.
The one flaw, if you look closely, is that we can't actually sort by increasing size and compact some set of the smallest ones, because we can only compact a set of files that are contiguous in the sequence. I think we can slightly tweak the algorithm, though, to optimize the same objective while taking that restriction into account; a sketch follows.
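As a minimal sketch of that tweak (the WF, R, and S constants below are made up for illustration; none of them exist as HBase config), scan from the newest file backwards and stop at the first file over the threshold, which keeps the selected set contiguous:

import java.util.ArrayList;
import java.util.List;

public class SizeThresholdSketch {
  static final double WF = 0.8;   // hypothetical write-amplification factor
  static final double R = 2.0e7;  // hypothetical weight on read performance
  static final double S = 1.0;    // hypothetical per-store seek cost

  // Select the contiguous run of newest files whose sizes each fall
  // under the threshold size(file) < (R*S)/(1+WF).
  static List<Long> filesToCompact(long[] fileSizes /* oldest to newest */) {
    double threshold = (R * S) / (1 + WF);
    List<Long> picked = new ArrayList<Long>();
    for (int i = fileSizes.length - 1; i >= 0; i--) {
      if (fileSizes[i] >= threshold) {
        break;  // contiguity: stop at the first too-large file
      }
      picked.add(0, fileSizes[i]);
    }
    return picked;
  }

  public static void main(String[] args) {
    long[] sizes = { 134676288L, 61309324L, 24381446L, 12103493L,
                     5855317L, 2712677L, 1194376L, 532824L };
    // Threshold is ~11.1 MB here, so the four newest files are picked.
    System.out.println(filesToCompact(sizes));
  }
}

On the pathological store above this selects the four files of 5855317 bytes and under. Raising R (caring more about read performance) widens the set, and raising WF narrows it.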
Great stuff, Todd. We've been doing a lot of discussion of these same heuristics recently. Some recent commits have actually changed things that will impact what we can and can't do when deciding which files to compact. For one, deletes are no longer processed during minor compactions. We're also not using (to my knowledge) the relative ages of store files for anything right now, so I'm not sure it still holds that we can only compact neighbors.
I plan on working on this stuff during the hackathon.
Thanks Todd. I like your cut at a better algorithm. As per Jon, it may be the case that the contiguous-only restriction no longer applies after yesterday's commit that leaves deletes in files until major compaction. Also, anything that improves our I/O profile is ++ultra super important.
So, I suggest we move the compaction review out of this issue into a new blocker against 0.20.5. The new issue would be about trying your suggestion and breaking compaction out so it's standalone/testable. Are you interested in this one, Jon?
For this issue, for 0.20.4, I suggest we apply the band-aid. For the super-pathological case you turned up above, the band-aid will mean we only compact two files at a time, which is way sub-optimal, but at least we wouldn't be stuck.
I thought about this a bit on the train this morning (guess this long commute is good for something!)
The real root of this problem is that we are getting "noop" compactions when we are blocked on a compaction happening. Rather than changing the compaction algorithm like in stack's patch, could we simply make it so that it checks whether there are currently flushes blocked on a compaction, and if that's the case, force one to happen?
bq. ...could we simply make it so that it checks whether there are currently flushes blocked on a compaction, and if that's the case, force one to happen?
That's not easy given the current structure.
Here's a second cut at a band-aid. It differs from the first in that we'll compact the last 4 files in the list at a minimum. Review, please. I'm testing it now.
All tests pass w/ this patch in place. I've been running it overnight under cluster loading and nothing untoward. We might be compacting a bit more aggressively now, doing the last 4 files in the list if there are more than 4, but that is probably for the better.
OK, I applied the band-aid so I can make an RC. On commit I changed it so we compact the last 3 files rather than the last 4, to be a little more conservative. I did not apply it to TRUNK.
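For reference, the shape of the band-aid is roughly the following (a paraphrase of the idea, not the committed patch): put a floor on the selection so that if the skip loop would leave fewer than n files, the cut point is pulled back to cover the newest n files.

public class BandaidSketch {
  // Sketch of the band-aid idea (not the actual patch): guarantee that
  // at least the newest n files get compacted, so a no-op selection
  // can't leave flushes blocked forever. n was 4 in bandaid-v2 and 3
  // on commit.
  static int applyFloor(int point, int countOfFiles, int n) {
    if (countOfFiles - point < n && countOfFiles >= n) {
      point = countOfFiles - n;  // back up to include the newest n files
    }
    return point;
  }

  public static void main(String[] args) {
    // The stuck case above: point = 7 of 8 files, floor of 3 on commit.
    System.out.println(applyFloor(7, 8, 3));  // prints 5
  }
}

In the stuck case above (point = 7, one file left out of eight), n = 3 pulls point back to 5, so each pass merges the three newest files and the flush queue can drain.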
[[ Old comment, sent by email on Fri, 16 Apr 2010 17:45:20 +0000 ]]
+1 on Stack's strategy to get an RC out the door (today?)
This issue was closed as part of a bulk closing operation on 2015-11-20. All issues that have been resolved and where all fixVersions have been released have been closed (following discussions on the mailing list).
Here's the full log and stack of the stuck RS.