Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Fix Version: 0.20.4
- Component: None
- Labels: None
Description
Testing 0.20_pre_durability@934643, I ended up in a state where one region server got stuck compacting a single region over and over again, forever. This was with a special config with a very low flush threshold, in order to stress-test the flush/compact code.
Attachments
- ASF.LICENSE.NOT.GRANTED--log.gz (92 kB, Todd Lipcon)
- ASF.LICENSE.NOT.GRANTED--stack (34 kB, Todd Lipcon)
- ASF.LICENSE.NOT.GRANTED--bandaid-v2.patch (2 kB, Michael Stack)
Activity
Audit logs are just showing the "stuck" RS doing this:
2010-04-15 22:08:00,574 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=todd,todd,adm,git ip=/192.168.42.43 cmd=listStatus src=/hbase/test1/1736416594/actions dst=null perm=null
2010-04-15 22:08:00,576 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=todd,todd,adm,git ip=/192.168.42.43 cmd=mkdirs src=/hbase/test1/compaction.dir/1736416594 dst=null perm=todd:supergroup:rwxr-xr-x
2010-04-15 22:08:00,577 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=todd,todd,adm,git ip=/192.168.42.43 cmd=delete src=/hbase/test1/compaction.dir/1736416594 dst=null perm=null
2010-04-15 22:08:01,575 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=todd,todd,adm,git ip=/192.168.42.43 cmd=listStatus src=/hbase/test1/1736416594/actions dst=null perm=null
(i.e., a no-op compaction)
I managed to use the log-level servlet to get debug logs of the no-op compactions:
2010-04-15 22:24:05,594 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region test1,4993900000,1271390277054 has too many store files, putting it back at the end of the flush queue.
2010-04-15 22:24:05,594 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for region test1,4993900000,1271390277054/1736416594 because: regionserver/192.168.42.43:60020.cacheFlusher
2010-04-15 22:24:05,594 INFO org.apache.hadoop.hbase.regionserver.HRegion: Starting compaction on region test1,4993900000,1271390277054
2010-04-15 22:24:05,597 DEBUG org.apache.hadoop.hbase.regionserver.Store: Skipped compaction of 1 file; compaction size of actions: 231.5m; Skipped 7 files, size: 242232921
2010-04-15 22:24:05,598 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on region test1,4993900000,1271390277054 in 0sec
Looks like the compaction code is basically deciding that there's nothing worth compacting here. Here are the files in the order the compaction code looks at them (oldest to newest), with their sizes:
7876304170849844721 2010-04-16 03:58:04 -rw-r--r-- 3 todd supergroup 134676288 2010-04-15 20:58 /hbase/test1/1736416594/actions/7876304170849844721
3289606064411356106 2010-04-16 03:59:43 -rw-r--r-- 3 todd supergroup 61309324 2010-04-15 20:59 /hbase/test1/1736416594/actions/3289606064411356106
4995952634622872563 2010-04-16 04:00:42 -rw-r--r-- 3 todd supergroup 24381446 2010-04-15 21:00 /hbase/test1/1736416594/actions/4995952634622872563
6876448782185582993 2010-04-16 04:01:03 -rw-r--r-- 3 todd supergroup 12103493 2010-04-15 21:01 /hbase/test1/1736416594/actions/6876448782185582993
7381870009831588255 2010-04-16 04:01:18 -rw-r--r-- 3 todd supergroup 5855317 2010-04-15 21:01 /hbase/test1/1736416594/actions/7381870009831588255
5274234463019618354 2010-04-16 04:01:23 -rw-r--r-- 3 todd supergroup 2712677 2010-04-15 21:01 /hbase/test1/1736416594/actions/5274234463019618354
3688488928995595533 2010-04-16 04:01:32 -rw-r--r-- 3 todd supergroup 1194376 2010-04-15 21:01 /hbase/test1/1736416594/actions/3688488928995595533
5321716733066884905 2010-04-16 04:01:33 -rw-r--r-- 3 todd supergroup 532824 2010-04-15 21:01 /hbase/test1/1736416594/actions/5321716733066884905
Here is the code:
if (!majorcompaction && !references) {
  // Here we select files for incremental compaction.
  // The rule is: if the largest (oldest) one is more than twice the
  // size of the second, skip the largest, and continue to next...,
  // until we meet the compactionThreshold limit.
  for (point = 0; point < countOfFiles - 1; point++) {
    if ((fileSizes[point] < fileSizes[point + 1] * 2) &&
        (countOfFiles - point) <= maxFilesToCompact) {
      break;
    }
    skipped += fileSizes[point];
  }
  filesToCompact = new ArrayList<StoreFile>(
      filesToCompact.subList(point, countOfFiles));
  ...
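To make the failure mode concrete, here is a small standalone harness (mine, not HBase code; maxFilesToCompact = 10 is an assumed config value, it isn't shown in the logs) that replays the loop above on the exact file sizes from this store:

public class CompactionSelectionDemo {
  public static void main(String[] args) {
    // Store file sizes from the listing above, oldest to newest.
    long[] fileSizes = { 134676288L, 61309324L, 24381446L, 12103493L,
                         5855317L, 2712677L, 1194376L, 532824L };
    int countOfFiles = fileSizes.length;
    int maxFilesToCompact = 10;  // assumption, not from the logs
    long skipped = 0;
    int point;
    for (point = 0; point < countOfFiles - 1; point++) {
      if ((fileSizes[point] < fileSizes[point + 1] * 2) &&
          (countOfFiles - point) <= maxFilesToCompact) {
        break;
      }
      skipped += fileSizes[point];
    }
    // Every file is more than twice the size of the next, so the first
    // clause never fires and the loop skips all the way to the newest file.
    System.out.println("files left: " + (countOfFiles - point) +
        ", skipped: " + point + " files, size: " + skipped);
  }
}

This prints "files left: 1, skipped: 7 files, size: 242232921", matching the DEBUG line above: a single file is selected, the Store treats that as nothing to do, and the flusher immediately re-queues the region.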
Todd did a listing of the Store:
-rw-r--r-- 3 todd supergroup 134676288 2010-04-15 20:58 /hbase/test1/1736416594/actions/7876304170849844721
-rw-r--r-- 3 todd supergroup 61309324 2010-04-15 20:59 /hbase/test1/1736416594/actions/3289606064411356106
-rw-r--r-- 3 todd supergroup 24381446 2010-04-15 21:00 /hbase/test1/1736416594/actions/4995952634622872563
-rw-r--r-- 3 todd supergroup 1194376 2010-04-15 21:01 /hbase/test1/1736416594/actions/3688488928995595533
-rw-r--r-- 3 todd supergroup 2712677 2010-04-15 21:01 /hbase/test1/1736416594/actions/5274234463019618354
-rw-r--r-- 3 todd supergroup 532824 2010-04-15 21:01 /hbase/test1/1736416594/actions/5321716733066884905
-rw-r--r-- 3 todd supergroup 12103493 2010-04-15 21:01 /hbase/test1/1736416594/actions/6876448782185582993
-rw-r--r-- 3 todd supergroup 5855317 2010-04-15 21:01 /hbase/test1/1736416594/actions/7381870009831588255
If you trace through it, the oldest is > 2 * the next oldest, and so on. Some of the timestamps are the same, so I'm not sure how that plays out... (the listing above is not strictly ordered; files with the same timestamp may not be in proper chronological order).
This issue got me thinking about why the heuristic is the way it is, and I don't quite follow.
I stepped away for a minute and tried to come up with a cost model for why it is we do minor compactions. Let me run this by everyone:
- The cost of doing a compaction is the cost of reading all of the files plus the cost of writing the newly compacted file. The new file might be a bit smaller than the sum of the originals, so the total cost = sum(compacted file sizes) + write cost factor * reduction factor * sum(compacted file sizes). Let's call this (1 + WF) * sum(file sizes).
- The cost of not doing a compaction is just that reads are slower, because we have to seek more. Let's define R as some unknown factor describing how important read performance is to us, and S as the seek time for each additional store. So the cost of not doing the compaction is R * S * num stores.
So if we combine these, we basically want to minimize (1 + WF) * size(files to compact) - R * S * count(files to compact).
So here's a simple algorithm to minimize that:
sort files by increasing size
for each file:
  if (1+WF)*size(file) < R*S:
    add file to compaction set
  else:
    break
If you rearrange that inequality, you can make it: if size(file) < (R*S)/(1+WF). Basically, what this says is that we should just make file size the heuristic, and tune the minimum file size based on the ratio between how much we care about having a small number of stores versus how much we care about saving sequential I/O on compactions.
The one flaw, if you look closely, is that we can't actually sort by increasing size and compact some set of the smallest ones, because we can only compact a set of files that are contiguous in the sequence. I think we can slightly tweak the algorithm, though, to optimize the same objective while taking that restriction into account; a sketch follows.
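As a minimal sketch of that tweak (the WF, R, and S constants below are made up for illustration; none of them exist as HBase config), scan from the newest file backwards and stop at the first file over the threshold, which keeps the selected set contiguous:

import java.util.ArrayList;
import java.util.List;

public class SizeThresholdSketch {
  static final double WF = 0.8;   // hypothetical write-amplification factor
  static final double R = 2.0e7;  // hypothetical weight on read performance
  static final double S = 1.0;    // hypothetical per-store seek cost

  // Select the contiguous run of newest files whose sizes each fall
  // under the threshold size(file) < (R*S)/(1+WF).
  static List<Long> filesToCompact(long[] fileSizes /* oldest to newest */) {
    double threshold = (R * S) / (1 + WF);
    List<Long> picked = new ArrayList<Long>();
    for (int i = fileSizes.length - 1; i >= 0; i--) {
      if (fileSizes[i] >= threshold) {
        break;  // contiguity: stop at the first too-large file
      }
      picked.add(0, fileSizes[i]);
    }
    return picked;
  }

  public static void main(String[] args) {
    long[] sizes = { 134676288L, 61309324L, 24381446L, 12103493L,
                     5855317L, 2712677L, 1194376L, 532824L };
    // Threshold is ~11.1 MB here, so the four newest files are picked.
    System.out.println(filesToCompact(sizes));
  }
}

On the pathological store above this selects the four files of 5855317 bytes and under. Raising R (caring more about read performance) widens the set, and raising WF narrows it.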
Great stuff, Todd. We've been doing a lot of discussion of these same heuristics recently. Some recent commits have actually changed things that will impact what we can and can't do when deciding which files to compact. For one, deletes are no longer processed during minor compactions. We're also not using (to my knowledge) the relative ages of store files for anything right now, so I'm not sure it still holds that we can only compact neighbors.
I plan on working on this stuff during the hackathon.
Thanks Todd. I like your cut at a better algorithm. As per Jon, it may be the case that the contiguous-only restriction no longer applies after yesterday's commit that leaves deletes in files until major compaction. Also, anything that improves our I/O profile is ++ultra super important.
So, I suggest we move the compaction review out of this issue into a new blocker against 0.20.5. The new issue would be about trying your suggestion and breaking compaction out so it's standalone/testable. Are you interested in this one, Jon?
For this issue, for 0.20.4, I suggest we apply the band-aid. For the super-pathological case you turned up above, the band-aid will mean we only compact two files at a time, which is way sub-optimal, but at least we wouldn't be stuck.
I thought about this a bit on the train this morning (guess this long commute is good for something!)
The real root of this problem is that we are getting "noop" compactions when we are blocked on a compaction happening. Rather than changing the compaction algorithm like in stack's patch, could we simply make it so that it checks whether there are currently flushes blocked on a compaction, and if that's the case, force one to happen?
bq. ...could we simply make it so that it checks whether there are currently flushes blocked on a compaction, and if that's the case, force one to happen?
That's not easy given the current structure.
Here's a second cut at a band-aid. It differs from the first in that we'll compact the last 4 files in the list at a minimum. Review, please. I'm testing it now.
All tests pass w/ this patch in place. I've been running it overnight under cluster loading and nothing untoward. We might be compacting a bit more aggressively now, doing the last 4 files in the list if there are more than 4, but that is probably for the better.
OK, I applied the band-aid so I can make an RC. On commit I changed it so we compact the last 3 files rather than the last 4, to be a little more conservative. I did not apply it to TRUNK.
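For reference, the shape of the band-aid is roughly the following (a paraphrase of the idea, not the committed patch): put a floor on the selection so that if the skip loop would leave fewer than n files, the cut point is pulled back to cover the newest n files.

public class BandaidSketch {
  // Sketch of the band-aid idea (not the actual patch): guarantee that
  // at least the newest n files get compacted, so a no-op selection
  // can't leave flushes blocked forever. n was 4 in bandaid-v2 and 3
  // on commit.
  static int applyFloor(int point, int countOfFiles, int n) {
    if (countOfFiles - point < n && countOfFiles >= n) {
      point = countOfFiles - n;  // back up to include the newest n files
    }
    return point;
  }

  public static void main(String[] args) {
    // The stuck case above: point = 7 of 8 files, floor of 3 on commit.
    System.out.println(applyFloor(7, 8, 3));  // prints 5
  }
}

In the stuck case above (point = 7, one file left out of eight), n = 3 pulls point back to 5, so each pass merges the three newest files and the flush queue can drain.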
[[ Old comment, sent by email on Fri, 16 Apr 2010 17:45:20 +0000 ]]
+1 on Stack's strategy to get an RC out the door (today?)
This issue was closed as part of a bulk closing operation on 2015-11-20. All issues that have been resolved and where all fixVersions have been released have been closed (following discussions on the mailing list).
Here's the full log and stack of the stuck RS.