CouchDB / COUCHDB-3065

compact.meta files should not bury headers under GBs of non-header blocks


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Component: Database Core

    Description

      When a db compaction is interrupted and restarted, it's possible for the most recent header written to the .compact.meta file to be buried under many GB of non-header blocks, far from the end of the file. See also https://issues.apache.org/jira/browse/COUCHDB-3061

      Regarding this issue, here's an edited email Paul Davis wrote to me:

      "Compaction itself is dead simple. All we do is walk the database seq_tree (i.e., docs in order of their last update) and copy all related data to a new file. Before the commit you linked (and one related commit before it) we wrote the id_tree to the new file directly. Adam Kocoloski made the observation that writing to the id_tree in the order of the seq_tree would in most cases cause a lot of garbage to be generated, because our id_tree b-tree writes would be out of order. To put that differently: if we have a list of ids and update them in a random order, then during compaction the id_tree sees random writes (because we're writing in order of the seq_tree). This could (and often did) cause a large amount of garbage to be generated during compaction (which is bad because that garbage was in the "compacted" file).
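      The garbage problem described above can be illustrated with a back-of-envelope cost model (this is a hypothetical sketch for intuition, not CouchDB's actual accounting): in an append-only copy-on-write b-tree, each out-of-order insert rewrites the whole root-to-leaf path and orphans the old copies, while streaming sorted keys into a fresh tree writes each node roughly once.

```python
import math

def cow_btree_write_cost(n_keys, fanout=100):
    """Model: append-only copy-on-write b-tree with random-order inserts.

    Every insert rewrites the root-to-leaf path; all but the final copy
    of each node becomes garbage in the file. Returns (nodes written,
    garbage nodes). Purely illustrative, not CouchDB's real node format.
    """
    depth = max(1, math.ceil(math.log(n_keys, fanout)))
    nodes_written = n_keys * depth                 # one full path per insert
    live_nodes = math.ceil(n_keys / fanout) * 2    # rough: leaves + internals
    return nodes_written, nodes_written - live_nodes

def sorted_stream_write_cost(n_keys, fanout=100):
    """Model: bulk-building the tree from sorted input writes each
    node about once, so essentially no garbage is produced."""
    live_nodes = math.ceil(n_keys / fanout) * 2
    return live_nodes, 0
```

For a million ids the random-order model writes millions of garbage nodes, while the sorted-stream model writes only the live tree, which is the intuition behind sorting the id_tree writes in a temp file first.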

      The commit [0] you found is the second of two major commits to take this id_tree and write it to a temp file. At the end of compaction we'd then just copy that id_tree back to the compacted file in order, which severely decreases the amount of garbage in the final compacted file. At the time I remember seeing numbers like 50 GB smaller compacted files in something like a tenth of the total time required for compaction.

      Obviously, given that these are append-only files on disk, when we find that a header hasn't been written for a long time it's because we haven't written a header for a long time (sometimes it helps to state the obvious). So the thing to look for is periods of time during which we don't write a header. You nailed the most likely culprit on the head with emsort, because that's the first place I would look. The decimate function in that module is a bit of a pain because it can do a huge amount of IO without any concern for things like writing headers or updating the task status. If you've ever looked at a compaction at "100%" that runs for hours, I can almost guarantee that it's sitting in the decimate call.

      I should back up a bit. The first draft of the "use temp file during compaction" patch I wrote just moved the id_btree writing during compaction to a temp file. Then at the end it would stream that tree back into the compacted database. The obvious follow-up to that is to use a different data structure than a btree in the temp file. couch_emsort is an "external merge sort" temp file thing. We write a bunch of sorted data to it and then use that to stream back and build a btree in the compaction file. However, during the streaming-back step we can't just open all the sorted runs we wrote, because that might exhaust RAM. So decimate exists to take our partially sorted data and re-sort it to disk so that we have a limited number of places on disk to read from.

      That seems really opaque, so to word it differently: during compaction we collect, say, 100 document ids, sort them, and write them to the temp file. At the end of compaction we can't just load all of those collections of 100 document ids, because that could easily exhaust RAM. So the decimate bit recursively takes groups of ids and merge sorts them back to disk. At a certain point we're then willing to start streaming doc ids back into the compaction file's id_tree.
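      The decimation step described above can be sketched as a bounded-fan-in external merge sort (a hypothetical in-memory stand-in for couch_emsort; the real code operates on disk-resident runs inside the .compact.meta file, and the names here are illustrative):

```python
import heapq

def decimate(runs, max_fanin=8):
    """Repeatedly merge groups of sorted runs until at most max_fanin
    runs remain, so the final streaming pass only ever has a bounded
    number of sources open at once. Each pass reads and rewrites all
    of the data, which is where the long header-free IO stretch comes from.
    """
    runs = [list(r) for r in runs]
    while len(runs) > max_fanin:
        merged = []
        for i in range(0, len(runs), max_fanin):
            merged.append(list(heapq.merge(*runs[i:i + max_fanin])))
        runs = merged
    return runs

def stream(runs):
    """Final pass: merge the surviving runs into one fully ordered
    stream of ids, ready to be built into the id_tree."""
    return list(heapq.merge(*runs))
```

In the real compactor each `list(...)` materialization corresponds to another round trip through the temp file, which is why the phase can run for hours without touching the header.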

      Obviously, with enough document ids that whole decimation bit will do a lot of reading and writing to disk. So pulling a header from that file can be an issue. As an aside, during this time we also don't update the active task status which makes judging compaction progress on large dbs very unfun.

      So to answer your question: yes, not writing a header during the decimate phase is bad, and writing one is something we should do.
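      One minimal shape for that fix, sketched here with hypothetical names (CouchDB's real header logic lives in the Erlang compactor, not in anything like this class), is to thread a counter through the long IO loop and commit a header every N operations:

```python
class HeaderWriter:
    """Illustrative sketch: call tick() once per unit of IO work and a
    recovery header is written every `every_ops` operations, so a
    restarted compaction never has to scan GBs backwards to find one.
    `write_header` is whatever callable commits a header to the file.
    """
    def __init__(self, write_header, every_ops=10_000):
        self.write_header = write_header
        self.every_ops = every_ops
        self.ops = 0

    def tick(self):
        self.ops += 1
        if self.ops % self.every_ops == 0:
            self.write_header()
```

The same counter could drive active-task progress updates, since both problems reduce to noticing that a long stretch of work has gone by without checkpointing.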

      Fun historical fact: views used to have a similar issue. View compaction would finish and send the compacted view file off. If you had a crash after the view file was swapped but before the view was updated, no header was written, so we'd have views that would basically never return as they searched the file for a header. The fix there was to write a header before view compaction swapped files. That's generally what we should look at here, but it may not be the easiest fix, and your patch could give us some more time to figure out a better answer to the decimate issue.

      I have looked at it a bit for the active tasks issue but haven't ever figured out a good solution. I kind of just hand-wave and say that maybe, with a bit more bookkeeping, we could make progress calculations for active tasks (which would be equally useful for deciding when to write headers), but I've never followed through with a patch. If you stare at it more and come up with an idea on that part, I'd be quite interested in reading about it."

      [0] https://github.com/apache/couchdb-couch/commit/9d830590f8a9a699315c78b329a8e80079ed48bd


      People

        Assignee: Unassigned
        Reporter: Jay Doane
        Votes: 0
        Watchers: 2
