[COUCHDB-988] Rewrite couch_key_tree.erl - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Won't Fix
Affects Version/s: None
Fix Version/s: None
Component/s: Database Core
Labels:
None

Skill Level:
Committers Level (Medium to Hard)

Description

The current key tree module is a fairly complicated piece of code with enough subtlety to fill three romance novels. This ticket is a proposal to rewrite the current module with an algorithm that will be more easy to reason about. Here I'll write a brief explanation of the current algorithms, and then a short proposal for a simpler algorithm.

A key tree is used to represent the edit history of a document. Each node of the tree represents a particular version of the document. Relations between nodes represent the order that these edits were applied. For instance, a simple tree with a single linear path A->B->C means that the edit C was based on the version B which was in turn based on A. In a world without replication (and no ability to disable MVCC checks), all histories would be forced to be linear lists of edits due to constraints imposed by MVCC (ie, new edits must be based on the current version). However, we have replication, so we must deal with not so easy cases.

Consider a document in state A. This doc is replicated to a second node. We then edit the document one each node leaving it in two different states, B and C. We now have two key trees, A->B and A->C. When we go to replicate a second time, the key tree must combine these two trees which gives us A->(B|C). For the astute reader, this is how conflicts are introduced. In terms of the key tree, we say that we have two leaves (B and C) that are not deleted. Hence, conflict. To remove a conflict, one of the edits (B or C) can be deleted, which results in, A->(B|C->D) where D is an edit that is specially marked with a deleted=true flag.

Also, of note which will help us down the line, remember that there's another completely different possible mode of operation. Given two couchdb instances, we could create two different docs with the same id. Now we have two documents with edit histories E and F. Now after replication we have edit history (E|D) but really, this isn't a tree. Its two roots to two different trees. Our algorithms still work, we just have to consider multiple trees when we apply them. Remember this after this next aside.

Hopefully from that brief description of simple operations its fairly intuitive how we can end up with arbitrarily deep and branching trees due to distributed edits. Of note here is that this operation parallels the nature of Git's commit model. Each commit sha1 depends on the parent sha1 creating a sort of inverted merkle tree.

At this point I'll mention a quick point on the implementation of these trees. In Erlang, they are implemented as a list of nested tuples with the roots being the most shallow node. For instance, A->B->C could be represented as roughly, [{A, [{B, [

{C}

]}]}] which in english: "a single element list of a two element tuple, where the first element is the key, and the second element is a single element list of a two tuple....turtles."

The reason I make this note on implementation is that it should be apparent that the depth of this data structure is going to be linearly dependent on the edit history of a document. As we found out quickly, Erlang also apparently has some core C functions that are written recursively. As everyone knows about recursion in C, there is a point where one more turtle is one too many. Or, in other words, we run out of room on the stack to allocate more frames.

At this point we had a couple options. Patch Erlang (we did, it was rejected IIRC). Change our representation (could work, but would break some of the algorithmic assumptions, especially with how Erlang's SSA works). Or, add pruning to the revision history. This is what revision stemming is.

Revision stemming in a nutshell means "Keep only the last N edits". But then its a tree, so its actually, "keep the last N edits per leaf." The tradeoff here is that stemming may introduce a conflict by accident. For instance, Given A->B on one node, replicate to a second, edit it so that its A->B->C->D. Then applying stemming with an N of 2, we end up with A->B on node 1, and C->D on node 2. After replication, this turns into (A->B|C->D) ie, we weren't able to see that C was a descendant of B. This is generally considered an acceptable tradeoff. You just need to make sure you replicate more often than the N rev_stemming edits.

The second important note about rev_stemming is that it changes our storage of trees. The current implementation keeps track of how many edits have been discarded. Ie, in revisions you see in documents of the format "Number-Hash", that number is basically how deep that revision is in the path from root. We then leverage this information when merging stemmed trees so that we know how deep we need to look to start matching edits.

One final minor piece of implementation. Each key in the tree also carries a value. This can either be a document body about to be written, or perhaps the location in the file where that revision can be read.

The bug that was uncovered in ~~COUCHDB-968~~ was that the start-of-tree position information was used in such a way that we ended up choosing the wrong value. Ie, we were choosing to re-write the same document instead of discarding it for the version that was already on disk. This ended up cascading through about 3 critical sections causing the end result of dupes in _changes and _all_docs after compaction.

The fix that Adam has put together was to first fix the decision making process by making a specific preference for which tree's value is used for identical revisions. Due to more than a few subtleties this has take a bit to get right with all the intervening key tree code. Hence, my desire to rewrite it.

The last thing I'll write up is an off the cuff approach that I thought of this morning. I don't necessarily think this is the only way to rewrite the key tree, but it seems easier (before coding). If anyone has ideas for other approaches that's just as well as the only goal here is to reduce the complexity factor.

The current code was originally designed around the concept of rooted trees of edits. Ie, Git. As we added features like stemming, the actual problem has shifted towards considering a larger set of possibly unrelated trees. If we add one constraint, it is possible for us to reconsider the algorithms in terms of more general graph approaches. That one extra constraint:

1. keys (ie, revisions) must be unique within the set of trees (ie, single docid).

The general algorithm for merging N trees is something like:

1. For each tree, create a list of parent/child edges.
2. Find the union of all edges.
3. Rebuild the tree.

There'd be more to work out of course. Depth would disappear, which would affect revision patterns visible to the client. Though BigCouch has changed revisions as well. I'm sure there are probably other things to work out.

So, that's all I've got right now. Its a bit of a twinkle in the eye, but I figured I'd get my thoughts on the current code down on paper so I don't have to re-think them in the future.

Rewrite couch_key_tree.erl

Details

Description

Attachments

Attachments

Activity

People

Dates