Thanks a lot for the review Jun! I'll respond to #1 in a separate comment, since it is a design issue that still needs a lot of discussion.
> 2.1 In MerkleTree.Node.insert, why do you increment the depth of the left child even when the node doesn't split?
Node.insert() is only used during split operations (perhaps it is misnamed... but this is not quite a traditional B-Tree). The child to the left is the node that contains the Token we were splitting on, and whenever we split a range we increment its depth to indicate how far it is from being a complete (0,0] range. I'll add a comment to this effect.
> 2.2 In the same function, if the node does split, where is the code to shrink the children list in the splitted node to half?
Node.insert() uses List.subList() and List.clear() inline to take half of its neighbor's children.
> 2.3 In the same function, do you have to keep calling invalidate during insertion?
The original design assumed that the tree was going to live for a while in memory, and be maintained between repair sessions, so the split operation is intended to be used on a tree that might be partially valid. We might be able to have the initial building of the tree skip this check somehow, but I don't think the 'hash == null? hash = null;' check is too intensive.
> 3.1 In Validator.add, there is comment about generating a new range. However, no code does that.
The "private MerkleTree.TreeRangeIterator ranges" variable is an iterator generated by the MerkleTree: it iterates in order over all of the ranges in the tree that have null hashes.
> 3.2 In TreeRange.validateHelper [...] Why do you have to compute multiple hash values recursively?
The reasoning here is that a MerkleTree is supposed to be a sparse representation of a 'perfect/complete' binary tree. Each leaf of the perfect tree represents a hashed range, and each inner node represents a binary hash of its two children. The perfect tree is of depth "hashdepth", so when validateHelper() reaches the maximum/hashdepth, it is in one of the leaves of the perfect tree, and rows are hashed sequentially there. <EDIT comment="Ignore everything in in this tag">If a single call to validate() starts in a TreeRange that is at depth == maxdepth, then what is stored in the MkT.Node is the value of a perfect leaf, otherwise, the MkT.Node is storing the value of a perfect inner node.</EDIT>
I'll probably copy and past this exact explanation into the next revision =x
> 4. I need some text description to really follow the Differencer code.
I'll make sure this gets in the next revision, but basically:
1. It recurses using midpoint as long as both trees have the resolution to continue, and are not equal.
2a. If it finds a range that has only one invalid child, it adds that child range, since that is the smallest possible invalid range contained in the parent.
2b. Otherwise, if both children are invalid (and since it can't recurse deeper), the parent range is entirely invalid, and recursion keeps rolling up until 2a is met.
> 5. The Hashable class is confusing.
You're right: I definitely should have worried more about overloading the word "hash": I'll see what I can do about this one.
> 6. The repair logic is missing in Differencer.
The repair logic is the one FIXME for this ticket:
CASSANDRA-520 deals with implementing the actual repair logic.
Thanks again for taking time out to look at this!