Had some fun rebasing, but think everything looks good now. A few things to note:
1) I'm not sure what you mean by not "serializing those" - for correctness I serialize all of the data in a node. Do you want me to change the serialization methods so that these values are not sent? I don't log them at the other end, but I would prefer they were sent, both to ensure no surprises for users of the data and because some optimisations to difference() rely on knowing the number of rows for each sub-tree. It's not a tremendous amount of data, after all.
2) I've modified DifferencerTest and created two versions of the testDifference() method: one that tests differences on an empty tree, and one that tests a tree that has been populated with rows. Previously only the former was tested. The split is needed because the changes I made to difference() for my previous patch, which I have retained and which ensure contiguous ranges are emitted where possible, treat the entire empty tree as one contiguous difference range (since the only non-empty sub-range in the tree is different), which was breaking the previous test. The original assertions now run against the fully populated tree, and the empty-tree version now confirms that the whole tree is considered different when it is empty. You may prefer not to include these improvements in this patch, but it seems a good idea to me whilst the code is being modified, and given that I'd already made the change. Since we're not logging the ranges themselves at this time it won't have any direct impact, but it will be useful if that ever changes, and might help with future debugging.
3) I've updated the MerkleTreeTest methods to test the serialization and difference changes, and introduced a new HistogramBuilderTest.
4) The histogram is built differently from my first patch; the approach is described in HistogramBuilder. Basically, rather than creating neat linear ranges, I calculate the mean and create ranges that are multiples of the standard deviation either side of the mean, up to min/max (or, in this case, up to 3 stdevs, plus one range to min/max).
5) One thing we might want to consider changing is the format of the EstimatedHistogram ranges in the log messages. I've faithfully reproduced the boundary conventions of the EstimatedHistogram, but this is not a user-friendly convention - it has an exclusive lower bound and an inclusive upper bound, as opposed to the more typical opposite convention. As such we get ranges like (-1, 0] to represent the range containing only 0, as opposed to [0, 1).
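To make points 4 and 5 concrete, here is a rough sketch of the idea. It is not the code in HistogramBuilder. The class and method names are hypothetical, and it assumes integer values, a population standard deviation, boundaries rounded to the nearest long, and clamping to min/max. It places boundaries at multiples of the standard deviation either side of the mean, then prints each range in both bound conventions.

```java
import java.util.TreeSet;

// Hypothetical illustration only; not the patch's HistogramBuilder.
final class StdDevRangeSketch {

    // Returns sorted boundaries b[0] < b[1] < ... where range i is (b[i], b[i+1]].
    // The first boundary is min - 1 so that min itself falls inside the first range.
    static long[] boundaries(long[] values, int maxDeviations) {
        if (values.length == 0) throw new IllegalArgumentException("no values");
        long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
        double sum = 0;
        for (long v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
            sum += v;
        }
        double mean = sum / values.length;
        double squares = 0;
        for (long v : values) squares += (v - mean) * (v - mean);
        double stdev = Math.sqrt(squares / values.length); // population stdev (assumption)

        TreeSet<Long> bounds = new TreeSet<>();
        bounds.add(min - 1); // exclusive lower bound of the first range
        bounds.add(max);     // inclusive upper bound of the last range
        for (int k = -maxDeviations; k <= maxDeviations; k++) {
            long b = Math.round(mean + k * stdev);
            // Keep only boundaries strictly inside the observed span.
            if (b > min - 1 && b < max) bounds.add(b);
        }
        long[] result = new long[bounds.size()];
        int i = 0;
        for (long b : bounds) result[i++] = b;
        return result;
    }

    // EstimatedHistogram-style rendering: exclusive lower, inclusive upper.
    static String exclusiveInclusive(long lo, long hi) {
        return "(" + lo + ", " + hi + "]";
    }

    // The same integer range rendered as inclusive lower, exclusive upper.
    static String inclusiveExclusive(long lo, long hi) {
        return "[" + (lo + 1) + ", " + (hi + 1) + ")";
    }

    public static void main(String[] args) {
        long[] b = boundaries(new long[] {2, 4, 4, 4, 5, 5, 7, 9}, 3);
        for (int i = 0; i + 1 < b.length; i++)
            System.out.println(exclusiveInclusive(b[i], b[i + 1])
                               + " == " + inclusiveExclusive(b[i], b[i + 1]));
    }
}
```

For integer values the two renderings describe the same set, so switching the log format would only be a presentational change: (-1, 0] and [0, 1) both contain just 0.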
Think that's everything. Should respond quickly to queries at the moment, so drop me a line if you have any questions.