(This was written together with Erik Holstad.)
We don't have any profiling stats or end-to-end numbers at the moment, but we do have some early, naive timing tests comparing the two implementations. Already, for very few inserts (10 to 15 entries, no deletes, 2 storefiles + memcache), I saw around a 2x improvement. These tests were only run on the server side, and I have not tested bigger tables or from the client, since I only got the RPC working the other day and we've done some reworks since then, but it looks promising.
When we looked at the get code and how to make it faster, we started by looking at which structures were used and how many compares you have to do to get your result, because the way to speed things up is to cut down the total number of compares. Of course the total fetch time is made up of a number of pieces, but on the server, outside of HFile/HDFS, I think most time is spent comparing entries to decide whether they should be included in the result or not (a decision made by looking at the query to see if the entry was asked for, comparing against deletes, checking TTL, maxVersions requested, timeRange, etc.).
So what I saw when looking at the old code is that every new entry from the storefile is compared against all entries in the get set, and then every match between those must additionally be checked to see if it has been deleted.
Let k be the number of KeyValues we will look at. Let l be the number of columns asked for by the client.
Old Way: We compare every entry from the storefile (k) with every entry from the get set (l), which gives you k * l compares in total.
If the get set is sorted (the storefile/memcache always are), then you are just doing a sorted merge between two sorted lists.
New Way: We do a sorted merge between the entries in the storefile (k) and the entries in the sorted get set (l), which gives you k + l compares in total.
The difference between k * l and k + l can be significant if you are matching more than one column.
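To make the two compare counts concrete, here is an illustrative sketch (not HBase code; plain strings stand in for KeyValues, and the real comparator is much richer) of the nested-loop match versus the sorted merge:

```python
def nested_loop_matches(storefile, getset):
    """Old way: compare every storefile entry against every get-set entry."""
    compares = 0
    matches = []
    for kv in storefile:            # k entries
        for col in getset:          # l entries -> k * l compares total
            compares += 1
            if kv == col:
                matches.append(kv)
    return matches, compares

def sorted_merge_matches(storefile, getset):
    """New way: both inputs are sorted, so advance two cursors in lockstep."""
    compares = 0
    matches = []
    i = j = 0
    while i < len(storefile) and j < len(getset):
        compares += 1
        if storefile[i] == getset[j]:
            matches.append(storefile[i])
            i += 1
        elif storefile[i] < getset[j]:
            i += 1
        else:
            j += 1
    return matches, compares        # at most k + l compares

# k = 6 storefile entries, l = 3 requested columns
storefile = ["a", "b", "c", "d", "e", "f"]
getset = ["b", "d", "f"]
```

With these inputs the nested loop does 6 * 3 = 18 compares while the merge does 8 (bounded by 6 + 3 = 9), and both find the same matches.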
- the current plan is to build the sorted set behind the scenes on the client, inside the Get
- this insertion sort will have a negligible impact on clients and no additional work will be required server-side
- we will send the sorted set over the wire as a list, so it will not have to be re-sorted on the server.
(this last optimization is important. right now we do a LOT of serializing a sorted tree, sending it over the wire, and then deserializing it. deserializing a tree actually rebuilds the entire tree, re-performing the sort. using lists (which are already sorted) will give us time and memory savings.)
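The client-side plan above can be sketched like this (the class and method names here are hypothetical, not the actual HBase Get API; Python's bisect stands in for the insertion sort):

```python
import bisect

class Get:
    """Toy stand-in for a client Get that keeps its columns sorted."""

    def __init__(self):
        self.columns = []           # kept sorted at all times

    def add_column(self, col):
        # insertion sort behind the scenes; negligible cost for
        # the small column counts a typical Get carries
        bisect.insort(self.columns, col)

    def to_wire(self):
        # serialize as a flat list: deserializing a list preserves order,
        # unlike rebuilding a tree, which re-performs the whole sort
        return list(self.columns)

g = Get()
for c in ["colC", "colA", "colB"]:
    g.add_column(c)
```

The server can then treat the received list as already sorted and feed it straight into the sorted merge.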
On the delete part, we have two things to deal with. Checking current KVs against the current DeleteSet, and adding newly seen deletes into the DeleteSet.
Let k be the number of KeyValues we will look at. Let m be the number of deletes we have already seen. Let n be the number of new deletes we will see.
Old way of checking current KVs against the current DeleteSet:
- DeleteSet is a sortedSet, so it has log(m) inserts/gets/deletes/contains operations.
- For each KV seen (k), we have a log(m) operation to see if it is deleted.
- This will give you k * log(m) total comparisons.
Old way of inserting newly seen delete KVs into current DeleteSet:
- When we see a delete, we insert it into the DeleteSet in log(m). This has downsides.
- We're adding deletes which we know will not be relevant in this storefile (increasing m makes every subsequent delete check more expensive, eventually log(m + n)).
- We're doing work we may never have to do (merging newly seen deletes into the previous set). If we early-out of this storefile for some reason, there was never a reason to pay anything but a constant-time cost for deletes aimed at older storefiles that will never end up being opened.
- Trees are bad! Their allocation of lots of small nodes is a disaster for GC. We should use lists everywhere we can.
- This gives you n * log(m) total comparisons.
The new approach is similar to how we now handle the get list. Deletes will be stored in two lists: a currentDeleteList and a newDeleteList.
To start, the memcache has no deletes to check against. You will, however, add any deletes seen to the newDeleteList (in constant time, no significant extra work). Since the memcache (and storefiles) are sorted, the newDeleteList comes out sorted.
When moving to the next storefile, you will merge newDeleteList with currentDeleteList.
New way of handling deletes:
- currentDeleteList is a sorted ArrayList of deletes. there is a DeleteFamily special case that just stores a timestamp (which implies its existence).
- We do a sorted merge of KVs seen (k) with the currentDeleteList (m), for k + m total comparisons.
- Each seen new delete is a O(1) append to the newDeleteList.
- At the end of the storefile, you will then need to merge the delete lists, for m + n total comparisons.
- This will give you (k + m) + (m + n) total comparisons for each storefile.
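The steps above can be sketched as follows (again an illustrative sketch, not HBase code: a "delete" here is just a key that masks any KV with the same key, whereas real delete semantics with families, columns, and timestamps are richer):

```python
def scan_storefile(kvs, current_deletes):
    """Sorted merge of KVs seen (k) against currentDeleteList (m):
    k + m compares. New deletes go to newDeleteList via O(1) appends."""
    results = []
    new_deletes = []
    j = 0
    for key, is_delete in kvs:                    # kvs are sorted by key
        if is_delete:
            new_deletes.append(key)               # constant-time append;
            continue                              # irrelevant in this file
        while j < len(current_deletes) and current_deletes[j] < key:
            j += 1                                # merge cursor never rewinds
        if j < len(current_deletes) and current_deletes[j] == key:
            continue                              # masked by an older delete
        results.append(key)
    return results, new_deletes

def merge_deletes(current, new):
    """At the end of the storefile: sorted merge, m + n compares."""
    merged = []
    i = j = 0
    while i < len(current) and j < len(new):
        if current[i] <= new[j]:
            merged.append(current[i]); i += 1
        else:
            merged.append(new[j]); j += 1
    merged.extend(current[i:])
    merged.extend(new[j:])
    return merged

# one storefile's KVs: ("key", is_delete), with a delete carried over
# from a newer file already in currentDeleteList
kvs = [("a", False), ("b", True), ("c", False), ("d", False)]
current = ["c"]
```

Note that the merge of the two delete lists is deferred until the storefile boundary, so if we early-out of a storefile we never pay for it.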
The difference in total is: Old = (k + n) * log(m) vs New = (k + n) + 2m
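Plugging some sample numbers into the two totals gives a feel for the gap (log base 2 here is just a rough model of the sorted-set lookup cost; the exact base doesn't change the shape of the comparison):

```python
import math

# k KVs scanned, m deletes already seen, n new deletes in this storefile
k, m, n = 1000, 100, 10

old = (k + n) * math.log2(m)     # old way: ~6710 comparisons
new = (k + n) + 2 * m            # new way: 1210 comparisons
```

The linear term 2m only hurts when the delete set is large relative to the KVs scanned, which is not the common case.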
The algorithmic improvements seem clear. The replacement of Trees/SortedSets with (Sorted)Lists/ArrayLists will also be beneficial for GC/heap usage.
Additional complexity in things like merging delete lists is not a big deal as long as we have thorough unit tests (which we have already written for that case).