The first results are in, and it's looking good.

Algorithm:

For intersectionSize (and intersection), if the sets are near in size, we do a linear scan. A little micro-optimization there got us an extra 13% speedup... just from rearranging the order of comparisons based on which one was more likely to be true (given the set sizes). When the set sizes differ by a factor of 8 or more, we use a modified binary search. The first modification keeps track of the lower bound rather than resetting to 0... a big win (an obvious optimization... I didn't measure the benefit). The second modification probes closer to the lower bound (rather than using the midpoint) based on the relative size difference of the sets, and leads to a 40% performance increase.
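A rough sketch of the two strategies (illustrative only, not the actual Solr code; the 8x cutoff and method names are taken from the description above, and the fractional-probe refinement is omitted for brevity):

```java
class SortedIntersect {
    // Dispatch: linear scan for similar sizes, search-based for skewed sizes.
    static int intersectionSize(int[] a, int[] b) {
        if (a.length > b.length) return intersectionSize(b, a);
        if (b.length >= a.length * 8) return searchIntersect(a, b);
        return scanIntersect(a, b);
    }

    // Linear scan over both sorted sets.
    static int scanIntersect(int[] a, int[] b) {
        int count = 0, i = 0, j = 0;
        while (i < a.length && j < b.length) {
            // Comparison order matters: with similar sizes, a mismatch is
            // more likely than a match, so test the inequalities first.
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { count++; i++; j++; }
        }
        return count;
    }

    // For each element of the small set, binary-search the big set,
    // carrying the lower bound forward instead of resetting to 0.
    static int searchIntersect(int[] small, int[] big) {
        int count = 0, low = 0;
        for (int doc : small) {
            int idx = binarySearch(big, low, big.length - 1, doc);
            if (idx >= 0) { count++; low = idx + 1; }
            else { low = -idx - 1; }  // insertion point becomes new lower bound
        }
        return count;
    }

    // Standard binary search over arr[low..high]; returns -(insertion)-1 on miss.
    static int binarySearch(int[] arr, int low, int high, int key) {
        while (low <= high) {
            int mid = (low + high) >>> 1;
            if (arr[mid] < key) low = mid + 1;
            else if (arr[mid] > key) high = mid - 1;
            else return mid;
        }
        return -(low + 1);
    }
}
```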

Performance results of intersectionSize:

| DocSet sizes | Advantage | Percent | Comments |
|---|---|---|---|
| 1-200 x 1-200 | Int | 53% | random sized DocSets from 1-200 in size intersected with other DocSets of size 1-200 |
| 1-3000 x 1-3000 | Int | 33% | 3000 is the current default upper bound of HashDocSet |
| 1-3000 x 1-3000 | Int | 130% | -client instead of -server |
| 2000-3000 x 2000-3000 | Int | 60% | only big sets |
| 20000-30000 x 20000-30000 | Int | 54% | only bigger sets |
| 100-200 x 1000-2000 | Hash | 87% | small sets with big sets |
| 1-10000 x 20000-30000 | Int | 74% | smaller sets intersected with BitDocSets |
| 1-30000 x 1-30000 | Int | 80% | docsets over maxDoc/64 are BitDocSets (maxDoc=1M) |

So to sum up, only small sets intersected with big sets are slower... but given that big sets intersected with big sets take the majority of the time, we get a nice net win. The advantage gets more dramatic when intersecting a small set with a BitDocSet... these effects are probably due to nicer treatment of the memory subsystem when accessing memory in order. I think these intersections tend to be bound by memory bandwidth.

The improvements will also allow us to bump up the max size of the "small set" implementation. From a memory consumption point of view, the break-even point is maxDoc/32. When I tested using SortedIntDocSets with maxDoc/64, there was always a net speedup over maxDoc/32 and maxDoc/100, so this seems to be a good balance between performance and saving memory.
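The maxDoc/32 break-even can be checked with quick arithmetic (assumptions: 4-byte ints for the sorted set, one bit per possible doc for a bitset, object overhead ignored):

```java
class BreakEven {
    // Memory for a sorted int set of numDocs ids: 4 bytes per doc.
    static long sortedSetBytes(long numDocs) {
        return 4 * numDocs;
    }

    // Memory for a bitset covering maxDoc docs: one bit per doc.
    static long bitSetBytes(long maxDoc) {
        return maxDoc / 8;
    }
}
```

Setting 4n equal to maxDoc/8 gives n = maxDoc/32, so below that size the sorted int set is the smaller representation.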

Memory savings: SortedIntDocSet is more efficient than HashDocSet at storing the same amount of data, and it can be used at larger sizes (relative to maxDoc) before performance decreases (another memory win).

Other savings: Faster set creation - Lucene currently delivers docs in order, hence no sorting step is needed after collection.

Other notes: I tried a cool partitioning algorithm that I thought would be superior - take the middle element of the big set and use it to partition the small set. Say you have set sizes of 100 and 10000... you do a single binary search on the small set, and now for all 100 elements you have half the big set size to search. Recurse on the corresponding lower and upper partitions until they are small enough to use a different method such as a modified binary search. This approach worked, but it wasn't able to beat the modified binary search alone once I put in all the optimizations... **shrugs**
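The partitioning idea can be sketched like this (a hypothetical reconstruction from the description above, not the code that was actually tried; the cutoff of 8 and the plain-binary-search fallback are placeholders):

```java
class PartitionIntersect {
    // Intersect small[sLo..sHi] with big[bLo..bHi], both sorted ascending:
    // split the small set around the big set's middle element, then recurse
    // on the matching lower and upper halves.
    static int size(int[] small, int sLo, int sHi, int[] big, int bLo, int bHi) {
        if (sLo > sHi || bLo > bHi) return 0;
        if (sHi - sLo < 8) {
            // Base case: partition is small enough, search element by element.
            int count = 0;
            for (int i = sLo; i <= sHi; i++) {
                if (java.util.Arrays.binarySearch(big, bLo, bHi + 1, small[i]) >= 0) count++;
            }
            return count;
        }
        int bMid = (bLo + bHi) >>> 1;
        int pivot = big[bMid];
        // One binary search on the small set splits it around the pivot.
        int pos = java.util.Arrays.binarySearch(small, sLo, sHi + 1, pivot);
        int match = pos >= 0 ? 1 : 0;
        int split = pos >= 0 ? pos : -pos - 1;
        // Elements below the pivot can only match big[bLo..bMid-1], and
        // elements above it only big[bMid+1..bHi].
        return match
            + size(small, sLo, split - 1, big, bLo, bMid - 1)
            + size(small, match == 1 ? split + 1 : split, sHi, big, bMid + 1, bHi);
    }
}
```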

We need skipping in our DocSets in order to be able to pass them as filters and improve search performance.

One could sort a HashDocSet every time it's used... but that's not desirable.

The way Lucene now uses filters, random access performance is no longer important, but being able to efficiently skip is.
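For a sorted int array, skipping could look something like this, in the style of Lucene's DocIdSetIterator nextDoc()/advance(target) contract (the class and sentinel here are illustrative assumptions, not actual Solr code):

```java
class SortedIntIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;  // end-of-iteration sentinel
    private final int[] docs;  // sorted ascending
    private int pos = -1;

    SortedIntIterator(int[] docs) { this.docs = docs; }

    int nextDoc() {
        return ++pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    // Skip to the first doc >= target, searching only past the current position.
    int advance(int target) {
        int lo = pos + 1, hi = docs.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (docs[mid] < target) lo = mid + 1;
            else hi = mid - 1;
        }
        pos = lo;
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }
}
```

Random access is O(log n) here versus O(1) for a bitset, which is why the shift in how filters are consumed matters: advance() is cheap on a sorted array even though get(docId) would not be.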

Intersection performance is very important to Solr of course, so if we can get SortedIntDocSet performance fast enough then it would make sense for it to replace HashDocSet.

I've already started working on this, and the results look promising. I'll post a draft soon.