The first section is for 1M values in the structure, the second is for 10M. As the CPU on the test machine (Intel T2400) has only 2MB of level 2 cache, the increased processing time for the seemingly same amount of work is an effect of more cache misses.
Caching also accounts for why the packed version is sometimes better than the aligned. For values representable in 9 or 17 bits, the aligned version needs 16 and 32 bits respectively. In the case with 1M values, the packed version uses 1.1MB and 2.1MB for 9 and 17 bits respectively, while the aligned uses 2MB and 4MB. The simpler logic of the aligned version does not compensate enough for the higher number of trips to main memory.
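As a sanity check of the sizes quoted above, here is a small sketch (class and method names are purely illustrative, not from the patch) that reproduces the quoted footprints for 1M values, assuming the aligned version rounds bits/value up to the next power of two so no value spans blocks:

```java
public class FootprintSketch {
    static long packedBytes(long valueCount, int bitsPerValue) {
        long bits = valueCount * bitsPerValue; // values butt up against each other
        return (bits + 7) / 8;                 // round up to whole bytes
    }

    static long alignedBytes(long valueCount, int bitsPerValue) {
        // round bits/value up to the next power of two (8, 16, 32, ...)
        int aligned = 8;
        while (aligned < bitsPerValue) aligned <<= 1;
        return valueCount * aligned / 8;
    }

    public static void main(String[] args) {
        System.out.println(packedBytes(1_000_000, 9));   // 1,125,000 bytes, ~1.1MB
        System.out.println(alignedBytes(1_000_000, 9));  // 2,000,000 bytes, 2MB
        System.out.println(packedBytes(1_000_000, 17));  // 2,125,000 bytes, ~2.1MB
        System.out.println(alignedBytes(1_000_000, 17)); // 4,000,000 bytes, 4MB
    }
}
```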
OK, though, I think we should somehow include these CPU cache effects
in the test. In a real production situation, lots of other things
are competing for that cache, so I think the mostly-cache-miss case is
most realistic here.
Actually a good way to do that is to test it in a wider context, eg
once I can get
LUCENE-2186 "integrated", we can test sorting by int
field with these different ways of encoding.
I did not generate any specialized code for the aligned case: no matter the number of bits/value, the number of shifts, masks and ors is always the same. If the number of bits/value is known beforehand, specialized cases should be used (I made a factory that selects between packed, aligned and direct (#3), depending on the number of bits/value).
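The constant amount of work in the aligned case can be sketched roughly like this: a read is always exactly one shift and one mask, regardless of bits/value. This is a hedged sketch assuming 32-bit blocks and power-of-two bits/value; the class and field names are illustrative, not from the patch:

```java
public class AlignedReader {
    private final int[] blocks;
    private final int bitsPerValue;   // rounded up to a power of two <= 32
    private final int valuesPerBlock; // values never span two blocks
    private final int mask;

    public AlignedReader(int[] blocks, int bitsPerValue) {
        this.blocks = blocks;
        this.bitsPerValue = bitsPerValue;
        this.valuesPerBlock = 32 / bitsPerValue;
        this.mask = (int) ((1L << bitsPerValue) - 1);
    }

    public int get(int index) {
        int block = blocks[index / valuesPerBlock];
        int shift = (index % valuesPerBlock) * bitsPerValue;
        return (block >>> shift) & mask; // always one shift, one mask
    }
}
```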
OK I have a start at this under
LUCENE-2186. I'll try to clean up &
The reason for not doing so in the first place is that I wanted to let the structure auto-adjust the bits/value when a new value was added. Having different implementations encapsulated in the same class means another level of indirection or conditionals, both of which I wanted to avoid for performance reasons. That being said, I haven't tested how much of a penalty this would be.
The standard use case seems to be some sort of update-round, after which no updates are performed. Having a cleanupAndOptimize-call that potentially creates a new and optimized structure, would fit well into this and would avoid the indirection / conditional penalty.
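One hypothetical shape for such an update round (all names here are made up for illustration): collect values in a growable buffer, track the maximum, and let the cleanup call determine the bits/value a tightly packed replacement structure would need:

```java
public class UpdateRound {
    private long[] values = new long[16];
    private int size;
    private long max;

    public void add(long value) {
        if (size == values.length) {
            values = java.util.Arrays.copyOf(values, size * 2);
        }
        values[size++] = value;
        if (value > max) {
            max = value;
        }
    }

    // bits/value the optimized structure would need for the observed max
    public int requiredBitsPerValue() {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(max));
    }
}
```

A cleanupAndOptimize call could then build a packed structure with exactly `requiredBitsPerValue()` bits, so readers pay neither indirection nor conditional cost.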
I think the writing API can be a write-once API, where I declare the
maxValue I will write. This fits with the indexing chain, where we
buffer values in RAM and then flush to the segment files.
This way we don't need the complexity of having to resize the bits
internally when a new max value is seen.
So, the factory would take number of values we will write, the max
value, and I guess an enum for Packed/Aligned/DedicatedArray, and
return a simple writer that just exposes an "add(long value)" method?
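Something like the following minimal sketch, perhaps (Storage, PackedWriterFactory and Writer are made-up names; the buffer here is a plain long[] stand-in for the real packed backing store):

```java
enum Storage { PACKED, ALIGNED, DIRECT }

class Writer {
    final long[] buffer;    // stand-in for the real packed backing store
    final int bitsPerValue;
    int upto;

    Writer(int valueCount, int bitsPerValue) {
        this.buffer = new long[valueCount];
        this.bitsPerValue = bitsPerValue;
    }

    void add(long value) {
        buffer[upto++] = value;
    }
}

class PackedWriterFactory {
    static Writer getWriter(int valueCount, long maxValue, Storage storage) {
        // bits needed for maxValue; fixed up front, so no resizing mid-write
        int bits = Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
        // real code would pick a specialized implementation per (storage, bits)
        return new Writer(valueCount, bits);
    }
}
```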
A whole other matter is long vs. int. I've tried using longs instead of ints as the backing array, and the penalty on my 32-bit processor was very high (I need to make some more tests on this). If it must be possible to set and get longs, it's hard to avoid using long as the internal structure, but if ints are accepted as the only valid values, selecting long as the backing array for 64-bit machines and int for 32-bit might be the solution.
Hmm... I guess we could still support values requiring more than 32
bits, encoded into int, but it's more hairy as the value could span
2 or 3 ints.
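For illustration, reading one such straddling value could look like this (a hedged sketch assuming little-endian bit order within and across ints; SpanningReader is a made-up name). A 17-bit value starting at bit offset 20 spans 2 ints; a 64-bit value at an odd offset can span 3:

```java
public class SpanningReader {
    static long get(int[] blocks, long bitPos, int bitsPerValue) {
        long result = 0;
        int remaining = bitsPerValue;
        while (remaining > 0) {
            int blockIndex = (int) (bitPos / 32);
            int offset = (int) (bitPos % 32);        // bit offset within current int
            int take = Math.min(32 - offset, remaining);
            long chunk = (blocks[blockIndex] >>> offset) & ((1L << take) - 1);
            result |= chunk << (bitsPerValue - remaining);
            bitPos += take;
            remaining -= take;
        }
        return result;
    }
}
```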
Probably we should specialize/default the factory selection based on
whether the JVM is 32/64 bit? And, whether the bit size is <= 32.
All this calls for a factory approach to hide the fairly complex task of choosing the right implementation.
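A sketch of what that factory could look at (class name and string tags are illustrative): bits/value first, then the JVM's data model, per the 32/64-bit discussion above. The "sun.arch.data.model" property is HotSpot-specific, hence the hedged fallback to "os.arch":

```java
public class PackedIntsFactory {
    static boolean jvmIs64Bit() {
        // HotSpot-specific property; fall back to os.arch elsewhere
        String model = System.getProperty("sun.arch.data.model");
        if (model != null) {
            return model.contains("64");
        }
        return System.getProperty("os.arch", "").contains("64");
    }

    static String choose(int bitsPerValue) {
        if (bitsPerValue == 8 || bitsPerValue == 16 || bitsPerValue == 32) {
            return "DIRECT";    // a plain byte[]/short[]/int[]
        }
        if (bitsPerValue > 32 || jvmIs64Bit()) {
            return "PACKED64";  // long[] backing
        }
        return "PACKED32";      // int[] backing on 32-bit JVMs
    }
}
```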
I made some small tweaks to improve performance and added long-backed versions of Packed (optimal space) and Aligned (no values span underlying blocks), then ran the performance tests on 5 different computers. It seems very clear that level 2 cache (and presumably RAM speed, but I do not know how to determine that without root access on a Linux box) plays a bigger role for access speed than mere CPU speed. One 3GHz machine with 1MB of level 2 cache was about half as fast as a 1.8GHz laptop with 2MB of level 2 cache.
There are a whole lot of measurements and they are getting hard to digest. I've attached logs from the 5 computers, should anyone want to have a look. Some observations are:
1. The penalty of using long instead of int on my 32-bit laptop depends on the number of values in the array. For less than a million values it is severe: the long version is 30-60% slower, depending on whether packed or aligned values are used. Above that, it was 10% slower for Aligned and 25% slower for Packed.
On the other hand, 64-bit machines do not seem to care much whether int or long is used: there was a 10% win for arrays below 1M on one machine, a 50% win for arrays below 100K (8% for 1M, 6% for 10M) on another, and a small loss of below 1% for all lengths above 10K on a third.
2. There's a sharp drop-off in speed when the array reaches a certain size, correlated with level 2 cache size. After that, the speed does not decrease much as the array grows. This also affects direct writes to an int and has the interesting implication that a packed array out-performs the direct access approach for writes in a number of cases. For reads, there's no contest: direct access to int is blazingly fast.
3. The access-speed of the different implementations converges when the number of values in the array rises (think 10M+ values): The slow round-trip to main memory dwarfs the logic used for value-extraction.
Observation #3 supports Mike McCandless' choice of going for the packed approach, and #1 suggests using int as the internal structure for now. Using int as the internal structure makes it unfeasible to accept longs as input (or rather: longs with more than 32 significant bits). I don't know if this is acceptable?
I think we should agree on an API that
LUCENE-2186 can use to
write/read packed ints, then get our two patches talking. I'll pull
out the barebones packed ints I'm currently using, and post
here... then let's merge/iterate to a common API, so I can cutover to
the patch from here in