Yonik, I think it is possible to make a reasonable assumption about the distribution of the vints. Lucene stores deltas between document IDs instead of document IDs, and I guess (no data available) most frequencies will be below 128.
In the example given below I take into account the bit for the frequency that is stored with the document ID delta.
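As a side note, that frequency bit works roughly like this (a hedged sketch with hypothetical names, not the exact Lucene code): the document delta is shifted left one bit, and the low bit is set when the frequency is 1, so in the common case no separate frequency vint has to be stored.

```java
public class FreqBitSketch {
    // Sketch of how a doc delta and the freq==1 flag can share one vint:
    // the delta occupies the high bits, the low bit flags freq == 1.
    static int encode(int docDelta, int freq) {
        return (docDelta << 1) | (freq == 1 ? 1 : 0);
    }

    static int decodeDelta(int encoded) {
        return encoded >>> 1;
    }

    static boolean freqIsOne(int encoded) {
        return (encoded & 1) != 0;
    }
}
```

Note that the shift halves the range that still fits in a given vint width, which is why the frequency bit matters when estimating how many 1-byte and 2-byte vints are read.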
Suppose you have 50 million documents. For a term occurring in 1 in 50 documents, 1 million 1-byte vints are needed. For a term occurring in 1 in 2000 documents, 25,000 2-byte vints are needed. For an occurrence of 1 in 100,000 documents, 500 3-byte vints are needed. Combined with my guess about the frequencies, many more 1-byte and 2-byte vints than 3-byte and longer vints have to be read. So the algorithm for readVInt() should at least perform very well for 1-byte and 2-byte vints.
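The vint widths behind those numbers follow from the 7 payload bits per byte, so the breakpoints are at 2^7 = 128, 2^14 = 16384 and 2^21 = 2097152. A small sketch (my own helper, not Lucene code) to compute the width of a delta:

```java
public class VIntLength {
    // Number of bytes needed to encode a non-negative value as a vint:
    // each byte carries 7 payload bits plus a continuation bit.
    static int vintLength(long value) {
        int bytes = 1;
        while (value >= 0x80) {
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(vintLength(50));      // delta for a 1-in-50 term -> 1 byte
        System.out.println(vintLength(2000));    // 1-in-2000 term -> 2 bytes
        System.out.println(vintLength(100000));  // 1-in-100000 term -> 3 bytes
    }
}
```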
To determine whether or not it is worth improving the algorithm, I wrote an isolated test: no I/O, as little GC as possible, etc. See the attached Java file.
I tested four algorithms:
- the current one;
- the unrolled one of this issue;
- a new one I came up with that uses exclusive or (see below);
- the one from Yonik.
The exclusive or algorithm started from the same assumption as Yonik's about the first byte: you can exit after only one read when no more bytes follow. Then I tried to think of a way to reduce the number of bitwise operations. This line of thinking was based on the code of cardinality() in java.util.BitSet, which uses special bit patterns to count the bits in a long as fast as possible. This is the algorithm I ended up with:
int i = readByte();
if (i < 0) {
  i ^= readByte() << 7;
  if (i >= 0) {
    i ^= readByte() << 14;
    if (i < 0) {
      i ^= readByte() << 21;
      if (i >= 0) {
        i ^= readByte() << 28;
        return i ^ 0x0FE03F80;
      }
      return i ^ 0xFFE03F80;
    }
    return i ^ 0x00003F80;
  }
  return i ^ 0xFFFFFF80;
}
return i;
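To see why the final XOR masks are correct, it helps to run the decoder against a standard vint encoder. Below is a self-contained round-trip sketch of my own (reading from a byte array instead of a Lucene IndexInput; the class and method names are mine, not Lucene's):

```java
import java.io.ByteArrayOutputStream;

public class VIntXorCheck {
    private final byte[] data;
    private int pos;

    VIntXorCheck(byte[] data) { this.data = data; }

    private byte readByte() { return data[pos++]; }

    // The exclusive-or decoder: sign extension of each byte flips the
    // upper bits, the sign of i tells whether more bytes follow, and a
    // single XOR mask on exit undoes the accumulated flipped bits.
    int readVIntXor() {
        int i = readByte();
        if (i < 0) {
            i ^= readByte() << 7;
            if (i >= 0) {
                i ^= readByte() << 14;
                if (i < 0) {
                    i ^= readByte() << 21;
                    if (i >= 0) {
                        i ^= readByte() << 28;
                        return i ^ 0x0FE03F80;
                    }
                    return i ^ 0xFFE03F80;
                }
                return i ^ 0x00003F80;
            }
            return i ^ 0xFFFFFF80;
        }
        return i;
    }

    // Standard vint encoder: 7 payload bits per byte, the high bit set
    // on every byte except the last.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    public static void main(String[] args) {
        int[] values = {0, 1, 127, 128, 2000, 16383, 16384, 100000, Integer.MAX_VALUE};
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int v : values) writeVInt(out, v);
        VIntXorCheck in = new VIntXorCheck(out.toByteArray());
        for (int v : values) {
            int decoded = in.readVIntXor();
            if (decoded != v) throw new AssertionError(v + " != " + decoded);
        }
        System.out.println("all values round-tripped");
    }
}
```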
In the tests all values returned by the algorithms are verified to ensure an optimization does not cause invalid values to be returned.
The statistics of the test are:
- 6 sets of data with 38.4 million integers per set: random size, 1-byte, 2-byte, 3-byte, 4-byte and 5-byte;
- 1 warming up and 3 individually measured loops per data set;
- per loop each algorithm reads the data set 5 times consecutively.
In total each algorithm reads 38.4 million integers * 6 data sets * 4 loops * 5 reads = over 4.6 billion integers per test run. Across the four algorithms, over 18 billion integers are read in the total test.
The tests are run in a single JVM for all algorithms, data sets, loops, etc.
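The loop structure above can be sketched as follows (hypothetical names and a placeholder for the actual decoding; the attached Java file contains the real harness):

```java
public class LoopSketch {
    // Placeholder work standing in for decoding one data set once;
    // the real test calls readVInt() for every integer in the set.
    static long readDataSetOnce(int[] dataSet) {
        long checksum = 0;
        for (int v : dataSet) checksum += v;
        return checksum;
    }

    // One loop: the data set is read 5 times consecutively.
    static long oneLoop(int[] dataSet) {
        long checksum = 0;
        for (int pass = 0; pass < 5; pass++) checksum += readDataSetOnce(dataSet);
        return checksum;
    }

    public static void main(String[] args) {
        int[] dataSet = {1, 2, 3};
        oneLoop(dataSet);                                 // 1 warm-up loop
        for (int loop = 0; loop < 3; loop++) {            // 3 measured loops
            long start = System.currentTimeMillis();
            oneLoop(dataSet);
            long elapsed = System.currentTimeMillis() - start;
            System.out.println("loop " + loop + ": " + elapsed + " ms");
        }
    }
}
```

The checksum keeps the JIT from eliminating the loop body as dead code, which would otherwise make the timings meaningless.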
I ran the tests on 7 JVMs. Some JVMs have a client and a server variant, resulting in 12 JVM configurations:
- BEA JDK 1.4.2 client;
- BEA JDK 1.4.2 server;
- BEA JDK 1.5.0 client;
- BEA JDK 1.5.0 server;
- IBM JRE 1.4.2;
- IBM JDK 1.5.0;
- Sun JDK 1.4.2 client;
- Sun JDK 1.4.2 server;
- Sun JDK 1.5.0 client;
- Sun JDK 1.5.0 server;
- Sun JDK 1.6.0b2 client;
- Sun JDK 1.6.0b2 server.
I disconnected all networks on the laptop I ran the tests on. As few services as possible were running, and the firewall was also shut down. The only applications running were Eclipse and the test. The screen saver and energy saving were disabled.
Here is the configuration used to run the tests:
- Intel Pentium M 1.73 GHz;
- 1 GB RAM;
- Windows XP Pro SP 2;
- Lucene trunk;
- Eclipse 3.1.2 Java compiler.
Please note that some disk I/O was observed while the tests were running. Even though I believe no swapping should have been needed, swapping might have occurred.
After running all the tests I:
- removed the slowest result of each algorithm for each JVM and dataset combination;
- averaged the two remaining results;
- divided all results by the time of the current algorithm;
- selected the 'best' algorithms for each JVM.
The best algorithm is defined as: ((1-byte time - best 1-byte time) + (2-byte time - best 2-byte time)) < 0.02. This means that for an algorithm to be considered among the best, its combined difference from the best 1-byte and 2-byte times must be less than 2%.
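As a one-line sketch of that selection rule (the method name is mine; the times are the normalized ratios described above, i.e. algorithm time divided by the current algorithm's time):

```java
public class BestCriterion {
    // An algorithm counts as 'best' when its combined 1-byte and 2-byte
    // overhead relative to the per-column best times stays below 2%.
    static boolean isAmongBest(double t1, double best1, double t2, double best2) {
        return (t1 - best1) + (t2 - best2) < 0.02;
    }
}
```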
Here are the results, best algorithm first. See the attached PDF for more information:
- exclusive or;
- combine on exit (Yonik's one);
Across all JVMs the exclusive or algorithm saves 3.5-4.8%.
As Yonik already mentioned at the URL above, a lot depends on the JVM used to run the tests. You should test with the JVM on which you are going to deploy. A colleague of mine also mentioned that the compiler used could influence the results. Unfortunately I do not have the time to test all combinations of compilers and JVMs.