Details
-
Improvement
-
Status: Closed
-
Minor
-
Resolution: Fixed
-
None
-
None
-
None
-
New
Description
rcmuir has been mentioning the idea for quite some time that we might be able to write the decoding logic in such a way that Java would use SIMD instructions. More recently paul.masurel wrote a blog post that raises the point that Lucene could still do decode multiple ints at once in a single instruction by packing two ints in a long and we've had some discussions about what we could try in Lucene to speed up the decoding of postings. This made me want to look a bit deeper at what we could do.
Our current decoding logic reads data in a byte[] and decodes packed integers from it. Unfortunately it doesn't make use of SIMD instructions and looks like this.
I confirmed by looking at the generated assembly that if I take an array of integers and shift them all by the same number of bits then Java will use SIMD instructions to shift multiple integers at once. This led me to writing this implementation that tries as much as possible to shift long sequences of ints by the same number of bits to speed up decoding. It is indeed faster than the current logic we have, up to about 2x faster for some numbers of bits per value.
Currently the best implementation I've been able to come up with combines the above idea with the idea that Paul mentioned in his blog that consists of emulating SIMD from Java by packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a bit harder to read but gives another speedup on top of the above implementation.
I have a JMH benchmark available in case someone would like to play with this and maybe even come up with an even faster implementation. It is 2-2.5x faster than our current implementation for most numbers of bits per value. I'm copying results here:
* `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves as a baseline. * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the current Lucene codec does. * `decodeNaiveFromLongs` decodes from longs on the fly. * `decodeSimpleSIMD` is a simple implementation that relies on how Java recognizes some patterns and uses SIMD instructions. * `decodeSIMD` is a more complex implementation that both relies on the C2 compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or 2 ints in a long in order to decompress multiple values at once. Benchmark (bitsPerValue) (byteOrder) Mode Cnt Score Error Units PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 LE thrpt 5 12.912 ± 0.393 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 BE thrpt 5 12.862 ± 0.395 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 LE thrpt 5 13.040 ± 1.162 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 BE thrpt 5 13.027 ± 0.270 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 LE thrpt 5 12.409 ± 0.637 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 BE thrpt 5 12.268 ± 0.947 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 LE thrpt 5 14.177 ± 2.263 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 BE thrpt 5 11.457 ± 0.150 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromBytes 5 LE thrpt 5 10.988 ± 1.179 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromBytes 5 BE thrpt 5 11.226 ± 0.088 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromBytes 6 LE thrpt 5 9.791 ± 0.305 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromBytes 6 BE thrpt 5 9.403 ± 3.598 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromBytes 7 LE thrpt 5 10.256 ± 0.211 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromBytes 7 BE thrpt 5 10.314 ± 0.382 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromBytes 8 LE thrpt 5 16.516 ± 0.380 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromBytes 8 BE thrpt 5 16.375 ± 0.427 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromBytes 9 LE thrpt 5 9.067 ± 0.066 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromBytes 9 BE thrpt 5 9.078 ± 0.178 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromBytes 10 LE thrpt 5 8.913 ± 0.074 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromBytes 10 BE thrpt 5 8.893 ± 0.101 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromBytes 11 LE thrpt 5 7.908 ± 0.118 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromBytes 11 BE thrpt 5 7.864 ± 0.097 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromBytes 12 LE thrpt 5 9.220 ± 0.103 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromBytes 12 BE thrpt 5 9.186 ± 0.241 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromBytes 13 LE thrpt 5 7.119 ± 0.071 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromBytes 13 BE thrpt 5 7.066 ± 0.059 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromBytes 14 LE thrpt 5 12.483 ± 0.171 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromBytes 14 BE thrpt 5 12.473 ± 0.117 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromBytes 15 LE thrpt 5 6.202 ± 0.192 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromBytes 15 BE thrpt 5 6.187 ± 0.399 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromBytes 16 LE thrpt 5 12.798 ± 0.249 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromBytes 16 BE thrpt 5 12.987 ± 0.208 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromLongs 1 LE thrpt 5 7.248 ± 0.096 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromLongs 1 BE thrpt 5 7.292 ± 0.114 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromLongs 2 LE thrpt 5 8.923 ± 0.099 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromLongs 2 BE thrpt 5 8.899 ± 0.028 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromLongs 3 LE thrpt 5 9.192 ± 0.082 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromLongs 3 BE thrpt 5 9.090 ± 0.066 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromLongs 4 LE thrpt 5 7.947 ± 0.039 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromLongs 4 BE thrpt 5 7.809 ± 0.298 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromLongs 5 LE thrpt 5 8.342 ± 0.568 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromLongs 5 BE thrpt 5 8.259 ± 0.572 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromLongs 6 LE thrpt 5 15.594 ± 0.149 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromLongs 6 BE thrpt 5 14.012 ± 0.160 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromLongs 7 LE thrpt 5 12.686 ± 0.271 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromLongs 7 BE thrpt 5 12.806 ± 0.160 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromLongs 8 LE thrpt 5 13.571 ± 0.135 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromLongs 8 BE thrpt 5 13.312 ± 0.110 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromLongs 9 LE thrpt 5 11.812 ± 0.108 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromLongs 9 BE thrpt 5 12.874 ± 0.168 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromLongs 10 LE thrpt 5 12.882 ± 0.114 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromLongs 10 BE thrpt 5 12.142 ± 0.091 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromLongs 11 LE thrpt 5 12.302 ± 0.111 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromLongs 11 BE thrpt 5 10.762 ± 0.250 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromLongs 12 LE thrpt 5 12.505 ± 0.070 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromLongs 12 BE thrpt 5 12.149 ± 0.083 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromLongs 13 LE thrpt 5 11.159 ± 0.341 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromLongs 13 BE thrpt 5 10.395 ± 0.222 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromLongs 14 LE thrpt 5 11.004 ± 0.058 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromLongs 14 BE thrpt 5 10.312 ± 0.369 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromLongs 15 LE thrpt 5 11.236 ± 0.117 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromLongs 15 BE thrpt 5 9.792 ± 0.202 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromLongs 16 LE thrpt 5 10.607 ± 0.105 ops/us PackedIntsDecodeBenchmark.decodeNaiveFromLongs 16 BE thrpt 5 10.340 ± 0.070 ops/us PackedIntsDecodeBenchmark.decodeSIMD 1 LE thrpt 5 20.925 ± 0.368 ops/us PackedIntsDecodeBenchmark.decodeSIMD 1 BE thrpt 5 13.396 ± 0.485 ops/us PackedIntsDecodeBenchmark.decodeSIMD 2 LE thrpt 5 20.628 ± 0.494 ops/us PackedIntsDecodeBenchmark.decodeSIMD 2 BE thrpt 5 13.584 ± 0.194 ops/us PackedIntsDecodeBenchmark.decodeSIMD 3 LE thrpt 5 19.932 ± 1.609 ops/us PackedIntsDecodeBenchmark.decodeSIMD 3 BE thrpt 5 13.296 ± 0.095 ops/us PackedIntsDecodeBenchmark.decodeSIMD 4 LE thrpt 5 21.065 ± 0.767 ops/us PackedIntsDecodeBenchmark.decodeSIMD 4 BE thrpt 5 13.557 ± 0.051 ops/us PackedIntsDecodeBenchmark.decodeSIMD 5 LE thrpt 5 19.630 ± 0.067 ops/us PackedIntsDecodeBenchmark.decodeSIMD 5 BE thrpt 5 12.916 ± 0.186 ops/us PackedIntsDecodeBenchmark.decodeSIMD 6 LE thrpt 5 20.253 ± 0.701 ops/us PackedIntsDecodeBenchmark.decodeSIMD 6 BE thrpt 5 12.820 ± 0.048 ops/us PackedIntsDecodeBenchmark.decodeSIMD 7 LE thrpt 5 18.944 ± 0.160 ops/us PackedIntsDecodeBenchmark.decodeSIMD 7 BE thrpt 5 12.562 ± 0.128 ops/us PackedIntsDecodeBenchmark.decodeSIMD 8 LE thrpt 5 22.778 ± 2.023 ops/us PackedIntsDecodeBenchmark.decodeSIMD 8 BE thrpt 5 13.658 ± 0.158 ops/us PackedIntsDecodeBenchmark.decodeSIMD 9 LE thrpt 5 18.527 ± 0.169 ops/us PackedIntsDecodeBenchmark.decodeSIMD 9 BE thrpt 5 12.045 ± 0.111 ops/us PackedIntsDecodeBenchmark.decodeSIMD 10 LE thrpt 5 16.610 ± 0.997 ops/us PackedIntsDecodeBenchmark.decodeSIMD 10 BE thrpt 5 11.208 ± 0.087 ops/us PackedIntsDecodeBenchmark.decodeSIMD 11 LE thrpt 5 17.961 ± 0.080 ops/us PackedIntsDecodeBenchmark.decodeSIMD 11 BE thrpt 5 11.594 ± 0.084 ops/us PackedIntsDecodeBenchmark.decodeSIMD 12 LE thrpt 5 16.980 ± 2.372 ops/us PackedIntsDecodeBenchmark.decodeSIMD 12 BE thrpt 5 11.135 ± 0.050 ops/us PackedIntsDecodeBenchmark.decodeSIMD 13 LE thrpt 5 17.592 ± 0.269 ops/us PackedIntsDecodeBenchmark.decodeSIMD 13 BE thrpt 5 11.132 ± 0.227 ops/us PackedIntsDecodeBenchmark.decodeSIMD 14 LE thrpt 5 16.964 ± 0.423 ops/us PackedIntsDecodeBenchmark.decodeSIMD 14 BE thrpt 5 10.953 ± 0.326 ops/us PackedIntsDecodeBenchmark.decodeSIMD 15 LE thrpt 5 17.972 ± 0.572 ops/us PackedIntsDecodeBenchmark.decodeSIMD 15 BE thrpt 5 10.872 ± 0.150 ops/us PackedIntsDecodeBenchmark.decodeSIMD 16 LE thrpt 5 24.152 ± 0.213 ops/us PackedIntsDecodeBenchmark.decodeSIMD 16 BE thrpt 5 12.984 ± 0.348 ops/us PackedIntsDecodeBenchmark.decodeSimpleSIMD 1 LE thrpt 5 14.567 ± 0.714 ops/us PackedIntsDecodeBenchmark.decodeSimpleSIMD 1 BE thrpt 5 10.541 ± 0.079 ops/us PackedIntsDecodeBenchmark.decodeSimpleSIMD 2 LE thrpt 5 15.395 ± 0.687 ops/us PackedIntsDecodeBenchmark.decodeSimpleSIMD 2 BE thrpt 5 11.142 ± 0.052 ops/us PackedIntsDecodeBenchmark.decodeSimpleSIMD 3 LE thrpt 5 15.802 ± 0.623 ops/us PackedIntsDecodeBenchmark.decodeSimpleSIMD 3 BE thrpt 5 10.656 ± 0.278 ops/us PackedIntsDecodeBenchmark.decodeSimpleSIMD 4 LE thrpt 5 17.732 ± 0.276 ops/us PackedIntsDecodeBenchmark.decodeSimpleSIMD 4 BE thrpt 5 11.289 ± 0.209 ops/us PackedIntsDecodeBenchmark.decodeSimpleSIMD 5 LE thrpt 5 16.230 ± 0.389 ops/us PackedIntsDecodeBenchmark.decodeSimpleSIMD 5 BE thrpt 5 10.216 ± 0.184 ops/us PackedIntsDecodeBenchmark.decodeSimpleSIMD 6 LE thrpt 5 16.478 ± 0.682 ops/us PackedIntsDecodeBenchmark.decodeSimpleSIMD 6 BE thrpt 5 10.379 ± 0.157 ops/us PackedIntsDecodeBenchmark.decodeSimpleSIMD 8 LE thrpt 5 18.222 ± 0.388 ops/us PackedIntsDecodeBenchmark.decodeSimpleSIMD 8 BE thrpt 5 11.153 ± 0.619 ops/us PackedIntsDecodeBenchmark.decodeSimpleSIMD 10 LE thrpt 5 15.138 ± 0.321 ops/us PackedIntsDecodeBenchmark.decodeSimpleSIMD 10 BE thrpt 5 9.384 ± 0.671 ops/us PackedIntsDecodeBenchmark.decodeSimpleSIMD 16 LE thrpt 5 20.776 ± 0.397 ops/us PackedIntsDecodeBenchmark.decodeSimpleSIMD 16 BE thrpt 5 10.199 ± 0.146 ops/us PackedIntsDecodeBenchmark.readLongs 1 LE thrpt 5 30.220 ± 0.652 ops/us PackedIntsDecodeBenchmark.readLongs 1 BE thrpt 5 16.324 ± 0.226 ops/us PackedIntsDecodeBenchmark.readLongs 2 LE thrpt 5 30.952 ± 0.329 ops/us PackedIntsDecodeBenchmark.readLongs 2 BE thrpt 5 16.492 ± 0.397 ops/us PackedIntsDecodeBenchmark.readLongs 3 LE thrpt 5 30.156 ± 0.979 ops/us PackedIntsDecodeBenchmark.readLongs 3 BE thrpt 5 16.273 ± 0.441 ops/us PackedIntsDecodeBenchmark.readLongs 4 LE thrpt 5 29.925 ± 0.718 ops/us PackedIntsDecodeBenchmark.readLongs 4 BE thrpt 5 15.930 ± 0.350 ops/us PackedIntsDecodeBenchmark.readLongs 5 LE thrpt 5 29.773 ± 0.979 ops/us PackedIntsDecodeBenchmark.readLongs 5 BE thrpt 5 15.775 ± 0.257 ops/us PackedIntsDecodeBenchmark.readLongs 6 LE thrpt 5 29.591 ± 1.285 ops/us PackedIntsDecodeBenchmark.readLongs 6 BE thrpt 5 15.732 ± 0.226 ops/us PackedIntsDecodeBenchmark.readLongs 7 LE thrpt 5 29.708 ± 0.909 ops/us PackedIntsDecodeBenchmark.readLongs 7 BE thrpt 5 15.433 ± 0.562 ops/us PackedIntsDecodeBenchmark.readLongs 8 LE thrpt 5 29.828 ± 0.689 ops/us PackedIntsDecodeBenchmark.readLongs 8 BE thrpt 5 15.390 ± 0.188 ops/us PackedIntsDecodeBenchmark.readLongs 9 LE thrpt 5 29.127 ± 0.309 ops/us PackedIntsDecodeBenchmark.readLongs 9 BE thrpt 5 15.180 ± 0.199 ops/us PackedIntsDecodeBenchmark.readLongs 10 LE thrpt 5 29.085 ± 0.538 ops/us PackedIntsDecodeBenchmark.readLongs 10 BE thrpt 5 14.887 ± 1.687 ops/us PackedIntsDecodeBenchmark.readLongs 11 LE thrpt 5 28.904 ± 0.329 ops/us PackedIntsDecodeBenchmark.readLongs 11 BE thrpt 5 14.936 ± 0.119 ops/us PackedIntsDecodeBenchmark.readLongs 12 LE thrpt 5 29.025 ± 0.299 ops/us PackedIntsDecodeBenchmark.readLongs 12 BE thrpt 5 14.685 ± 0.154 ops/us PackedIntsDecodeBenchmark.readLongs 13 LE thrpt 5 28.963 ± 0.244 ops/us PackedIntsDecodeBenchmark.readLongs 13 BE thrpt 5 14.569 ± 0.100 ops/us PackedIntsDecodeBenchmark.readLongs 14 LE thrpt 5 28.584 ± 1.409 ops/us PackedIntsDecodeBenchmark.readLongs 14 BE thrpt 5 14.340 ± 0.594 ops/us PackedIntsDecodeBenchmark.readLongs 15 LE thrpt 5 28.744 ± 0.314 ops/us PackedIntsDecodeBenchmark.readLongs 15 BE thrpt 5 14.222 ± 0.105 ops/us PackedIntsDecodeBenchmark.readLongs 16 LE thrpt 5 26.638 ± 0.452 ops/us PackedIntsDecodeBenchmark.readLongs 16 BE thrpt 5 13.906 ± 0.604 ops/us
The thing that is a bit frustrating is that the best throughputs are obtained on a ByteBuffer that is configured to use the little endian byte order (which is the native byte order of my machine) while Java/Lucene default to big endian. So if we want that kind of throughput we'll probably need to add ways to read data in the native byte order in the IndexInput API.
Attachments
Issue Links
- links to