Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-9027

SIMD-based decoding of postings lists

Details

    • Improvement
    • Status: Closed
    • Minor
    • Resolution: Fixed
    • None
    • 8.4
    • None
    • None
    • New

    Description

      rcmuir has been mentioning the idea for quite some time that we might be able to write the decoding logic in such a way that Java would use SIMD instructions. More recently paul.masurel wrote a blog post that raises the point that Lucene could still do decode multiple ints at once in a single instruction by packing two ints in a long and we've had some discussions about what we could try in Lucene to speed up the decoding of postings. This made me want to look a bit deeper at what we could do.

      Our current decoding logic reads data in a byte[] and decodes packed integers from it. Unfortunately it doesn't make use of SIMD instructions and looks like this.

      I confirmed by looking at the generated assembly that if I take an array of integers and shift them all by the same number of bits then Java will use SIMD instructions to shift multiple integers at once. This led me to writing this implementation that tries as much as possible to shift long sequences of ints by the same number of bits to speed up decoding. It is indeed faster than the current logic we have, up to about 2x faster for some numbers of bits per value.

      Currently the best implementation I've been able to come up with combines the above idea with the idea that Paul mentioned in his blog that consists of emulating SIMD from Java by packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a bit harder to read but gives another speedup on top of the above implementation.

      I have a JMH benchmark available in case someone would like to play with this and maybe even come up with an even faster implementation. It is 2-2.5x faster than our current implementation for most numbers of bits per value. I'm copying results here:

       * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves as
         a baseline.
       * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the
         current Lucene codec does.
       * `decodeNaiveFromLongs` decodes from longs on the fly.
       * `decodeSimpleSIMD` is a simple implementation that relies on how Java
         recognizes some patterns and uses SIMD instructions.
       * `decodeSIMD` is a more complex implementation that both relies on the C2
         compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or
         2 ints in a long in order to decompress multiple values at once.
      
      Benchmark                                       (bitsPerValue)  (byteOrder)   Mode  Cnt   Score   Error   Units
      PackedIntsDecodeBenchmark.decodeNaiveFromBytes               1           LE  thrpt    5  12.912 ± 0.393  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromBytes               1           BE  thrpt    5  12.862 ± 0.395  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromBytes               2           LE  thrpt    5  13.040 ± 1.162  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromBytes               2           BE  thrpt    5  13.027 ± 0.270  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromBytes               3           LE  thrpt    5  12.409 ± 0.637  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromBytes               3           BE  thrpt    5  12.268 ± 0.947  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromBytes               4           LE  thrpt    5  14.177 ± 2.263  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromBytes               4           BE  thrpt    5  11.457 ± 0.150  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromBytes               5           LE  thrpt    5  10.988 ± 1.179  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromBytes               5           BE  thrpt    5  11.226 ± 0.088  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromBytes               6           LE  thrpt    5   9.791 ± 0.305  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromBytes               6           BE  thrpt    5   9.403 ± 3.598  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromBytes               7           LE  thrpt    5  10.256 ± 0.211  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromBytes               7           BE  thrpt    5  10.314 ± 0.382  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromBytes               8           LE  thrpt    5  16.516 ± 0.380  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromBytes               8           BE  thrpt    5  16.375 ± 0.427  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromBytes               9           LE  thrpt    5   9.067 ± 0.066  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromBytes               9           BE  thrpt    5   9.078 ± 0.178  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromBytes              10           LE  thrpt    5   8.913 ± 0.074  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromBytes              10           BE  thrpt    5   8.893 ± 0.101  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromBytes              11           LE  thrpt    5   7.908 ± 0.118  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromBytes              11           BE  thrpt    5   7.864 ± 0.097  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromBytes              12           LE  thrpt    5   9.220 ± 0.103  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromBytes              12           BE  thrpt    5   9.186 ± 0.241  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromBytes              13           LE  thrpt    5   7.119 ± 0.071  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromBytes              13           BE  thrpt    5   7.066 ± 0.059  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromBytes              14           LE  thrpt    5  12.483 ± 0.171  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromBytes              14           BE  thrpt    5  12.473 ± 0.117  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromBytes              15           LE  thrpt    5   6.202 ± 0.192  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromBytes              15           BE  thrpt    5   6.187 ± 0.399  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromBytes              16           LE  thrpt    5  12.798 ± 0.249  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromBytes              16           BE  thrpt    5  12.987 ± 0.208  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromLongs               1           LE  thrpt    5   7.248 ± 0.096  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromLongs               1           BE  thrpt    5   7.292 ± 0.114  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromLongs               2           LE  thrpt    5   8.923 ± 0.099  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromLongs               2           BE  thrpt    5   8.899 ± 0.028  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromLongs               3           LE  thrpt    5   9.192 ± 0.082  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromLongs               3           BE  thrpt    5   9.090 ± 0.066  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromLongs               4           LE  thrpt    5   7.947 ± 0.039  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromLongs               4           BE  thrpt    5   7.809 ± 0.298  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromLongs               5           LE  thrpt    5   8.342 ± 0.568  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromLongs               5           BE  thrpt    5   8.259 ± 0.572  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromLongs               6           LE  thrpt    5  15.594 ± 0.149  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromLongs               6           BE  thrpt    5  14.012 ± 0.160  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromLongs               7           LE  thrpt    5  12.686 ± 0.271  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromLongs               7           BE  thrpt    5  12.806 ± 0.160  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromLongs               8           LE  thrpt    5  13.571 ± 0.135  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromLongs               8           BE  thrpt    5  13.312 ± 0.110  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromLongs               9           LE  thrpt    5  11.812 ± 0.108  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromLongs               9           BE  thrpt    5  12.874 ± 0.168  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromLongs              10           LE  thrpt    5  12.882 ± 0.114  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromLongs              10           BE  thrpt    5  12.142 ± 0.091  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromLongs              11           LE  thrpt    5  12.302 ± 0.111  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromLongs              11           BE  thrpt    5  10.762 ± 0.250  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromLongs              12           LE  thrpt    5  12.505 ± 0.070  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromLongs              12           BE  thrpt    5  12.149 ± 0.083  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromLongs              13           LE  thrpt    5  11.159 ± 0.341  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromLongs              13           BE  thrpt    5  10.395 ± 0.222  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromLongs              14           LE  thrpt    5  11.004 ± 0.058  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromLongs              14           BE  thrpt    5  10.312 ± 0.369  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromLongs              15           LE  thrpt    5  11.236 ± 0.117  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromLongs              15           BE  thrpt    5   9.792 ± 0.202  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromLongs              16           LE  thrpt    5  10.607 ± 0.105  ops/us
      PackedIntsDecodeBenchmark.decodeNaiveFromLongs              16           BE  thrpt    5  10.340 ± 0.070  ops/us
      PackedIntsDecodeBenchmark.decodeSIMD                         1           LE  thrpt    5  20.925 ± 0.368  ops/us
      PackedIntsDecodeBenchmark.decodeSIMD                         1           BE  thrpt    5  13.396 ± 0.485  ops/us
      PackedIntsDecodeBenchmark.decodeSIMD                         2           LE  thrpt    5  20.628 ± 0.494  ops/us
      PackedIntsDecodeBenchmark.decodeSIMD                         2           BE  thrpt    5  13.584 ± 0.194  ops/us
      PackedIntsDecodeBenchmark.decodeSIMD                         3           LE  thrpt    5  19.932 ± 1.609  ops/us
      PackedIntsDecodeBenchmark.decodeSIMD                         3           BE  thrpt    5  13.296 ± 0.095  ops/us
      PackedIntsDecodeBenchmark.decodeSIMD                         4           LE  thrpt    5  21.065 ± 0.767  ops/us
      PackedIntsDecodeBenchmark.decodeSIMD                         4           BE  thrpt    5  13.557 ± 0.051  ops/us
      PackedIntsDecodeBenchmark.decodeSIMD                         5           LE  thrpt    5  19.630 ± 0.067  ops/us
      PackedIntsDecodeBenchmark.decodeSIMD                         5           BE  thrpt    5  12.916 ± 0.186  ops/us
      PackedIntsDecodeBenchmark.decodeSIMD                         6           LE  thrpt    5  20.253 ± 0.701  ops/us
      PackedIntsDecodeBenchmark.decodeSIMD                         6           BE  thrpt    5  12.820 ± 0.048  ops/us
      PackedIntsDecodeBenchmark.decodeSIMD                         7           LE  thrpt    5  18.944 ± 0.160  ops/us
      PackedIntsDecodeBenchmark.decodeSIMD                         7           BE  thrpt    5  12.562 ± 0.128  ops/us
      PackedIntsDecodeBenchmark.decodeSIMD                         8           LE  thrpt    5  22.778 ± 2.023  ops/us
      PackedIntsDecodeBenchmark.decodeSIMD                         8           BE  thrpt    5  13.658 ± 0.158  ops/us
      PackedIntsDecodeBenchmark.decodeSIMD                         9           LE  thrpt    5  18.527 ± 0.169  ops/us
      PackedIntsDecodeBenchmark.decodeSIMD                         9           BE  thrpt    5  12.045 ± 0.111  ops/us
      PackedIntsDecodeBenchmark.decodeSIMD                        10           LE  thrpt    5  16.610 ± 0.997  ops/us
      PackedIntsDecodeBenchmark.decodeSIMD                        10           BE  thrpt    5  11.208 ± 0.087  ops/us
      PackedIntsDecodeBenchmark.decodeSIMD                        11           LE  thrpt    5  17.961 ± 0.080  ops/us
      PackedIntsDecodeBenchmark.decodeSIMD                        11           BE  thrpt    5  11.594 ± 0.084  ops/us
      PackedIntsDecodeBenchmark.decodeSIMD                        12           LE  thrpt    5  16.980 ± 2.372  ops/us
      PackedIntsDecodeBenchmark.decodeSIMD                        12           BE  thrpt    5  11.135 ± 0.050  ops/us
      PackedIntsDecodeBenchmark.decodeSIMD                        13           LE  thrpt    5  17.592 ± 0.269  ops/us
      PackedIntsDecodeBenchmark.decodeSIMD                        13           BE  thrpt    5  11.132 ± 0.227  ops/us
      PackedIntsDecodeBenchmark.decodeSIMD                        14           LE  thrpt    5  16.964 ± 0.423  ops/us
      PackedIntsDecodeBenchmark.decodeSIMD                        14           BE  thrpt    5  10.953 ± 0.326  ops/us
      PackedIntsDecodeBenchmark.decodeSIMD                        15           LE  thrpt    5  17.972 ± 0.572  ops/us
      PackedIntsDecodeBenchmark.decodeSIMD                        15           BE  thrpt    5  10.872 ± 0.150  ops/us
      PackedIntsDecodeBenchmark.decodeSIMD                        16           LE  thrpt    5  24.152 ± 0.213  ops/us
      PackedIntsDecodeBenchmark.decodeSIMD                        16           BE  thrpt    5  12.984 ± 0.348  ops/us
      PackedIntsDecodeBenchmark.decodeSimpleSIMD                   1           LE  thrpt    5  14.567 ± 0.714  ops/us
      PackedIntsDecodeBenchmark.decodeSimpleSIMD                   1           BE  thrpt    5  10.541 ± 0.079  ops/us
      PackedIntsDecodeBenchmark.decodeSimpleSIMD                   2           LE  thrpt    5  15.395 ± 0.687  ops/us
      PackedIntsDecodeBenchmark.decodeSimpleSIMD                   2           BE  thrpt    5  11.142 ± 0.052  ops/us
      PackedIntsDecodeBenchmark.decodeSimpleSIMD                   3           LE  thrpt    5  15.802 ± 0.623  ops/us
      PackedIntsDecodeBenchmark.decodeSimpleSIMD                   3           BE  thrpt    5  10.656 ± 0.278  ops/us
      PackedIntsDecodeBenchmark.decodeSimpleSIMD                   4           LE  thrpt    5  17.732 ± 0.276  ops/us
      PackedIntsDecodeBenchmark.decodeSimpleSIMD                   4           BE  thrpt    5  11.289 ± 0.209  ops/us
      PackedIntsDecodeBenchmark.decodeSimpleSIMD                   5           LE  thrpt    5  16.230 ± 0.389  ops/us
      PackedIntsDecodeBenchmark.decodeSimpleSIMD                   5           BE  thrpt    5  10.216 ± 0.184  ops/us
      PackedIntsDecodeBenchmark.decodeSimpleSIMD                   6           LE  thrpt    5  16.478 ± 0.682  ops/us
      PackedIntsDecodeBenchmark.decodeSimpleSIMD                   6           BE  thrpt    5  10.379 ± 0.157  ops/us
      PackedIntsDecodeBenchmark.decodeSimpleSIMD                   8           LE  thrpt    5  18.222 ± 0.388  ops/us
      PackedIntsDecodeBenchmark.decodeSimpleSIMD                   8           BE  thrpt    5  11.153 ± 0.619  ops/us
      PackedIntsDecodeBenchmark.decodeSimpleSIMD                  10           LE  thrpt    5  15.138 ± 0.321  ops/us
      PackedIntsDecodeBenchmark.decodeSimpleSIMD                  10           BE  thrpt    5   9.384 ± 0.671  ops/us
      PackedIntsDecodeBenchmark.decodeSimpleSIMD                  16           LE  thrpt    5  20.776 ± 0.397  ops/us
      PackedIntsDecodeBenchmark.decodeSimpleSIMD                  16           BE  thrpt    5  10.199 ± 0.146  ops/us
      PackedIntsDecodeBenchmark.readLongs                          1           LE  thrpt    5  30.220 ± 0.652  ops/us
      PackedIntsDecodeBenchmark.readLongs                          1           BE  thrpt    5  16.324 ± 0.226  ops/us
      PackedIntsDecodeBenchmark.readLongs                          2           LE  thrpt    5  30.952 ± 0.329  ops/us
      PackedIntsDecodeBenchmark.readLongs                          2           BE  thrpt    5  16.492 ± 0.397  ops/us
      PackedIntsDecodeBenchmark.readLongs                          3           LE  thrpt    5  30.156 ± 0.979  ops/us
      PackedIntsDecodeBenchmark.readLongs                          3           BE  thrpt    5  16.273 ± 0.441  ops/us
      PackedIntsDecodeBenchmark.readLongs                          4           LE  thrpt    5  29.925 ± 0.718  ops/us
      PackedIntsDecodeBenchmark.readLongs                          4           BE  thrpt    5  15.930 ± 0.350  ops/us
      PackedIntsDecodeBenchmark.readLongs                          5           LE  thrpt    5  29.773 ± 0.979  ops/us
      PackedIntsDecodeBenchmark.readLongs                          5           BE  thrpt    5  15.775 ± 0.257  ops/us
      PackedIntsDecodeBenchmark.readLongs                          6           LE  thrpt    5  29.591 ± 1.285  ops/us
      PackedIntsDecodeBenchmark.readLongs                          6           BE  thrpt    5  15.732 ± 0.226  ops/us
      PackedIntsDecodeBenchmark.readLongs                          7           LE  thrpt    5  29.708 ± 0.909  ops/us
      PackedIntsDecodeBenchmark.readLongs                          7           BE  thrpt    5  15.433 ± 0.562  ops/us
      PackedIntsDecodeBenchmark.readLongs                          8           LE  thrpt    5  29.828 ± 0.689  ops/us
      PackedIntsDecodeBenchmark.readLongs                          8           BE  thrpt    5  15.390 ± 0.188  ops/us
      PackedIntsDecodeBenchmark.readLongs                          9           LE  thrpt    5  29.127 ± 0.309  ops/us
      PackedIntsDecodeBenchmark.readLongs                          9           BE  thrpt    5  15.180 ± 0.199  ops/us
      PackedIntsDecodeBenchmark.readLongs                         10           LE  thrpt    5  29.085 ± 0.538  ops/us
      PackedIntsDecodeBenchmark.readLongs                         10           BE  thrpt    5  14.887 ± 1.687  ops/us
      PackedIntsDecodeBenchmark.readLongs                         11           LE  thrpt    5  28.904 ± 0.329  ops/us
      PackedIntsDecodeBenchmark.readLongs                         11           BE  thrpt    5  14.936 ± 0.119  ops/us
      PackedIntsDecodeBenchmark.readLongs                         12           LE  thrpt    5  29.025 ± 0.299  ops/us
      PackedIntsDecodeBenchmark.readLongs                         12           BE  thrpt    5  14.685 ± 0.154  ops/us
      PackedIntsDecodeBenchmark.readLongs                         13           LE  thrpt    5  28.963 ± 0.244  ops/us
      PackedIntsDecodeBenchmark.readLongs                         13           BE  thrpt    5  14.569 ± 0.100  ops/us
      PackedIntsDecodeBenchmark.readLongs                         14           LE  thrpt    5  28.584 ± 1.409  ops/us
      PackedIntsDecodeBenchmark.readLongs                         14           BE  thrpt    5  14.340 ± 0.594  ops/us
      PackedIntsDecodeBenchmark.readLongs                         15           LE  thrpt    5  28.744 ± 0.314  ops/us
      PackedIntsDecodeBenchmark.readLongs                         15           BE  thrpt    5  14.222 ± 0.105  ops/us
      PackedIntsDecodeBenchmark.readLongs                         16           LE  thrpt    5  26.638 ± 0.452  ops/us
      PackedIntsDecodeBenchmark.readLongs                         16           BE  thrpt    5  13.906 ± 0.604  ops/us
      

      The thing that is a bit frustrating is that the best throughputs are obtained on a ByteBuffer that is configured to use the little endian byte order (which is the native byte order of my machine) while Java/Lucene default to big endian. So if we want that kind of throughput we'll probably need to add ways to read data in the native byte order in the IndexInput API.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              jpountz Adrien Grand
              Votes:
              0 Vote for this issue
              Watchers:
              16 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 3h 50m
                  3h 50m