Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Implemented
    • Affects Version/s: None
    • Fix Version/s: Bulk Postings branch
    • Component/s: core/index
    • Labels: None
    • Lucene Fields: New

      Description

      Implementation of Patched Frame of Reference.

      1. TestPFor2.java
        4 kB
        Michael McCandless
      2. TestPFor2.java
        5 kB
        Michael McCandless
      3. TestPFor2.java
        8 kB
        Michael McCandless
      4. TermQueryTests.tgz
        352 kB
        Michael McCandless
      5. LUCENE-1410e.patch
        78 kB
        Paul Elschot
      6. LUCENE-1410d.patch
        59 kB
        Paul Elschot
      7. LUCENE-1410-codecs.tar.bz2
        15 kB
        Michael McCandless
      8. LUCENE-1410c.patch
        51 kB
        Paul Elschot
      9. LUCENE-1410b.patch
        64 kB
        Paul Elschot
      10. LUCENE-1410.patch
        251 kB
        Michael McCandless
      11. LUCENE-1410.patch
        287 kB
        Michael McCandless
      12. LUCENE-1410.patch
        41 kB
        hao yan
      13. LUCENE-1410.patch
        37 kB
        Michael McCandless
      14. for-summary.txt
        1 kB
        Renaud Delbru
      15. autogen.tgz
        129 kB
        Michael McCandless

        Activity

        Paul Elschot added a comment -

        20081001: Initial implementation of PFOR.

        The target package is o.a.l.util.pfor, mostly for testing convenience,
        even though the code has no dependencies on Lucene.

        To try it out, please use jvmarg -server during the test; see TestPFor.java for how to do this. The command
        ant -Dtestcase=TestPFor test-core
        should start the test after applying the patch.
        The test will take about 25 seconds to run.

        There is some optimization for decompression included. On my machine
        with a 1.6.0_03-b05 Sun jvm, the decompression performance for 1-7 bits
        frame size varies between about 60M ints/sec unoptimized and 200M ints/sec
        optimized, as reported by the test. This appears adequate for practical use,
        but it should be noted that this performance is from CPU cache to CPU cache.

        The implementation still needs quite a bit of work. I'm posting it now because
        I'd like feedback on the interface for compression and decompression.
        Typical intended usage is present in TestPFor.java.
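        For readers new to the technique, a minimal, self-contained sketch of the
        frame-of-reference-with-patch-exceptions idea may help: each int of a block is
        packed into numFrameBits bits, and values that do not fit are recorded as
        exceptions and patched back in after unpacking. This is only an illustration;
        the class name, method signatures and encoding details below are hypothetical
        and do not match the PFor class in the patch.

        import java.util.List;

        class SimplePForSketch {

          // Packs values[0..values.length) into numFrameBits bits each.  Values that
          // do not fit are stored unpacked in 'exceptions' as {index, value} pairs,
          // and a 0 is written into their packed slot.
          static long[] compress(int[] values, int numFrameBits, List<int[]> exceptions) {
            long[] packed = new long[(values.length * numFrameBits + 63) / 64];
            long maxValue = (1L << numFrameBits) - 1;
            for (int i = 0; i < values.length; i++) {
              int v = values[i];
              if (v < 0 || v > maxValue) {        // does not fit: record a patch exception
                exceptions.add(new int[] {i, v});
                v = 0;
              }
              int bitPos = i * numFrameBits;
              int word = bitPos >>> 6;
              int offset = bitPos & 63;
              packed[word] |= ((long) v) << offset;
              if (offset + numFrameBits > 64) {   // value straddles two words
                packed[word + 1] |= ((long) v) >>> (64 - offset);
              }
            }
            return packed;
          }

          // Inverse of compress: unpack all slots, then patch the exceptions back in.
          static int[] decompress(long[] packed, int count, int numFrameBits, List<int[]> exceptions) {
            int[] out = new int[count];
            long mask = (1L << numFrameBits) - 1;
            for (int i = 0; i < count; i++) {
              int bitPos = i * numFrameBits;
              int word = bitPos >>> 6;
              int offset = bitPos & 63;
              long v = packed[word] >>> offset;
              if (offset + numFrameBits > 64) {
                v |= packed[word + 1] << (64 - offset);
              }
              out[i] = (int) (v & mask);
            }
            for (int[] e : exceptions) {          // the "patch" step
              out[e[0]] = e[1];
            }
            return out;
          }
        }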

        Michael McCandless added a comment -

        Awesome, I'll have a look!

        The TestPFor.java didn't make it into the patch (the patch just references a link).

        Paul Elschot added a comment -

        Sorry about the mistake in the patch; a correction will follow shortly.

        Paul Elschot added a comment -

        A corrected b.patch including TestPFor.java

        Paul Elschot added a comment -

        Another detail about the decompression performance reported above: these are all cases without exceptions, so the expected performance degradation from patching exceptions is not included in the results.
        Patching exceptions is provided for in the code, but that code is not yet optimized.

        Michael McCandless added a comment -

        Paul, I'm eager to test pfor on real Lucene vInts.

        So I created a simple test program (attached TestPFor2.java). Run it
        like this:

          Usage: java org.apache.lucene.util.pfor.TestPFor2 <indexDirName> <vIntFileNameIn> <pForFileNameOut>
          
          Eg: java org.apache.lucene.util.pfor.TestPFor2 /lucene/index _l.prx _l.prx.pfor
        

        where indexDirName is the directory of a Lucene index, vIntFileNameIn
        should be a file that just has a bunch of vInts (Lucene's *.frq and
        *.prx fit this) and pForFileNameOut is a file it writes with blocks
        encoded in PFor.

        It first encodes the vInts from vIntFileNameIn into pfor blocks
        written to pForFileNameOut. Then it measures decode time of reading
        all vInts from vIntFileNameIn vs decode time of reading all pfor
        blocks. It runs each round 5 times.
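        For reference, the vInt baseline being timed here is Lucene's variable-byte
        encoding: 7 data bits per byte, with the high bit as a continuation flag. A
        standalone sketch of that decode step (not the actual TestPFor2 code; the
        DataInput-based signature is just for illustration):

        import java.io.DataInput;
        import java.io.IOException;

        class VIntSketch {
          // Decodes one Lucene-style vInt: 7 bits per byte, high bit set means
          // another byte follows.
          static int readVInt(DataInput in) throws IOException {
            byte b = in.readByte();
            int value = b & 0x7F;
            for (int shift = 7; (b & 0x80) != 0; shift += 7) {
              b = in.readByte();
              value |= (b & 0x7F) << shift;
            }
            return value;
          }
        }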

        The test has certain serious flaws:

        • Can you add a method that figures out the right frame size to use
          for a given block of ints (so that ~90% of the ints are < N bits)?
        • I'm using fixed 6-bit frame size. Can you add bigger bit sizes to
          your pfor decompress?

        With these fixes the test would be more fair to pfor.

        In the PFor file that I write, I simply write an int (# bytes)
        followed by the bytes, for each block. Then when reading these blocks
        I read the #bytes, then read into the byte array backing the intArray
        passed to the PFor for decompress. In a real integration I think
        writing the int #bytes should be unnecessary (is pfor self
        punctuating?).
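        A sketch of this simple length-prefixed block framing, as described above;
        the stream and buffer handling here is just illustrative:

        import java.io.DataInputStream;
        import java.io.DataOutputStream;
        import java.io.IOException;

        class BlockFramingSketch {
          // Writes one compressed block, prefixed by its length in bytes.
          static void writeBlock(DataOutputStream out, byte[] block, int numBytes) throws IOException {
            out.writeInt(numBytes);
            out.write(block, 0, numBytes);
          }

          // Reads one block back into 'buffer' (the byte array backing the int
          // buffer handed to PFor for decompression); returns the block length.
          static int readBlock(DataInputStream in, byte[] buffer) throws IOException {
            int numBytes = in.readInt();
            in.readFully(buffer, 0, numBytes);
            return numBytes;
          }
        }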

        This is inefficient: in doing this for real we should avoid the
        double-copy of the byte[]. In fact, we might push it even lower
        (under IndexInput, eg, create an IntBlockIndexInput) to possibly avoid
        the copy into byte[] in BufferedIndexInput by maybe using direct
        ByteBuffers from the OS. So even once we fix the two issues
        above, the results of a "real" integration should be faster still.

        I ran this on a 622 MB prx file from a Wikipedia index, and even with
        the above 2 limitations it's still a good amount faster:

        java org.apache.lucene.util.pfor.TestPFor2 /lucene/big _l.prx _l.prx.pfor
        
        encode _l.prx to _l.prx.pfor...
        442979072 vints; 888027375 bytes compressed vs orig size 651670377
        
        decompress using pfor:
        4198 msec to decode 442979072 vInts (105521 vInts/msec)
        4332 msec to decode 442979072 vInts (102257 vInts/msec)
        4165 msec to decode 442979072 vInts (106357 vInts/msec)
        4122 msec to decode 442979072 vInts (107467 vInts/msec)
        4061 msec to decode 442979072 vInts (109081 vInts/msec)
        
        decompress using readVInt:
        7315 msec to read 442979104 vInts (60557 vInts/msec)
        7390 msec to read 442979104 vInts (59943 vInts/msec)
        5816 msec to read 442979104 vInts (76165 vInts/msec)
        5937 msec to read 442979104 vInts (74613 vInts/msec)
        5970 msec to read 442979104 vInts (74200 vInts/msec)
        

        It's really weird how the time gets suddenly faster during readVInt.
        It's very repeatable. On another machine I see it get suddenly slower
        starting at the same (3rd) round. Adding -server and -Xbatch doesn't
        change this behavior. This is with (build 1.6.0_10-rc-b28) on Linux
        and (build 1.6.0_05-b13-120) on Mac OS X 10.5.5.

        Paul Elschot added a comment -

        Q: Can you add a method that figures out the right frame size to use
        for a given block of ints (so that ~90% of the ints are < N bits)?
        A: PFor.getNumFrameBits() does this for a given int[] at offset and size.
        Choosing the right size is a dilemma: too small needs too much header decoding,
        and too large may force too large a number of frame bits, i.e. too large a compressed size.

        Q: I'm using fixed 6-bit frame size. Can you add bigger bit sizes to your pfor decompress?
        A: It should work for 1<=numFrameBits<=30.

        Q: is pfor self punctuating?
        A: PFor.bufferByteSize() returns the number of bytes used in the compressed buffer, see also the javadocs.
        For practical use, the start of each compressed block must be known, either from somewhere else,
        or from the size of the previously encoded block.
        The number of compressed integers is encoded in the header, but I'm not sure whether I
        made that info available before decompression to allow allocation of an int[] that is large
        enough to hold the decompressed data.

        >: It's really weird how the time gets suddenly faster during readVInt.
        A: It's probably the JIT kicking in. That's why I preferred to test in three 1-second loops and report
        performance each second. The 2nd second always has better performance.

        It's nice to see that PFor is faster than VInt; I had not tested that yet.
        Which block size did you use for PFor? Never mind, I'll take a look at the code of TestPFor2.

        Btw. after decompressing, the things are ints not vInts.
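        One way such a frame-size heuristic could be implemented is to pick the
        smallest width that covers roughly 90% of the block, leaving the rest as
        exceptions. This is only a sketch and not necessarily how PFor.getNumFrameBits()
        actually decides:

        class FrameBitsSketch {
          // Returns the smallest bit width that can represent at least ~90% of the
          // values in ints[offset..offset+size); the remaining values would become
          // patch exceptions.
          static int getNumFrameBits(int[] ints, int offset, int size) {
            int[] bitCounts = new int[33];   // bitCounts[b] = #values needing exactly b bits
            for (int i = offset; i < offset + size; i++) {
              bitCounts[32 - Integer.numberOfLeadingZeros(ints[i])]++;
            }
            int covered = 0;
            int threshold = (int) (0.9 * size);
            for (int bits = 0; bits <= 32; bits++) {
              covered += bitCounts[bits];
              if (covered >= threshold) {
                return Math.max(bits, 1);    // at least one frame bit
              }
            }
            return 32;
          }
        }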

        Paul Elschot added a comment -

        Q: I'm using fixed 6-bit frame size. Can you add bigger bit sizes to your pfor decompress?
        A: It should work for 1<=numFrameBits<=30.

        That answer was too short. There is some fallback code (decodeAnyFrame) that will decompress
        for any frame of reference. The patch contains unrolled versions of that for up to 7 bits, IIRC.
        I'll add higher numbers of bits later; the code is straightforward and could actually be generated.
        Btw. on my machine the unrolled 9-bit version is actually a bit slower than the loop. I don't know why;
        maybe there are too many buffer references (9) in the loop for the JIT to cope with.

        Michael McCandless added a comment -

        A: PFor.getNumFrameBits() does this for a given int[] at offset and size.

        Super, I missed that – I'll use it.

        Btw. after decompressing, the things are ints not vInts.

        Oh yeah, I'll fix the prints in my test.

        There is some fallback code (decodeAnyFrame) that will decompress for any frame of reference

        Right, I wanted the unrolled version to see the fastest result we can get for pfor, but your next observation is spooky, so maybe this isn't really going to help our test:

        Btw. on my machine the unrolled 9-bit version is actually a bit slower than the loop. I don't know why;
        maybe there are too many buffer references (9) in the loop for the JIT to cope with.

        We should look at the asm that's produced?

        it's probably the JIT jumping in.

        But strangely with -Xbatch I still see the effect. And even stranger, on another machine (Mac OS X) it gets slower when the JIT jumps in. I'm spooked.

        Which block size did you use for PFor?

        I used 128 but I'll try different sizes.

        For practical use, the start of each compressed block must be known, either from somewhere else, or from the size of the previously encoded block.

        But can I also get the "bytes consumed" when I ask decompress() to
        run?

        When we really integrate, things like term infos and skipping data
        will know how to jump to the start of blocks. But for raw sequential
        scanning, if the header is self-punctuating we would not need to store
        the block size between each block.

        Otis Gospodnetic added a comment -

        For people not intimately familiar with PFOR (like me), I found the following to be helpful:
        http://cis.poly.edu/cs912/indexcomp.pdf

        Michael McCandless added a comment -

        There's also this recent comparison of index compression approaches: http://www2008.org/papers/pdf/p387-zhangA.pdf and http://homepages.cwi.nl/~heman/downloads/msthesis.pdf.

        Paul Elschot added a comment -

        Q: We should look at the asm that's produced?
        A: Maybe. But would that be Java bytecodes, or x86, or PowerPC? I'd prefer to wait on that.
        There are some nice machine instructions for this on the various architectures, but that would require using native code.

        Q: But can I also get the "bytes consumed" when I ask decompress() to run?
        A: Yes, but only after decompression or after compression.
        The number of exceptions is not explicitly coded in the header, so the size needed to encode the exceptions is not known beforehand. That could be changed, but then the header will get bigger.
        (For compression, it's possible to do a run without buffer first.)

        Block size 128 should be fine, one could also try 64 and 32.

        Thanks for posting the links here; they are the ones I used to code this up. (Does that count as homework?)
        I did not put the links in the code because of possible link rot. The article titles and authors are in the javadocs.
        Currently the easy way to find the links is via Google Scholar.

        Michael McCandless added a comment -

        New TestPFor2 attached. I changed vInt -> int in the prints and now
        compute "best" bit size per-block using getNumFrameBits() and use that
        per block.

        I took a big frq file and computed the count and percentage for each #bits:

        bits count pct
        1 0 0.0
        2 24328 0.9
        3 94689 3.7
        4 213887 8.4
        5 277510 10.8
        6 284905 11.1
        7 193081 7.5
        8 262857 10.3
        9 260460 10.2
        10 212046 8.3
        11 162872 6.4
        12 109801 4.3
        13 77366 3.0
        14 56620 2.2
        15 34294 1.3
        16 31000 1.2
        17 28803 1.1
        18 21617 0.8
        19 22317 0.9
        20 30326 1.2
        21 162663 6.4

        And for prx:

        bits count pct
        1 23034 0.7
        2 12637 0.4
        3 49286 1.4
        4 56344 1.6
        5 69385 2.0
        6 81964 2.4
        7 108847 3.1
        8 179296 5.2
        9 428787 12.4
        10 873828 25.2
        11 957913 27.7
        12 534426 15.4
        13 81856 2.4
        14 2436 0.1
        15 474 0.0
        16 200 0.0
        17 43 0.0
        18 17 0.0
        19 1 0.0

        It's interesting that prx is so much more tightly clustered than frq.

        New results for decoding vInt vs pfor, for frq file:

        327864576 ints; 431137397 bytes compressed vs orig size 447077047
        
        decompress using pfor:
        2514 msec to decode 327864576 ints (130415 ints/msec)
        2171 msec to decode 327864576 ints (151020 ints/msec)
        2137 msec to decode 327864576 ints (153422 ints/msec)
        2148 msec to decode 327864576 ints (152637 ints/msec)
        2067 msec to decode 327864576 ints (158618 ints/msec)
        
        decompress using readVInt:
        4549 msec to read 327864691 ints (72074 ints/msec)
        4421 msec to read 327864691 ints (74160 ints/msec)
        3240 msec to read 327864691 ints (101192 ints/msec)
        3199 msec to read 327864691 ints (102489 ints/msec)
        3323 msec to read 327864691 ints (98665 ints/msec)
        
        PFor is 54.766% faster
        

        and prx file:

        encode _l.prx to _l.prx.pfor...
        442979072 ints; 628580878 bytes compressed vs orig size 651670377
        decompress using pfor:
        6385 msec to decode 442979072 ints (69378 ints/msec)
        5904 msec to decode 442979072 ints (75030 ints/msec)
        5796 msec to decode 442979072 ints (76428 ints/msec)
        5807 msec to decode 442979072 ints (76283 ints/msec)
        5803 msec to decode 442979072 ints (76336 ints/msec)
        
        decompress using readVInt:
        6893 msec to read 442979104 ints (64265 ints/msec)
        6759 msec to read 442979104 ints (65539 ints/msec)
        5304 msec to read 442979104 ints (83517 ints/msec)
        5275 msec to read 442979104 ints (83977 ints/msec)
        5792 msec to read 442979104 ints (76481 ints/msec)
        
        PFor is 8.989% slower
        

        Some comments:

        • In both cases, the pfor file is a bit smaller than the original
          Lucene file. And it's unfair, because I inject the 4 extra bytes
          per block (which I think won't be needed "for real").
        • For the frq file, pfor is quite a bit faster (54.8%), but for the prx file
          it's slower (9.0%). Maybe it's because the prx file has a much higher
          occurrence of bigger bit sizes (where we don't have an unrolled
          version)? It's odd; maybe I'm doing something wrong. With a
          fixed 6 bits it is quite a bit faster (I retested to verify!),
          which is also weird because there would be many exceptions.
          Exception processing ought to be slower than non-exception
          processing!

        Stepping back a bit: I wonder what percentage of the time in a "typical"
        Lucene search is spent decoding vInts? That would bound how much
        improvement we could ever expect from this optimization during
        searching.

        Michael McCandless added a comment -

        Q: We should look at the asm that's produced?
        A: Maybe. But would that be java bytecodes or x86 or powerpc? I'd prefer to wait with that.
        There are some nice machine instrns to this on the various architectures, but that would require to use native code.

        I was thinking local CPU's native asm, so that we could at least see if the benefits of pfor (no hard-for-cpu-to-predict if statement inside inner loop) "survive" through the Java stack. I'll try to look.

        Q: But can I also get the "bytes consumed" when I ask decompress() to run?
        A: Yes, but only after decompression or after compression.
        The number of exceptions is not explicitly coded in the header, so the size needed to encode the exceptions is not known beforehand. That could be changed, but then the header will get bigger.

        OK so it is self-punctuating, so, we wouldn't need to encode block size in bytes into the file.

        Paul Elschot added a comment -

        The number of bits is not really informative; it would be better to have a distribution of the results of getNumFrameBits(). Then we can see for which numbers of bits loop unrolling might be tried.

        As for decompression speed, please remember that the patching code (that decodes the exceptions into the output) has not yet been optimized at all.

        The Lucene .frq file contains the docids as deltas and the frequencies, but with a special encoding in case the frequency is one. I'd rather try compressing the real delta docids and the real frequencies than the encoded versions.
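        For context, the encoding referred to here looks roughly like this when
        decoded (a sketch only; readVInt is the vInt decoder sketched earlier in
        this thread):

        import java.io.DataInput;
        import java.io.IOException;

        class FrqEntrySketch {
          // Decodes one (docDelta, freq) pair from a .frq-style stream: the doc delta
          // is shifted left by one bit, and the low bit flags the freq == 1 case.
          static int[] readDocAndFreq(DataInput in) throws IOException {
            int docCode = VIntSketch.readVInt(in);
            int docDelta = docCode >>> 1;        // delta to the previous docID
            int freq = ((docCode & 1) != 0)
                ? 1                              // low bit set: frequency is one
                : VIntSketch.readVInt(in);       // otherwise the freq follows as a vInt
            return new int[] {docDelta, freq};
          }
        }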

        The .prx file should be usable as it is, and from the results reported in the articles I would expect a real performance advantage for PFor on that file.
        Michael, could you post a distribution of the number of frame bits for the .prx file? I'd like to know a reasonable maximum for unrolling the corresponding loops.

        >: maybe I'm doing something wrong
        I don't think so; the code is still young. Try running TestPFor and have a look at the output near the end for the case of 6 frame bits. That should show the unrolled decoding speed for the case without exceptions. Then compare that to the cases with lower and higher numbers of frame bits.

        >: Stepping back a bit: I wonder what percentage of the time in a "typical"
        Lucene search is spent decoding vInts? That would bound how much
        improvement we could ever expect from this optimization during
        searching.

        A: There is also the influence of the reduction of data to be fetched (via various caches) from the index. As reported in the articles, PFor strikes a nice optimum between decompression speed and fetching speed from (fast) disks.

        >: I was thinking local CPU's native asm.
        A: I'd try a C version first. IIRC gcc has a decent optimizer for bit ops, but it's been a while since I used C.

        For the record, to show the variation in decompression speeds for different numbers of frame bits without exceptions, here is my current output from TestPFor:

        test901PerfFor1Decompress starting
        Compressed 128 bytes into 8, ratio 0.0625
        test901PerfFor1Decompress 0 numFrameBits 1 decompressed 104857600 in 1021 msecs, 102700 ints/msec, (25 iters).
        test901PerfFor1Decompress 1 numFrameBits 1 decompressed 171966464 in 1020 msecs, 168594 ints/msec, (41 iters).
        test901PerfFor1Decompress 2 numFrameBits 1 decompressed 171966464 in 1012 msecs, 169927 ints/msec, (41 iters).
        
        test902PerfFor2Decompress starting
        Compressed 128 bytes into 12, ratio 0.09375
        test902PerfFor2Decompress 0 numFrameBits 2 decompressed 130023424 in 1017 msecs, 127849 ints/msec, (31 iters).
        test902PerfFor2Decompress 1 numFrameBits 2 decompressed 155189248 in 1000 msecs, 155189 ints/msec, (37 iters).
        test902PerfFor2Decompress 2 numFrameBits 2 decompressed 159383552 in 1022 msecs, 155952 ints/msec, (38 iters).
        
        test903PerfFor3Decompress starting
        Compressed 128 bytes into 16, ratio 0.125
        test903PerfFor3Decompress 0 numFrameBits 3 decompressed 188743680 in 1016 msecs, 185771 ints/msec, (45 iters).
        test903PerfFor3Decompress 1 numFrameBits 3 decompressed 205520896 in 1018 msecs, 201886 ints/msec, (49 iters).
        test903PerfFor3Decompress 2 numFrameBits 3 decompressed 201326592 in 1026 msecs, 196224 ints/msec, (48 iters).
        
        test904PerfFor4Decompress starting
        Compressed 128 bytes into 20, ratio 0.15625
        test904PerfFor4Decompress 0 numFrameBits 4 decompressed 146800640 in 1014 msecs, 144773 ints/msec, (35 iters).
        test904PerfFor4Decompress 1 numFrameBits 4 decompressed 159383552 in 1007 msecs, 158275 ints/msec, (38 iters).
        test904PerfFor4Decompress 2 numFrameBits 4 decompressed 159383552 in 1011 msecs, 157649 ints/msec, (38 iters).
        
        test905PerfFor5Decompress starting
        Compressed 128 bytes into 24, ratio 0.1875
        test905PerfFor5Decompress 0 numFrameBits 5 decompressed 159383552 in 1010 msecs, 157805 ints/msec, (38 iters).
        test905PerfFor5Decompress 1 numFrameBits 5 decompressed 176160768 in 1009 msecs, 174589 ints/msec, (42 iters).
        test905PerfFor5Decompress 2 numFrameBits 5 decompressed 176160768 in 1004 msecs, 175458 ints/msec, (42 iters).
        
        test906PerfFor6Decompress starting
        Compressed 128 bytes into 28, ratio 0.21875
        test906PerfFor6Decompress 0 numFrameBits 6 decompressed 117440512 in 1001 msecs, 117323 ints/msec, (28 iters).
        test906PerfFor6Decompress 1 numFrameBits 6 decompressed 125829120 in 1002 msecs, 125577 ints/msec, (30 iters).
        test906PerfFor6Decompress 2 numFrameBits 6 decompressed 125829120 in 1004 msecs, 125327 ints/msec, (30 iters).
        
        test907PerfFor7Decompress starting
        Compressed 128 bytes into 32, ratio 0.25
        test907PerfFor7Decompress 0 numFrameBits 7 decompressed 125829120 in 1016 msecs, 123847 ints/msec, (30 iters).
        test907PerfFor7Decompress 1 numFrameBits 7 decompressed 150994944 in 1015 msecs, 148763 ints/msec, (36 iters).
        test907PerfFor7Decompress 2 numFrameBits 7 decompressed 150994944 in 1012 msecs, 149204 ints/msec, (36 iters).
        
        test908PerfFor8Decompress starting
        Compressed 124 bytes into 35, ratio 0.28225806
        test908PerfFor8Decompress 0 numFrameBits 8 decompressed 85327872 in 1015 msecs, 84066 ints/msec, (21 iters).
        test908PerfFor8Decompress 1 numFrameBits 8 decompressed 121896960 in 1021 msecs, 119389 ints/msec, (30 iters).
        test908PerfFor8Decompress 2 numFrameBits 8 decompressed 121896960 in 1022 msecs, 119272 ints/msec, (30 iters).
        
        test909PerfFor9Decompress starting
        Compressed 124 bytes into 39, ratio 0.31451613
        test909PerfFor9Decompress 0 numFrameBits 9 decompressed 52822016 in 1016 msecs, 51990 ints/msec, (13 iters).
        test909PerfFor9Decompress 1 numFrameBits 9 decompressed 56885248 in 1028 msecs, 55335 ints/msec, (14 iters).
        test909PerfFor9Decompress 2 numFrameBits 9 decompressed 56885248 in 1029 msecs, 55282 ints/msec, (14 iters).
        

        The loop for 9 bits is not unrolled; unrolling it makes it a tiny bit (55->54 Mints/sec) slower.

        Paul Elschot added a comment -

        I wrote: The number of bits is not really informative, it would be better to have a distribution of the result of getNumFrameBits(). Then we can see for which nrs of bits loop unrolling might be tried.

        But I misread what was meant by the number of bits; the number of frame bits that I requested is actually already listed there. So I'll have a go at unrolling for larger numbers of frame bits, especially for the .prx data.

        Michael McCandless added a comment -

        As for decompression speed, please remember that the patching code (that decodes the exceptions into the output) has not yet been optimized at all.

        But this is what I find so weird: for prx, if I fix the encoding at 6 bits, which generates a zillion exceptions, we are ~31% faster than decoding vInts, and the pfor file size is much bigger (847 MB vs 621 MB) than vInt. But if instead I use the bit size as returned by getNumFrameBits(), which has far fewer exceptions, we are 9.0% slower and the file size is a bit smaller than vInt. Either exception processing is very fast, or it's the non-unrolled path (ForDecompress.decodeAnyFrame) that's slow, which could very well be.

        The lucene .frq file contains the docids as deltas and the frequencies but with a special encoding in case the frequency is one. I'd rather try compressing the real delta docids and the real frequencies than the encoded versions.

        I'll try that. I bet if we had two separate streams (one for the docIDs and one for the corresponding freqs) we'd get even better pFor compression. If we did that "for real" in Lucene that'd also make it fast to use a normally-indexed field for pure boolean searching (you wouldn't have to index 2 different fields as you do today to gain the performance at search time). In fact, for AND queries on a normal Lucene index this should also result in faster searching since when searching for the common docIDs you at first don't need the freq data.

        Marvin has been doing some performance testing recently and seems to have concluded that keeping prx and frq as separate files (like Lucene does today but KinoSearch does not) gives better performance. Pushing that same trend further, I think it may very well make sense to separate docID and frq as well.

        A: there is also the influence of the reduction of data to be fetched (via various caches) from the index. As reported in the articles, PFor strikes a nice optimum between decompression speed and fetching speed from (fast) disks.

        True, but we are seeing just a bit of compression vs Lucene's current encoding. I think splitting out frq from docID may show better compression.

        >: I was thinking local CPU's native asm.
        A: I'd try a C version first. Iirc gcc has a decent optimizer for bit ops, but it's been a while for me that I used C.

        Well... that would just make me depressed (if from Javaland the CPU
        level benefits of pFor don't really "survive", but from C-land they
        do). But yes, I agree.

        For the record, to show the variation in decompression speeds for different numbers of frame bits without exceptions, here is my current output from TestPFor:

        I have similar results (up to 7 bits – can you post your new TestPFor.java?).

        The even-byte sizes (16, 24) have very sizable leads over the others. The 9-bit size is fantastically slow; it's insane that unrolling it isn't helping. Seems like we will need to peek at asm to understand what's going on at the "bare metal" level....

        Paul Elschot added a comment -

        I'm working on the 10-bit case right now, but I'm hitting another bug; it will take a while.

        The performance of TestPFor2 should be better after running TestPFor for all bit sizes; this could be a good reason to combine them.

        The correctness tests in TestPFor are not really exhaustive, so I'd like to have a simple test for correctness in TestPFor2, for example a running (long) sum of all decompressed ints that should be equal before and after.

        To avoid having to peek at the asm level, there is also the possibility to use a slightly higher number of frame bits when we know one number of bits will be slow to decompress.
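        The running-sum correctness check suggested above could be as simple as this
        (a sketch only; names are placeholders):

        class ChecksumSketch {
          // Running (long) sum over a batch of decoded ints; the final value must be
          // identical for the vInt path and the pfor path on the same input.
          static long checksum(int[] decoded, int count, long sumSoFar) {
            for (int i = 0; i < count; i++) {
              sumSoFar += decoded[i];
            }
            return sumSoFar;
          }
        }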

        Michael McCandless added a comment -

        Attached new TestPFor2.java:

        • If you run it on _x.frq, it will split that into _x.frq.frq and
          _x.frq.doc so we can separately test frq vs docs
        • Added checksum.
        • Added zeroing of reused IntBuffer before calling PFor.compress
          (seems to be necessary? Else I trip an assert inside PFor).

        Paul, indeed I get a different checksum for vint vs pfor decoding.

        I think the bug is somewhere in pfor, probably in the exception
        logic, because the difference I see is that pfor suddenly returns 0 when it
        should have returned a large int relative to the other ints nearby.

        Maybe this is why exception processing looked so much faster.

        I'll hold off posting more perf results until we can resolve that.

        To see the checksum run it with asserts, eg like this:

        java -ea oal.util.pfor.TestPFor2 /path/to/index _x.prx

        It then prints out SUM lines after each iteration.

        If you set DEBUG = true, it'll print the first 1000 values and then
        search for "v=0".

        Michael McCandless added a comment -

        (Attached autogen.tgz).

        I started peeking at the ASM generated by different tweaks to the
        bit decoder methods.

        It's rather depressing: small things, like using relative not absolute
        getInt from ByteBuffer, declaring things final, reading into array
        first instead of getInt per int, make sizable differences in the
        resulting asm and decode speed.

        To try out these various changes to the Java sources, I created a
        Python script to generate the 31 different decode methods. I then
        made a new standalone perf test that reads in a prx file as a series
        of packed N-bit ints, and reports the best of 5 runs.

        These results are NOT testing the full pfor – just different possible
        methods for doing the N-bit int unpacking part of pfor. Paul, I
        haven't integrated these into the pfor code, but I think we probably
        eventually should.

        Finally, I switched the autogen to C++/gcc to see how much faster
        simple C code is.

        In both the Java and C tests, by far the fastest way was to mmap the
        file and read ints from it, sequentially (relative gets) using
        ByteBuffer, so all 3 take that approach.

        (To run these, extract autogen.tgz, then open *.py and see the comment
        at the top; you'll have to edit the sources to fix the hardwired path
        to the prx file).

        Java 1 code calls getInt() one at a time, like this:

          static void decode4(final ByteBuffer in, final int[] out) {
            final int i1 = in.getInt();
            out[0] = i1 & 15;
            out[1] = (i1 >>> 4) & 15;
            out[2] = (i1 >>> 8) & 15;
            out[3] = (i1 >>> 12) & 15;
            out[4] = (i1 >>> 16) & 15;
            out[5] = (i1 >>> 20) & 15;
            out[6] = (i1 >>> 24) & 15;
            out[7] = (i1 >>> 28) & 15;
            final int i2 = in.getInt();
            out[8] = i2 & 15;
            out[9] = (i2 >>> 4) & 15;
            out[10] = (i2 >>> 8) & 15;
            ...
          }
        

        Java 2 code gets all N ints up front, like this:

          static void decode4(IntBuffer in, int[] inbuf, int[] out) {
            in.get(inbuf, 0, 16);
            out[0] = inbuf[0] & 15;
            out[1] = (inbuf[0] >>> 4) & 15;
            out[2] = (inbuf[0] >>> 8) & 15;
            out[3] = (inbuf[0] >>> 12) & 15;
            out[4] = (inbuf[0] >>> 16) & 15;
            out[5] = (inbuf[0] >>> 20) & 15;
            out[6] = (inbuf[0] >>> 24) & 15;
            out[7] = (inbuf[0] >>> 28) & 15;
            out[8] = inbuf[1] & 15;
            out[9] = (inbuf[1] >>> 4) & 15;
            out[10] = (inbuf[1] >>> 8) & 15;
            ...
          }
        

        C++ code is analogous to Java 1 (data is mmap'd):

        static bool decode4(int* out) {
          int i1 = *(data++);
          *(out++) = i1 & 15;
          *(out++) = (i1 >> 4) & 15;
          *(out++) = (i1 >> 8) & 15;
          *(out++) = (i1 >> 12) & 15;
          *(out++) = (i1 >> 16) & 15;
          *(out++) = (i1 >> 20) & 15;
          *(out++) = (i1 >> 24) & 15;
          *(out++) = (i1 >> 28) & 15;
          int i2 = *(data++);
          *(out++) = i2 & 15;
          ...
        }
        

        Here's the performance for each bit size:

        bits Java 1 (kints/msec) Java 2 (kints/msec) C++ (kints/msec) C advantage
        1 916.6 648.5 1445.3 1.6x
        2 793.8 593.4 1118.3 1.4x
        3 616.7 541.9 1068.8 1.7x
        4 656.6 512.1 1161.6 1.8x
        5 499.8 469.0 897.3 1.8x
        6 410.6 444.9 899.5 2.0x
        7 367.4 409.0 801.7 2.0x
        8 414.0 386.7 816.6 2.0x
        9 306.3 366.9 710.8 1.9x
        10 278.8 341.9 665.8 1.9x
        11 258.1 307.8 623.6 2.0x
        12 245.9 311.7 592.7 1.9x
        13 223.9 285.0 574.5 2.0x
        14 204.2 271.8 538.1 2.0x
        15 190.3 260.0 522.6 2.0x
        16 229.9 250.4 519.7 2.1x
        17 190.8 223.7 488.3 2.2x
        18 171.6 198.1 461.2 2.3x
        19 158.3 180.5 436.2 2.4x
        20 163.1 207.4 416.4 2.0x
        21 156.3 166.3 403.0 2.4x
        22 147.0 163.5 387.8 2.4x
        23 141.6 174.1 357.5 2.1x
        24 141.9 162.0 362.6 2.2x
        25 133.2 177.6 335.3 1.9x
        26 125.8 153.5 334.7 2.2x
        27 121.6 139.6 314.0 2.2x
        28 122.7 130.1 316.7 2.4x
        29 107.3 123.9 296.7 2.4x
        30 111.0 127.6 300.7 2.4x
        31 108.1 94.0 290.5 2.7x

        C code is between 1.4-2.7X faster than the best Java run. Reading
        one int at a time is faster when the number of bits is small (<= 5), but
        reading all N ints up front is generally faster for larger bit sizes.

        Paul Elschot added a comment - - edited

        I can't see the .tgz attachment at the moment.

        Very nice table, this java/c++ comparison. Meanwhile I also have a java program generator for decompression, and I got very similar results.

        I found a bug for the 17 bits case in decodeAnyFrame(), this might explain the sum test failure that you saw.

        One detail on the generated code: the final mask can be omitted, for example the last bit extraction for the 5 bit case:

         (i4 >>> 27)
        

        On the java side leaving this last mask out did not make a difference, but it might be noticeable in c++.
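        Spelled out for the 5-bit case (reusing the variable name i4 from the snippet above; the wrapper method is purely illustrative): the last value occupies the topmost 5 bits of its word, so the shift already discards everything else and the trailing mask is redundant.

          // last extraction of the 5-bit frame; both forms return the same value
          static int lastValue5Bits(final int i4) {
            // return (i4 >>> 27) & 31;  // mask keeps only bits 27..31, which is all that is left
            return i4 >>> 27;            // one instruction less per frame
          }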

        Michael McCandless added a comment -

        Woops I forgot the attachment...

        Michael McCandless added a comment -

        One detail on the generated code: the final mask can be omitted, for example the last bit extraction for the 5 bit case:

        Ahh good catch. I'll fix in my autogen.

        Meanwhile I also have a java program generator for decompression, and I got very similar results.

        Excellent! Did you also switch to relative getInts? This way, I could change my TestPFor2 to rely on self-punctuation when reading. And, I can pass down a ByteBuffer derived directly from the file, instead of copying bytes into byte[] first. This would make the test more fair to pfor.

        I found a bug for the 17 bits case in decodeAnyFrame(), this might explain the sum test failure that you saw.

        OK I can re-test when you post this, to see if the checksums match.

        Paul Elschot added a comment -

        Did you also switch to relative getInts?

        No. I thought about that, but I wanted to make it work correctly first.

        This way, I could change my TestPFor2 to rely on self-punctuation when reading. And, I can pass down a ByteBuffer derived directly from the file, instead of copying bytes into byte[] first.

        The only concern I have there is that when using a byte[] directly from the file, getting the int values may result in non-4-byte-aligned fetches. Can current hardware do this well?

        It's tempting to move to a full C implementation directly now. Should we do that?
        A similar move was made in the past by letting gcc deal with vInts, but meanwhile the jvms caught up there.

        Michael McCandless added a comment -

        The only concern I have there is that when using a byte[] directly from the file getting the int values may result in non 4 byte aligned fetches. Can current hardware do this well?

        Good question, though if that's the case we could presumably work
        around it by ensuring the header of the file is 0 mod 4?

        Or... is this because a full block (header + bits + exception data)
        may not be 0 mod 4? (Though we could pad there too, if necessary)

        I think the API we want to reach (eventually) is an IntBlockInput in
        Directory where you call read(int[]) and it returns the next 128 (say)
        ints in an array and moves itself to the end of that block (start of
        the next block).
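        Just to make the shape of that API concrete, a bare sketch (the name IntBlockInput and the read(int[]) call come from the comment above; the other method signatures are assumptions, not a firm proposal):

          public interface IntBlockInput {
            /** Fills buffer with the next block of decoded ints (e.g. 128 of them)
             *  and leaves this input positioned at the start of the following block.
             *  Returns the number of ints actually read. */
            int read(int[] buffer) throws java.io.IOException;

            /** Positions this input at the start of the block at the given file pointer. */
            void seek(long blockFilePointer) throws java.io.IOException;

            void close() throws java.io.IOException;
          }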

        It's tempting to move to a full C implementation directly now. Should we do that?
        A similar move was made in the past by letting gcc deal with vInts, but meanwhile the jvms caught up there.

        Maybe we should explore this eventually, but I think for now we should
        first try to get the Java version online?

        I'd really love to get a prototype integration working so that we can
        then do an end-to-end performance test (ie, actual searches). I'm still
        wondering how much speedups at this layer will actually affect overall
        search time.

        Paul Elschot added a comment - - edited

        Or... is this because a full block (header + bits + exception data) may not be 0 mod 4? (Though we too could pad if necessary)

        The exceptions can be 1, 2 or 4 bytes, and there is one more coding possible; we might reserve that for long.
        I'd like to use some padding space to get the exceptions aligned to their natural border: 2-byte exceptions at an even byte position, and 4-byte exceptions at a (... mod 4 == 0) natural int border. That way the rest of the padding space (if any) could be used to align to a natural int border.
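        For illustration, the alignment rule described above could be computed like this (a sketch only, not code from any patch; it assumes the exception size is a power of two):

          static int alignOffset(int byteOffset, int exceptionSize) {
            // exceptionSize is 1, 2 or 4; e.g. alignOffset(5, 4) == 8
            return (byteOffset + exceptionSize - 1) & ~(exceptionSize - 1);
          }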

        I'd really love to get a prototype integration working so that we can then do an end-to-end performance test (ie, actual searches). I'm still wondering how much speedups at this layer will actually affect overall search time.

        The decoding speeds reported here so far are about the same as the ones reported by Zhang in 2008. That means there is room for disk speeds to catch up with CPU speed. In other words, adding this for proximities (and maybe docids and freqs) should make Lucene a natural fit for SSDs; see also Zhang, Fig. 15.

        Michael McCandless added a comment -

        That means there is quite a bit of room for disk speeds to catch up with CPU speed.

        Exactly!

        Search optimization is tricky because for an index that can't fit
        entirely in the OS's IO cache, reducing the CPU cost of searching
        (which we are doing here) is basically useless: the end user won't see
        much benefit.

        For an index entirely in the IO cache, I think these optimizations
        might make a big difference.

        In some sense, we have been allowed to hide behind the slow
        performance of magnetic hard drives and not worry much about reducing
        the CPU cost of searching.

        However: relatively soon most computers will use SSDs, and then
        suddenly it's as if every index is in the IO cache (albeit a somewhat
        slower one, but still far faster than magnetic media). So now is
        the time for us to reduce the CPU cost of searching for Lucene.

        And this means for this issue and other sources of optimizing search
        performance, we should largely focus only on indices entirely cached
        in the IO cache.

        Paul Elschot added a comment -

        The 1410c patch is much like the 1410b. It adds some bug fixes and code cleanups and it includes a generator for java decompression classes. After applying the patch, run the generator once in src/java/org/apache/lucene/util/pfor :

        python gendecompress.py

        After that, in the trunk directory:

        ant -Dtestcase=TestPFor test-core

        should finish successfully in just over a minute, as it also includes decompression performance tests.

        To be done:

        • possibly split the performance tests from the functional tests, this might require adding an ant target for performance tests,
        • possibly add byte padding to move exceptions to natural position,
        • add performance tests for decompression with exception patching,
        • switch decompression to relative buffer addressing (get() instead of get(index)),
        • optimize the exception decoding (i.e. patching) code,
        • optimize the compression code,
        • generate/correct the javadocs.
        Michael McCandless added a comment -

        Paul, in decompress I added "inputSize = -1" at the top, so that the header is re-read. I need this so I can re-use a single PFor instance during decompress.

        Paul Elschot added a comment -

        Did you also move to relative addressing in the buffer?

        Another question: I suppose the place to add this initially would be in IndexOutput and IndexInput?
        In that case it would make sense to reserve (some bits of) the first byte in the compressed buffer
        for the compression method, and use these bits there to call PFor or another (de)compression method.

        Michael McCandless added a comment -

        Another thing that bit me was the bufferByteSize(): if this returns
        something that's not 0 mod 4, you must increase it to the next
        multiple of 4 otherwise you will lose data since ByteBuffer is big
        endian by default. We should test little endian to see if performance
        changes (on different CPUs).
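        A minimal sketch of that rounding, assuming the padded size is what gets allocated and read (the method name is illustrative):

          static int paddedByteSize(int bufferByteSize) {
            return (bufferByteSize + 3) & ~3;   // e.g. 13 -> 16, 16 -> 16
          }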

        Did you also move to relative addressing in the buffer?

        No I haven't done that, but I think we should. I believe it's faster. I'm trying now to get a rudimentary test working for TermQuery using pfor.

        Another question: I suppose the place to add this initially would be in IndexOutput and IndexInput?
        In that case it would make sense to reserve (some bits of) the first byte in the compressed buffer
        for the compression method, and use these bits there to call PFor or another (de)compression method.

        This gets into flexible indexing...

        Ideally we do this in a pluggable way, so that PFor is just one such
        plugin, simple vInts is another, etc.

        I could see a compression layer living "above" IndexInput/Output,
        since logically how you encode an int block into bytes is independent
        from the means of storage.

        But: such an abstraction may hurt performance too much since during
        read it would entail an extra buffer copy. So maybe we should just
        add methods to IndexInput/Output, or, make a new
        IntBlockInput/Output.

        Also, some things you now store in the header of each block should
        presumably move to the start of the file instead (eg the compression
        method), or if we move to a separate "schema" file that can record
        which compressor was used per file, we'd put this there.

        So I'm not yet exactly sure how we should tie this in "for real"...
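        As a loose sketch of the idea of recording the compression method once at the start of the file (or in a schema file) rather than per block: the enum values and class below are hypothetical placeholders, not part of any patch on this issue.

          import java.io.DataInput;
          import java.io.DataOutput;
          import java.io.IOException;

          enum IntBlockCodecId { VINT, PFOR, FRAME_OF_REF }

          final class CodecHeader {
            static void write(DataOutput out, IntBlockCodecId codec) throws IOException {
              out.writeByte((byte) codec.ordinal());    // one id byte at the start of the file
            }

            static IntBlockCodecId read(DataInput in) throws IOException {
              int id = in.readByte() & 0xFF;
              IntBlockCodecId[] ids = IntBlockCodecId.values();
              if (id >= ids.length) {
                throw new IOException("unknown codec id: " + id);
              }
              return ids[id];                           // caller dispatches on this
            }
          }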

        Paul Elschot added a comment -

        I'm trying now to get a rudimentary test working for TermQuery using pfor.

        It should really make a difference for stop words and disjunction queries depending on DocIdSetIterator.next().

        Conjunctions that depend on skipTo(docNum) will probably make it necessary to impose an upper bound on the size of the compressed arrays. This upper bound could be the same as the skip distance in the index.

        I'm wondering whether it would make sense to add skip info to the term positions of very large documents. Any ideas on that?

        Paul Elschot added a comment -

        The strange behaviour at 9 bits I reported earlier was due to a mistake in the test data. It contained only 31 integers (124 bytes, see posted listing), so the unrolled loop would never be called.

        And another point to be done:

        • change the decompression code generator to use an array get() from the buffer for 6 or more frame bits (see the Java 1/2 performance numbers listed above).
        Michael McCandless added a comment -

        It should really make a difference for stop words and disjunction queries depending on DocIdSetIterator.next().

        Yes.

        Conjunctions that depend on skipTo(docNum) will probably make it necessary to impose an upper bound on the size of the compressed arrays.

        Yes. Though, I think when optimizing search performance we should
        focus entirely on the high-latency queries. TermQuery on very
        frequent terms, disjunctions queries involving common terms,
        phrase/span queries that have many matches, etc.

        EG if PFOR speeds up high-latency queries say by 20% (say 10 sec -> 8
        sec), but causes queries that are already fast (say 30 msec) to get a
        bit slower (say 40 msec) I think that's fine. It's the high-latency
        queries that kill us because those ones limit how large a collection
        you can put on one box before you're forced to shard your index.

        At some point we should make use of concurrency when iterating over
        large result sets. EG if the estimated # total hits is > X docs, use
        multiple threads where each thread skips to its own "chunk" and
        iterates over it, and then merge the results. Then we should be able
        to cut down on the max-latency query and handle more documents on a
        single machine. Computers are very quickly becoming very concurrent.
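        A rough sketch of that chunked collection idea; PostingsCursor, its factory and everything else here are hypothetical stand-ins rather than Lucene APIs, and real code would merge scored priority queues instead of raw doc lists:

          import java.util.ArrayList;
          import java.util.List;
          import java.util.concurrent.Callable;
          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.Executors;
          import java.util.concurrent.Future;

          interface PostingsCursor {
            int skipTo(int docId);   // first doc >= docId, or Integer.MAX_VALUE when exhausted
            int next();              // next doc, or Integer.MAX_VALUE when exhausted
          }

          interface PostingsCursorFactory {
            PostingsCursor open();   // one independent cursor per thread
          }

          static List<Integer> collectConcurrently(final PostingsCursorFactory factory,
                                                   final int maxDoc, int numThreads) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(numThreads);
            final int chunk = (maxDoc + numThreads - 1) / numThreads;
            List<Future<List<Integer>>> futures = new ArrayList<Future<List<Integer>>>();
            for (int t = 0; t < numThreads; t++) {
              final int start = t * chunk;
              final int end = Math.min(maxDoc, start + chunk);
              futures.add(pool.submit(new Callable<List<Integer>>() {
                public List<Integer> call() {
                  List<Integer> hits = new ArrayList<Integer>();
                  PostingsCursor cursor = factory.open();
                  for (int doc = cursor.skipTo(start); doc < end; doc = cursor.next()) {
                    hits.add(doc);                     // or score/collect into a per-chunk queue
                  }
                  return hits;
                }
              }));
            }
            List<Integer> merged = new ArrayList<Integer>();
            for (Future<List<Integer>> f : futures) {
              merged.addAll(f.get());                  // merge the per-chunk results
            }
            pool.shutdown();
            return merged;
          }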

        I'm wondering whether it would make sense to add skip info to the term positions of very large documents. Any ideas on that?

        Probably we should – yet another issue

        Michael McCandless added a comment -

        In order to understand how time is spent overall during searching, I
        took TermQuery and reimplemented it in several different ways.

        While each implementation is correct (checksum of final top N docs
        matches), they are very much prototypes and nowhere near committable.

        I then took the 100 most frequent terms (total 23.3 million hits) from
        my Wikipedia index and ran them each in sequence. Each result is best
        of 25 runs (all results are from the OS's IO cache):

        Test Time (msec) Hits/msec Speedup
        Baseline 674 34496 1.00X
        + Code Speedups 591 39340 1.14X
        + Code Speedups + PFOR 295 78814 2.28X
        + Code Speedups + BITS 247 94130 2.73X
        + Code Speedups + BITS (native) 230 101088 2.93X

        Here's what the test names mean:

        • Baseline is the normal TermQuery, searching with TopDocsCollector
          for top 10 docs.
        • Code Speedups means some basic optimizations, eg made my own
          specialized priority queue, unrolled loops, etc.
        • PFOR means switching to PFOR for storing docs & freqs as separate
          streams. Each term's posting starts a new PFOR block.
        • BITS just means using packed n-bit ints for each block (ie, it has
          no exceptions, so it sets N so that all ints will fit). The
          resulting frq file was 18% bigger and doc file was 10% bigger –
          but this is just for the 100 most frequent terms.
        • BITS (native) is BITS but running as JNI (C++) code.

        Next, I tried running the same things above, but I turned off
        collection of hits. So this really just tests decode time:

        Test Time (msec) Hits/msec Speedup
        + Code Speedups 384 60547 1.76X
        + Code Speedups + PFOR 91 255497 7.41X
        + Code Speedups + BITS 49 474496 13.76X
        + Code Speedups + BITS (native) 32 726572 21.06X

        Some observations:

        • PFOR really does speed up TermQuery overall, so, I think we should
          pursue it and get it committed.
        • BITS is a good speedup beyond PFOR, but we haven't optimized PFOR
          yet. Also, BITS would be very seekable. We could also use PFOR
          but increase the bit size of blocks, to get the same thing.
        • Once we swap in PFOR and/or BITS or other interesting int block
          compression, accumulating hits becomes the slowest part of
          TermQuery.
        • Native BITS decoding code is quite a bit faster for decoding, but,
          probably not worth pursuing for now since with PFOR decode time
          becomes a small part of the overall time.
        • TermQuery is the simplest query; other queries will spend more
          CPU time coordinating, or time handling positions, so we can't
          yet conclude how much of an impact PFOR will have on them.
        Paul Elschot added a comment -

        So there's roughly a factor of 2 between PFOR and BITS (which is actually just FOR, i.e. PFOR without patches/exceptions). Meanwhile I've started working on speeding up the patching, combined with always using a multiple of 4 bytes. This also involves changing the layout of the exceptions slightly, but it should allow faster patching and avoid the byte-ordering problems mentioned before. It will take another while to finish.

        Paul Elschot added a comment - - edited

        1410d patch: as 1410c, with the following further changes:

        • moved exceptions to their natural borders,
        • unrolled exception patching code,
        • some more testcases for exceptions,
        • a specialized version for "decompressing" with 32 frame bits,
        • all other frame decompression is left to generated code (not contained in the patch),
        • compressed and decompressed sizes are now reported in numbers of integers,
        • also contains an adapted version of TestPFor2.

        This should diminish the difference in performance between PFOR and BITS.

        Paul Elschot added a comment -

        I've been at this quite irregularly. I'm trying to give the PFor class a more OO interface and to get exception patching working at more decent speeds. In case someone else wants to move this forward faster than it is moving now, please holler.

        After rereading this, and also after reading up a bit on MonetDB performance improvement techniques, I have a few more rants:

        Taking another look at the decompression performance figures, and especially the differences between native C++ and java, it could become worthwhile to also implement TermQuery in native code.

        With the high decompression speeds of FOR/BITS at lower numbers of frame bits it might also become worthwhile to compress character data, for example numbers with a low number of different characters.
        Adding a dictionary as in PDICT might help compression even further.
        This was probably one of the reasons for the column storage discussed earlier, I'm now sorry I ignored that discussion.
        In the index itself, column storage is also useful. One example is the splitting of document numbers and frequency into separate streams, another example is various offsets for seeking in the index.

        I think it would be worthwhile to add a compressed integer array to the basic types used in IndexInput and IndexOutput. I'm still struggling with the addition of skip info into a tree of such compressed integer arrays (skip offsets
        don't seem to fit naturally into a column, and I don't know whether the skip size should be the same as the decompressed array size).
        Placement of such compressed arrays in the index should also be aware of CPU cache lines and of VM page (disk block) boundaries.
        In higher levels of a tree of such compressed arrays, frame exceptions would be best avoided to allow direct addressing, but the leafs could use frame exceptions for better compression.

        For terms that will occur at most once in one document more compression is possible, so it might be worthwhile to add these as a key. At the moment I have no idea how to enforce the restriction of at most once though.

        Paul Elschot added a comment -

        I'd like to split off the performance test cases from the functional test cases for this, but there does not seem to be a nice way to do that using the current test methods via ant.

        I'm thinking of using a special class name prefix like PerfTest... (or maybe something shorter like Perf...) and adding a test target for that in the build scripts.

        Are there other places where splitting performance tests and functional tests could be useful?
        If so, what would be a good way to implement it?

        Paul Elschot added a comment -

        The 1410e patch splits a FrameOfRef class off from the PFor class as its superclass. Test cases are split similarly; all added test cases pass.
        This should make it easier to choose between the patching version PFor and the non-patching version FrameOfRef.

        There are lots of small changes compared to the 1410d patch, mostly in the o.a.l.util.pfor package, and a few in the util package.
        Some javadocs are added, but javadoc generation is not yet tested.

        After applying the patch, run gendecompress.py as for the 1410b patch to generate the java sources for decompression. Then:
        ant -Dtestcase=TestFrameOfRef test-core
        and
        ant -Dtestcase=TestPFor test-core
        should pass.

        The following (at least) is still to be done:

        • performance tests for patching decompression,
        • relative compressed data buffer addressing,
        • generate code for compression in the same way as for decompression,
        • a few fixme's and checkme's in the code,
        • javadoc generation.
        Eks Dev added a comment -

        It looks like Google went there as well (Block encoding),

        see: Blog http://blogs.sun.com/searchguy/entry/google_s_postings_format
        http://research.google.com/people/jeff/WSDM09-keynote.pdf (Slides 47-63)

        Paul Elschot added a comment -

        The encoding in the google research slides is another one.
        They use 2 bits prefixing the first byte to indicate the number of bytes used for the encoded number (1-4), and then they group 4 of those prefixes together to get a single byte of 4 prefixes followed by the non-prefixed bytes of the 4 encoded numbers.
        This requires a 256-way switch (indexed jump) for every 4 encoded numbers, and I would expect that jump to limit performance somewhat when compared to pfor, which has a 32-way switch for 32/64/128 encoded numbers.
        But since the prefixes only indicate the numbers of bytes used for the encoded numbers, no shifts and masks are needed, only byte moves.
        So it could well be worthwhile to give this encoding a try, too, especially for lists of numbers shorter than 16 or 32.
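        A hedged sketch of decoding one such group; the exact prefix layout and the little-endian byte order are assumptions, and a real implementation would replace the inner loop with the 256-way table lookup mentioned above:

          // decodes 4 ints starting at 'pos'; returns the position of the next group
          static int decodeGroup(byte[] buf, int pos, int[] out, int outPos) {
            final int tag = buf[pos++] & 0xFF;                     // 4 x 2-bit length prefixes
            for (int i = 0; i < 4; i++) {
              final int numBytes = ((tag >>> (2 * i)) & 3) + 1;    // 1..4 bytes per value
              int v = 0;
              for (int b = 0; b < numBytes; b++) {
                v |= (buf[pos++] & 0xFF) << (8 * b);
              }
              out[outPos + i] = v;
            }
            return pos;
          }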

        Paul Elschot added a comment -

        A very recent paper with some improvements to PFOR:
        Yan, Ding, Suel,
        Inverted Index Compression and Query Processing with Optimized Document Ordering,
        WWW 2009, April 20-24 2009, Madrid, Spain

        Roughly quoting par. 4.2, Optimizing PForDelta compression:
        For an exception, we store its lower b bits instead of the offset to the next exception in its corresponding slot, while we store the higher overflow bits and the offset in two separate arrays. These two arrays are compressed using the Simple16 method.
        Also, b is chosen to optimize decompression speed. This makes the dependence of b on the data quite simple (in the PFOR above this dependence is more complex), and this improves compression speed.

        Btw. the document ordering there is by URL. For many terms this gives shorter deltas between doc ids, allowing a higher decompression speed for the doc ids.
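        A small sketch of the patching step that scheme implies (the paper stores offsets between successive exceptions; absolute positions and the array names here are only to keep the sketch short, and the two arrays are assumed to have already been decoded from their Simple16 form):

          // slots[] already holds the low b bits of every value;
          // positions[i] is the slot index of the i-th exception, highBits[i] its overflow bits
          static void patchExceptions(int[] slots, int b, int[] positions, int[] highBits) {
            for (int i = 0; i < positions.length; i++) {
              slots[positions[i]] |= highBits[i] << b;   // re-attach the overflow bits
            }
          }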

        Michael McCandless added a comment -

        Attaching sep, intblock and pfordelta codecs, spun out of the last patch on LUCENE-1458.

        Once LUCENE-1458 is in, we should finish the pfordelta codec to make it a real choice.

        I actually think some combination of pulsing, standard, pfordelta and simple bit packing (ordered by increasing term docFreq), within a single codec, may be best.

        Ie, rare terms (only in a doc or two) could be inlined into the terms dict. Slightly more common terms can use the more CPU-intensive standard codec. Common terms can use the cpu-friendly-yet-still-decent-compression pfordelta. Obscenely common terms can use bit packing for the fastest decode.
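        For illustration only, the kind of per-term dispatch that suggests; the thresholds below are made up, not measured, and the enum is a hypothetical placeholder:

          enum PostingsCodec { PULSING, STANDARD_VINT, PFOR_DELTA, BIT_PACKED }

          static PostingsCodec chooseCodec(int docFreq) {
            if (docFreq <= 2) return PostingsCodec.PULSING;           // inline into the terms dict
            if (docFreq <= 128) return PostingsCodec.STANDARD_VINT;   // CPU-intensive but rarely decoded
            if (docFreq <= 100000) return PostingsCodec.PFOR_DELTA;   // decent compression, fast decode
            return PostingsCodec.BIT_PACKED;                          // fastest decode for huge postings
          }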

        Eks Dev added a comment -

        Mike,
        That is definitely the way to go: distribution-dependent encoding, where every Term gets individual treatment.

        Take for example a simple, but not all that rare, case where the index gets sorted on some of the indexed fields (we use it really extensively, e.g. a presorted doc collection on user_rights/zip/city, all indexed). There you get perfectly "compressible" postings by simply managing intervals of set bits. Updates distort this picture, but we rebuild the index periodically and all gets good again. At the moment we load them into RAM as Filters in IntervalSets. If that were possible in Lucene, we wouldn't bother with Filters (VInt decoding on such super-dense fields was killing us, even in RAMDirectory) ...

        Thinking about your comments, isn't pulsing somewhat orthogonal to the packing method? For example, if you load the index into a RAMDirectory, one could avoid one indirection level and inline all postings.

        Flex Indexing rocks; that is going to be the most important addition to Lucene since it started (imo)... I would even bet on double search speed on the first attempt for average queries.

        Cheers,
        eks

        Paul Elschot added a comment -

        To encode the exceptions as suggested by Yan, Ding & Suel, Simple9 or Simple16 can be used. A Simple9 implementation is at LUCENE-2189.
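        For reference, a rough sketch of Simple9 decoding (whether the 4 selector bits sit at the high or low end of the word differs per implementation; this is not the LUCENE-2189 code):

          // each 32-bit word = 4 selector bits + 28 payload bits
          private static final int[] COUNT = {28, 14, 9, 7, 5, 4, 3, 2, 1};
          private static final int[] BITS  = { 1,  2, 3, 4, 5, 7, 9, 14, 28};

          // decodes one word into out[outPos..]; returns the number of values decoded
          static int decodeSimple9Word(int word, int[] out, int outPos) {
            final int selector = word >>> 28;            // assuming the selector is the top 4 bits
            final int n = COUNT[selector];
            final int b = BITS[selector];
            final int mask = (int) ((1L << b) - 1);
            for (int i = 0; i < n; i++) {
              out[outPos + i] = (word >>> (i * b)) & mask;
            }
            return n;
          }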

        Renaud Delbru added a comment - - edited

        Hi,

        I have performed some benchmarks using the PFOR index I/O interface in order to check that the index reader and block reader were not adding too much overhead (I was afraid that the block reading interface was adding too much overhead and, as a consequence, losing the decompression speed benefits of block-based compression).

        In this benchmark, I have just tested FOR (and not PFOR) using various block sizes. The benchmark is set up as follows:

        • I generate an integer array of size 33554432, containing uniformly distributed integers (0 <= x < 65535)
        • I compress the data using PFORDeltaIndexOutput (in fact, a similar class that I modified for this benchmark in order to support various block sizes)
        • I measure the time to decompress the full list of integers using PFORDeltaIndexInput#next()

        I have performed a similar test using a basic IndexInput/Output with VInt encoding. The performance was 39Kints/msec.

        For FrameOfRef, the best block size seems to be 4096 (256 - 270 kints / msec); larger block sizes do not provide significant performance improvement.

        We can observe that FOR provides a ~7x read performance increase. To conclude, it looks like the reader and block reader interface do not add too much overhead ;o).

        P.S.: It is possible that these results are not totally correct. I'll try to double check the code, and upload it here.

        Results:

        BlockSize = 32
        FrameOfRef 0 decompressed 33554432 in 368 msecs, 91 kints/msec, (1 iters).
        FrameOfRef 1 decompressed 33554432 in 314 msecs, 106 kints/msec, (1 iters).
        FrameOfRef 2 decompressed 33554432 in 294 msecs, 114 kints/msec, (1 iters).
        BlockSize = 64
        FrameOfRef 0 decompressed 33554432 in 242 msecs, 138 kints/msec, (1 iters).
        FrameOfRef 1 decompressed 33554432 in 239 msecs, 140 kints/msec, (1 iters).
        FrameOfRef 2 decompressed 33554432 in 237 msecs, 141 kints/msec, (1 iters).
        BlockSize = 128
        FrameOfRef 0 decompressed 33554432 in 223 msecs, 150 kints/msec, (1 iters).
        FrameOfRef 1 decompressed 33554432 in 228 msecs, 147 kints/msec, (1 iters).
        FrameOfRef 2 decompressed 33554432 in 224 msecs, 149 kints/msec, (1 iters).
        BlockSize = 256
        FrameOfRef 0 decompressed 33554432 in 219 msecs, 153 kints/msec, (1 iters).
        FrameOfRef 1 decompressed 33554432 in 218 msecs, 153 kints/msec, (1 iters).
        FrameOfRef 2 decompressed 33554432 in 219 msecs, 153 kints/msec, (1 iters).
        BlockSize = 512
        FrameOfRef 0 decompressed 33554432 in 170 msecs, 197 kints/msec, (1 iters).
        FrameOfRef 1 decompressed 33554432 in 176 msecs, 190 kints/msec, (1 iters).
        FrameOfRef 2 decompressed 33554432 in 173 msecs, 193 kints/msec, (1 iters).
        BlockSize = 1024
        FrameOfRef 0 decompressed 33554432 in 136 msecs, 246 kints/msec, (1 iters).
        FrameOfRef 1 decompressed 33554432 in 139 msecs, 241 kints/msec, (1 iters).
        FrameOfRef 2 decompressed 33554432 in 147 msecs, 228 kints/msec, (1 iters).
        BlockSize = 2048
        FrameOfRef 0 decompressed 33554432 in 133 msecs, 252 kints/msec, (1 iters).
        FrameOfRef 1 decompressed 33554432 in 135 msecs, 248 kints/msec, (1 iters).
        FrameOfRef 2 decompressed 33554432 in 139 msecs, 241 kints/msec, (1 iters).
        BlockSize = 4096
        FrameOfRef 0 decompressed 33554432 in 124 msecs, 270 kints/msec, (1 iters).
        FrameOfRef 1 decompressed 33554432 in 131 msecs, 256 kints/msec, (1 iters).
        FrameOfRef 2 decompressed 33554432 in 131 msecs, 256 kints/msec, (1 iters).
        BlockSize = 8192
        FrameOfRef 0 decompressed 33554432 in 126 msecs, 266 kints/msec, (1 iters).
        FrameOfRef 1 decompressed 33554432 in 128 msecs, 262 kints/msec, (1 iters).
        FrameOfRef 2 decompressed 33554432 in 127 msecs, 264 kints/msec, (1 iters).
        BlockSize = 16384
        FrameOfRef 0 decompressed 33554432 in 127 msecs, 264 kints/msec, (1 iters).
        FrameOfRef 1 decompressed 33554432 in 125 msecs, 268 kints/msec, (1 iters).
        FrameOfRef 2 decompressed 33554432 in 129 msecs, 260 kints/msec, (1 iters).
        BlockSize = 32768
        FrameOfRef 0 decompressed 33554432 in 123 msecs, 272 kints/msec, (1 iters).
        FrameOfRef 1 decompressed 33554432 in 132 msecs, 254 kints/msec, (1 iters).
        FrameOfRef 2 decompressed 33554432 in 135 msecs, 248 kints/msec, (1 iters).
        

        EDIT: Here are new results comparing FOR and a block-based VInt across various block sizes. The decompression loop is repeated until it exceeds 300 ms (to account for the JIT effect, see the post below). The loop recreates a new block reader each time, which causes some overhead, as you can see from the FOR performance results compared to the ones below.

        Block size          32   64  128  256  512  1024  2048  4096  8192  16384  32768
        VInt (kints/msec)   28   30   30   31   48    65    84    94   100    104    104
        FOR (kints/msec)   104  126  131  132  164   195   202   214   217    220    223
        Paul Elschot added a comment -

        Did you see any effect of the JIT? You can check this by making a loop that runs for about 1 second and running that loop 3 times; see the output posted on 3 Oct 2008. When the 2nd and 3rd times are just about equal, one can assume that the JIT is done.
        Perhaps a shorter time than 1 second can be used nowadays.
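
        As an illustration of that warm-up pattern (not code from any patch here), such a measurement loop might look roughly like this; decodeOnce() is just a placeholder for whatever is being measured:

        // Hedged sketch: repeat the measured work for about a second, three
        // times, and only trust the numbers once the later rounds agree.
        public class JitWarmup {
          static int decodeOnce() {                       // placeholder for the real work
            int sum = 0;
            for (int i = 0; i < 1000000; i++) sum += i & 0xFF;
            return sum;
          }

          public static void main(String[] args) {
            for (int round = 0; round < 3; round++) {
              long start = System.nanoTime();
              long calls = 0;
              while (System.nanoTime() - start < 1000000000L) {   // ~1 second
                decodeOnce();
                calls++;
              }
              double msecs = (System.nanoTime() - start) / 1e6;
              System.out.println("round " + round + ": " + (calls / msecs) + " calls/msec");
              // When rounds 2 and 3 report about the same rate, the JIT is done.
            }
          }
        }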

        For the blocks that you've been measuring one might also see an effect of L1/L2 data cache size.
        This effect is probably not present in my previous postings here because I have not yet used this on larger blocks.

        Renaud Delbru added a comment -

        Did you see any effect of the JIT? You can check this by making a loop that runs for about 1 second and running that loop 3 times; see the output posted on 3 Oct 2008. When the 2nd and 3rd times are just about equal, one can assume that the JIT is done.

        I haven't seen any effect of the JIT on FOR.
        I have re-run the benchmark, repeating the decompression until the test time exceeds 300 ms or 1 s (as is done in the TestFrameOfRef#doDecompPerfTestUsing() method), but I haven't observed a difference from the previous benchmark.
        In fact, the effect seems to appear only for a block size of 32; in the other cases it is imperceptible (in the previous benchmark the numbers looked unstable, whereas in the new benchmark they look more stable, but there is no significant increase between the first loop and the later ones).

        In the first benchmark, I was not repeating the loop until 1 second was reached, since there is no easy way to "reset" the reader. In the new benchmark, I am recreating the reader inside the loop (which causes some overhead). Do you think this can affect the JIT?

        For the blocks that you've been measuring one might also see an effect of L1/L2 data cache size.

        Yes, it is probably an effect of the cache size. The goal was to check the performance increase and find the optimal block size. This optimal block size may depend on my hardware (Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz, cache size: 4096 KB).

        Paul Elschot added a comment -

        In the first benchmark, I was not repeating the loop until 1 second was reached, since there is no easy way to "reset" the reader. In the new benchmark, I am recreating the reader inside the loop (which causes some overhead). Do you think this can affect the JIT?

        I don't know whether the JIT deals with the code for individual objects (other than classes and their code). There could be an indirect effect via the garbage collector.

        The uniform random range of about 65000 almost always generates 14-16 bits per number, so these results are roughly comparable to the 14-16 bit cases posted on 6 October 2008 by Michael. As it happens, they are quite close.

        Renaud Delbru added a comment - - edited

        On another note: why does PFOR/FOR encode the number of compressed integers into the block header, since this information is already stored in the stream header (the block size information written in FixedIntBlockIndexOutput#init())? Is there a particular use case for that? Is it for the special case of the last block (when the block encodes the remaining integers of the list)?

        Paul Elschot added a comment -

        The only reason the number of compressed integers is encoded in the block header here is that, when I coded it, I did not know this was unnecessary in Lucene indexes.

        That also means that the header can be used for different compression methods, for example in the following way (cases encoded in the 1st byte; a sketch of such a dispatch follows below):
        • 32 FrameOfRef cases (#frameBits), followed by 3 bytes for #exceptions (0 for BITS, > 0 for PFOR)
        • 16-64 cases for a SimpleNN variant
        • 1-8 cases for run length encoding (for example followed by 3 bytes for length and value)
        Total #cases is 49-104, i.e. 6-7 bits.
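
        Purely to illustrate that header-byte idea (this is not code from any patch here, and the exact case boundaries are assumptions taken from the list above), the decode side could dispatch roughly like this:

        // Hedged sketch: dispatch on a 1-byte block header to one of several
        // decoders. The ranges (32 FrameOfRef cases, then Simple-NN, then RLE)
        // follow the list above; the decoder methods are placeholders.
        final class BlockHeaderDispatch {
          static final int FOR_CASES = 32;      // headers 0..31  -> frameBits = header + 1 (assumed)
          static final int SIMPLE_CASES = 64;   // headers 32..95 -> Simple-NN selector (assumed)
                                                // headers 96..   -> run length encoding (assumed)

          static int[] decodeBlock(byte header, byte[] payload, int numInts) {
            int h = header & 0xFF;
            if (h < FOR_CASES) {
              return decodeFrameOfRef(h + 1, payload, numInts);
            } else if (h < FOR_CASES + SIMPLE_CASES) {
              return decodeSimpleNN(h - FOR_CASES, payload, numInts);
            } else {
              return decodeRunLength(h - FOR_CASES - SIMPLE_CASES, payload, numInts);
            }
          }

          // Placeholders so the sketch compiles; real decoders would go here.
          private static int[] decodeFrameOfRef(int bits, byte[] p, int n) { return new int[n]; }
          private static int[] decodeSimpleNN(int variant, byte[] p, int n) { return new int[n]; }
          private static int[] decodeRunLength(int variant, byte[] p, int n) { return new int[n]; }
        }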

        Run length encoding is good for terms that occur in every document and for the frequencies of primary keys.

        The only concern I have is that the instruction cache might get filled up with the code for all these decoding cases.
        At the moment I don't know how to deal with that other than by adding such cases slowly while doing performance tests all the time.

        Renaud Delbru added a comment -

        The only concern I have is that the instruction cache might get filled up with the code for all these decoding cases.
        At the moment I don't know how to deal with that other than by adding such cases slowly while doing performance tests all the time.

        I am not very familiar with such technologies, and it is new to me to start thinking about this kind of problem.
        However, from what I understand of the WSDM slides about Google's Group Varint encoding, they are using a 256-entry table, which is much larger than the 49-104 cases you're proposing. From what I understand, the instruction cache will be filled with such a table, so it seems you still have some margin compared to Group Varint encoding. Or is there other information that risks being added to the instruction cache?

        Paul Elschot added a comment -

        There is a margin indeed, but it depends on the cache size. For the decoding of all 32 FrameOfRef cases, a rough estimate would be 32 cases * (on average 16 indexed reads, 32 shifts, 32 masks, 32 indexed writes) = 32 * 112, or about 3500 instructions. At 3 bytes per instruction on average that would be about 10 kB, in a typical instruction cache of 16 kB or 32 kB.
        For Simple9 there is a lot less code (fewer cases, less code per case), but when many variations on it are added, perhaps 3 kB would be needed in total.
        So it is not only the number of cases, but also the amount of code per case.
        Fortunately not all cases are used frequently, so the rarely used ones won't stay in the instruction cache.

        Compare that to the code needed for VByte, which is just about nothing.

        During query execution there are other things in the instruction cache too, for example the code of the Scorers used by the query and the code that gets the data from the index.
        Initially the byte code interpreter and the JIT compiler of the JVM are also in the instruction cache, but they normally disappear once the JIT is finished. With more code, though, the JIT will take longer to finish.

        Paul Elschot added a comment -

        Zhang, 2008, see above, reports this:

        The poor speed of variable-byte on position data is primarily due to the fact that position values are larger and more often require 2 bytes under variable-byte; this case tends to be much slower due to a branch mispredict.

        Taking another look at the position data above (3 October 2008), 11.6% of prx values take 7 bits or less, and the rest fit in 15 bits. So why not encode the position data as VShort (1 bit as in VByte and 15 bits data)? That would enlarge a typical prx file by about 6% and increase position decoding speed a lot, probably about 3 times (see Table 1 in the same paper).
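
        For illustration only (this is not the LUCENE-2232 patch), a VShort along those lines could look roughly like this; the DataInput/DataOutput integration point is an assumption:

        // Hedged sketch of the VShort idea: one continuation bit plus up to
        // 15 data bits, analogous to VByte but read two bytes at a time.
        import java.io.DataInput;
        import java.io.DataOutput;
        import java.io.IOException;

        final class VShort {
          // Values 0..32767 take one short; larger values spill into more shorts.
          static void write(DataOutput out, int value) throws IOException {
            while ((value & ~0x7FFF) != 0) {
              out.writeShort((value & 0x7FFF) | 0x8000);   // high bit = "more follows"
              value >>>= 15;
            }
            out.writeShort(value);
          }

          static int read(DataInput in) throws IOException {
            int s = in.readShort() & 0xFFFF;
            int value = s & 0x7FFF;
            int shift = 15;
            while ((s & 0x8000) != 0) {
              s = in.readShort() & 0xFFFF;
              value |= (s & 0x7FFF) << shift;
              shift += 15;
            }
            return value;
          }
        }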

        Michael McCandless added a comment -

        So why not encode the position data as VShort (1 bit as in VByte and 15 bits data) ?

        That sounds interesting... we should give it a shot and see what gains Lucene actually sees on real data!

        Paul Elschot added a comment -

        I opened LUCENE-2232 to use VShort for positions.

        Renaud Delbru added a comment -

        When performing some tests on PFOR, I have noticed that your algorithm favors very small numFrameBits at the cost of a very large number of exceptions. For example, with blocks of size 512, I have noticed that many of the blocks end up with bestFrameBits = 1 and the number of exceptions reaching > 450.

        I found that this is due to the setting of the allowedNumExceptions variable (in the last part of the PFOR#frameBitsForCompression() method), which is set to the current number of exceptions + the maximum allowed (which at the end is generally extremely large). Is it a bug, or is it something I don't understand in the current PFOR algorithm?

        P.S.: btw, the previous benchmark results I posted are wrong, due to a bug caused by the hardcoded byte buffer size (1024) in PForIndexInput/Output. I'll post updated results soon, with a comparison against GroupVarInt (from the WSDM09 Jeff Dean talk).

        Paul Elschot added a comment -

        That might indeed be a bug.
        In general the frameBitsForCompression() method should try and find the number that has the quickest decompression.
        The trade off is between exceptions taking extra time, and a smaller number of frame bits taking less time.
        I have not timed the exception decoding yet, so I would not expect the current implementation of frameBitsForCompression() to be right on the spot.

        Please note that I'd still like to change the exception encoding, see the comment of 12 May 2009:
        "For an exception, we store its lower b bits instead of the offset to the next exception in its corresponding slot, while we store the higher overflow bits and the offset in two separate arrays. These two arrays are compressed using the Simple16 method."

        Renaud Delbru added a comment -

        In general the frameBitsForCompression() method should try and find the number that has the quickest decompression.

        In fact, the method is trying to find the configuration that has the smallest size in terms of bytes (with the test totalBytes <= bestBytes). But it seems that this is not always the best choice for quick decompression. I haven't tested the decompression speed of PFOR yet, but I imagine that with 89% of the values being exceptions (while in the original algorithm exceptions should occur only rarely), it will not give the best performance.

        Why not just compute the index (in the sorted copy array) that corresponds to the percentage of values that should be packed, and encode the remainder of the array as exceptions? (A sketch of what I mean is below.)
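
        Something like the following, just to sketch the suggestion (the 90% packing ratio is an assumption for illustration; this is not code from the patch):

        // Hedged sketch: sort a copy of the block, take the value at the chosen
        // percentile, and use the bits needed for that value as the frame width;
        // everything larger becomes an exception.
        import java.util.Arrays;

        final class PercentileFrameBits {
          static int chooseFrameBits(int[] block, double packedRatio) {
            int[] sorted = block.clone();
            Arrays.sort(sorted);
            int idx = Math.min(sorted.length - 1,
                               (int) Math.ceil(packedRatio * sorted.length) - 1);
            int threshold = Math.max(1, sorted[Math.max(0, idx)]);
            return 32 - Integer.numberOfLeadingZeros(threshold);   // bits needed for threshold
          }

          public static void main(String[] args) {
            int[] block = {3, 7, 2, 900, 5, 6, 4, 8, 1, 65000};
            System.out.println(chooseFrameBits(block, 0.90));      // prints 10: 90% of values fit in 10 bits
          }
        }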

        Paul Elschot added a comment -

        Why not just compute the index (in the sorted copy array) that corresponds to the percentage of values that should be packed, and encode the remainder of the array as exceptions?

        Because there are cases in which no exceptions are needed, for example 90% of the values using the same max nr of bits.

        Anyway, the articles on PFor do not completely specify how this should be done, and they are based on C, not on Java.
        That means there will be room for improvement.

        Renaud Delbru added a comment -

        I am reporting here some experiments I performed over the past weeks while I was trying to improve the FOR implementation.
        I re-implemented the FOR algorithms by getting rid of the IntBuffer and working directly with the byte array. I have implemented multiple variants, such as one that performs the transformation directly over bytes to produce the uncompressed data (what I call byte-level below), and one that first converts bytes into integers and then performs the transformation on integers to produce the uncompressed data (what I call integer-level below). The latter is very similar to your original FOR implementation, but without the IntBuffer.

        I think these results may be of interest to you, especially for optimising certain cases (byte-level manipulation seems more suitable for certain cases such as bit frames of 2, 4 or 8). For space reasons I have attached a file containing a summary of the results; I can provide the raw results and the code if you would like to test it on your side.
        However, I get very different results depending on whether I run the benchmarks on a 64-bit or a 32-bit OS (on the same computer, an IBM T61; the only difference is that one machine runs Ubuntu 9.10 32-bit and the other Ubuntu 9.10 64-bit).

        I am a little bit puzzled by these results. I thought that removing the IntBuffer and working directly with the byte array would be faster (as I noticed with other compression algorithms, such as GroupVarInt). The IntBuffer you are currently using is a view on a byte buffer; it therefore does the conversion from bytes to int, plus several checks (whether the index is in the range of the buffer) and function calls.
        But it seems that with FOR this does not make much of a difference for large integers (> 8 bits). Moreover, I observe a decrease in performance on the 64-bit OS.
        Maybe you have an idea about the difference in behavior.

        Renaud Delbru added a comment -

        File containing a summary of the results for each bit frame, based on variants of FOR.

        Paul Elschot added a comment -

        I thought that removing the IntBuffer and working directly with the byte array would be faster ...

        When the int values are in processor byte order, a call to IntBuffer.get() may be reduced by the JIT to a single hardware instruction. This is why the initial implementation uses IntBuffer.
        Also, the index bound checks need only be done once for the first and last index used.
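
        For readers following along, here is a small self-contained illustration of the two access styles being compared (an IntBuffer view in native byte order versus assembling ints from bytes by hand); it is not the patch's code.

        // Hedged sketch contrasting an IntBuffer view with manual byte assembly.
        import java.nio.ByteBuffer;
        import java.nio.ByteOrder;
        import java.nio.IntBuffer;

        public class IntBufferVsBytes {
          public static void main(String[] args) {
            byte[] raw = new byte[4 * 8];

            // 1) IntBuffer view in native order: the JIT can often turn get(i) into a plain load.
            ByteBuffer bb = ByteBuffer.wrap(raw).order(ByteOrder.nativeOrder());
            for (int i = 0; i < 8; i++) bb.putInt(i * 4, i * 100);
            IntBuffer ib = bb.asIntBuffer();
            for (int i = 0; i < 8; i++) System.out.print(ib.get(i) + " ");
            System.out.println();

            // 2) Manual byte assembly (big-endian here for simplicity; a real
            //    decoder would have to honor the chosen byte order).
            ByteBuffer be = ByteBuffer.wrap(raw).order(ByteOrder.BIG_ENDIAN);
            for (int i = 0; i < 8; i++) be.putInt(i * 4, i * 100);
            for (int i = 0; i < 8; i++) {
              int off = i * 4;
              int v = ((raw[off] & 0xFF) << 24) | ((raw[off + 1] & 0xFF) << 16)
                    | ((raw[off + 2] & 0xFF) << 8) | (raw[off + 3] & 0xFF);
              System.out.print(v + " ");
            }
            System.out.println();
          }
        }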

        I have no idea why a 64 bit OS would be slower than a 32 bit OS.

        Renaud Delbru added a comment - - edited

        Here are some results on the compression/decompression performance of various algorithms over the Wikipedia dataset (English articles).

        First, the performance of compressing/decompressing the full .doc posting list using a block size of 1024.
        Compression and decompression are in kints/msec and the size is in bytes.

        Codec        Compression  Decompression    Size
        FOR (Orig)            20            106  448856
        PFOR (Orig)            8            107  383596
        VINT                  37             74  421764
        FOR (Opt)             39            102  447369
        PFOR (Opt)            10            108  382379
        RICE                   8             31  350687
        S9                    16             65  408218
        • VInt is a block-based version of variable integer (unlike the simple int block codec, I create blocks using VInt, then read the full block into memory and decompress it one integer at a time with VInt; a sketch of this is shown after the list).
        • FOR (Orig) and PFOR (Orig) are your implementations of FOR and PFOR.
        • FOR (Opt) and PFOR (Opt) are my implementations of FOR and PFOR (using an array of functors for each compression/decompression case).
        • Rice is an implementation of the Rice algorithm, but block-based.
        • S9 is your implementation of Simple9 with some bug fixes, but block-based.
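
        As an aside, here is a minimal self-contained sketch of what "block-based VInt" means here: a fixed-size block of ints is written as VInts and decoding always inflates the whole block into an int[] at once. It is an illustration, not the codec used in the benchmark.

        // Hedged sketch of a block-based VInt codec (7 data bits per byte plus
        // a continuation bit, whole blocks encoded/decoded at a time).
        import java.io.ByteArrayInputStream;
        import java.io.ByteArrayOutputStream;

        public class VIntBlock {
          static byte[] encodeBlock(int[] block) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (int v : block) {
              while ((v & ~0x7F) != 0) {
                out.write((v & 0x7F) | 0x80);             // continuation bit set
                v >>>= 7;
              }
              out.write(v);
            }
            return out.toByteArray();
          }

          static void decodeBlock(byte[] bytes, int[] block) {
            ByteArrayInputStream in = new ByteArrayInputStream(bytes);
            for (int i = 0; i < block.length; i++) {
              int b = in.read();
              int v = b & 0x7F;
              int shift = 7;
              while ((b & 0x80) != 0) {
                b = in.read();
                v |= (b & 0x7F) << shift;
                shift += 7;
              }
              block[i] = v;
            }
          }

          public static void main(String[] args) {
            int[] block = {1, 127, 128, 300, 65534};
            int[] back = new int[block.length];
            decodeBlock(encodeBlock(block), back);
            for (int v : back) System.out.print(v + " "); // 1 127 128 300 65534
          }
        }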

        I have created a *IndexInput and *IndexOutput for each algorithm, and use them to compress and decompress the posting file in this experiment.

        We can see that using an array of functors for compression in FOR provides a good speed increase in compression. However, for PFOR the compression speed increase is limited; from my tests this is due to the sort step that is necessary for compressing each block.
        PFOR provides a good balance between decompression speed and size, but it is very slow in compression (as slow as Rice). I don't think it is possible to improve the compression speed much because of the sort step, which seems to impose a hard upper limit on performance (I have tried various heuristics to avoid sorting, but without much success so far).

        Following this first benchmark, I have created a Lucene codec for each of these compression algorithms, using their *IndexInput and *IndexOutput classes, and indexed the full Wikipedia dataset. Each block-based codec is configured to use blocks of 1024.
        I have recorded the average commit time and the optimise time. Then I have executed random keyword queries and recorded the average query time and the number of bytes read to answer each query.

        Compression

        Time is in ms, Size in bytes.

        Codec  Avg Commit Time  Total Commit Time  Optimise Time  Index Size
        STD               1276             359978         113386     1423251
        SEP               1437             405286         117163     1666870
        VINT              1572             426551         122557     1742083
        RICE              1791             505312         230636     1339703
        S9                1575             444157         146216     1530666
        FOR               1520             428847         134502     1754578
        PFOR              1852             522401         189262     1511154
        • STD is the standard Lucene codec.
        • SEP is the Sep codec.
        • All the other codecs are based on the Sep codec, but block-based.
        • FOR and PFOR are my implementations.

        We can see that the Sep codec and the block-based codecs have a certain overhead (in indexing time and index size) compared to the standard Lucene codec, but I imagine that this overhead can be reduced with some code optimisation. For the index size, the overhead of the block-based codecs is due to the skip list, which is much bigger (127 MB for SEP against 189 MB for block-based) since we need to encode both the block offset and the document offset within the block.

        In terms of compression speed and index size, we can see that these results follow the previous ones. We can also observe that PFOR is the slowest.

        Decompression

        For each codec, I have performed five runs, each run consisting of 200 random keyword queries, and averaged the time over the 5 runs and 200 queries (i.e., 200*5 = 1000 query executions).
        For each run, I open a new IndexReader and IndexSearcher and reuse them across the 200 random queries.

        To generate the queries, I first grouped the terms into three groups, HIGH, MEDIUM and LOW, based on their frequency, so that I could generate random keyword queries from these three frequency groups. Then I ran benchmarks with random queries of one, two or more keywords, using all possible combinations (e.g., HIGH & HIGH, HIGH & MEDIUM, etc.). I executed the queries using a single thread, and also using multiple threads.

        I am reporting a few of the query results below; for most of the other queries the results followed the same pattern. I don't report the results of the Standard and Sep codecs, but they always outperform the block-based codecs, whatever compression is used.

        However, the query time results are completely contradictory with the results of the first benchmark. The slowest algorithms there (Rice, VInt) generally provide the fastest query execution times, and the fastest algorithms (FOR, PFOR) provide the slowest query execution times. Simple9 seems to be as fast as VInt.
        I currently have no explanation for that. Maybe the dataset (Wikipedia articles) is too small to see the benefits, but I would rather say there is a problem somewhere. How come VInt and even Rice provide better query execution times than FOR and PFOR, while in the first benchmark it is clear that FOR and PFOR provide nearly a 40% decompression performance increase?
        Do you have any advice on how to improve the benchmark, or ideas about the cause of this drop in performance? I'll be able to re-run the benchmarks if you propose something.

        HIGH:MUST - 1 thread - 200 random queries
        Codec Avg Query Time
        Rice 1.5 ms
        VInt 1.2 ms
        PFOR 2.3 ms
        FOR 2.5 ms
        S9 1 ms
        HIGH:MUST - 2 thread - 200 random queries for each thread
        Codec Avg Query Time
        Rice 2 ms
        VInt 1.9 ms
        PFOR 3 ms
        FOR 3.5 ms
        S9 1.6 ms
        HIGH:MUST HIGH:SHOULD HIGH:SHOULD HIGH:SHOULD - 1 thread - 200 random queries
        Codec Avg Query Time
        Rice 1.5 ms
        VInt 1.2 ms
        PFOR 2.7 ms
        FOR 2.5 ms
        S9 1.3 ms
        HIGH:MUST HIGH:SHOULD HIGH:SHOULD HIGH:SHOULD - 2 threads - 200 random queries for each thread
        Codec Avg Query Time
        Rice 2.5 ms
        VInt 2 ms
        PFOR 3 ms
        FOR 3.4 ms
        S9 2 ms
        Michael McCandless added a comment -

        Thanks for sharing these thorough results Renaud!

        Curious that PFOR/FOR don't do well during searching... have you tried profiling? Maybe something silly is going on.

        One issue is MUST mixed with other clauses – the scoring for such a query will do a lot of seeking, which for block-based codecs will be costly. But it's still strange you don't see speedups for a single-term query. Have you tried only SHOULD clauses?

        Renaud Delbru added a comment -

        Curious that PFOR/FOR don't do well during searching... have you tried profiling? Maybe something silly is going on.

        I will try profiling. But I am surprised, because I am using the same *IndexInput and *IndexOutput as in the first benchmark. So, if there is a problem, it should be "outside" the IndexInput.
        But I'll double-check.

        One issue is MUST mixed with other clauses - the scoring for such a query will do a lot of seeking, which for block-based codecs will be costly. But it's still strange you don't see speedups for a single-term query. Have you tried only SHOULD clauses?

        Here are the results with only SHOULD clauses.

        HIGH:SHOULD HIGH:SHOULD HIGH:SHOULD HIGH:SHOULD - 1 thread - 200 random queries
        Codec Avg Query Time
        Rice 6 ms
        VInt 5 ms
        PFOR 6.5 ms
        FOR 6.8 ms
        S9 4.7 ms
        Paul Elschot added a comment -

        I think the mixed performance results for decompression and query times may be caused by the use of only a single method. For very short sequences (1 to 2 or 3 integers), I would expect VInt (actually VByte) to perform best. For long sequences (from about 25 integers), (P)FOR should do best. In between the two, (a variant of) S9.
        The problem will be to find the optimal boundary sequence sizes at which to change the compression method.

        The fact that S9 is already doing better than VInt is encouraging. Since (P)FOR can do even better than S9, when using (P)FOR only for longer sequences, I'd expect a real performance boost for queries using frequently occurring terms in the index.

        Also, I'd recommend to verify query results for each method. S9 as I implemented it is only tested by its own test cases.
        When the query results are incorrect, measuring performance is not really useful, and this has happened already for the PFOR implementation here, see above in early October 2008.

        Renaud Delbru added a comment -

        The fact that S9 is already doing better than VInt is encouraging. Since (P)FOR can do even better than S9, when using (P)FOR only for longer sequences, I'd expect a real performance boost for queries using frequently occurring terms in the index.

        So you're suggesting that maybe, in my benchmark, the decoded sequences are not long enough to see the performance improvements of FOR/PFOR?

        Also, in my benchmark, VINT means block-based VInt. So it has the same overhead, i.e., decompressing the full block even if only a partial sequence is needed, as FOR, PFOR, S9 and the others. But even with these settings, as you can see, the average query time is smaller when using VInt than with FOR/PFOR.

        Also, I'd recommend to verify query results for each method. S9 as I implemented it is only tested by its own test cases.

        We implemented more unit tests for each codec. Also, in the benchmark, I output the number of hits for each query per codec, and the number of hits is the same for every codec (but I haven't checked whether they all return the same document ids).

        Renaud Delbru added a comment -

        Just an update,

        I just ran the same benchmark, but instead of executing only 200 queries per run I executed 1000 queries, and the results of FOR and PFOR look better. Maybe this is again an effect of the JIT: 200 queries per run were not enough for the JIT to warm up, and therefore FOR and PFOR were penalized.

        I'll re-run the full benchmarks with a larger number of random queries and report the results (hopefully good ones).

        Michael McCandless added a comment -

        Attached patch, just modernizing and merging all prior patches into a single one. All tests pass with the PForDelta codec.

        Paul Elschot added a comment -

        I'm sorry that there is no code yet for a better patching implementation; see my remark of 12 May 2009. This would need some version of Simple9, and I'm still pondering a generalization of that, but I have no timeline for finishing it.
        A rough implementation might just use VByte for such patches.

        Michael McCandless added a comment -

        Very rough patch with some non-committable optimizations.

        The biggest issue was that we were not using the bulk-read API when scoring, because the doc IDs are deltas and because we had to take deletions into account. I hacked around this (and other issues)...

        I ran some perf tests, described at http://chbits.blogspot.com/2010/08/lucene-performance-with-pfordelta-codec.html

        The results are encouraging! FOR/PFOR gets sizable gains for certain queries.

        Michael Busch added a comment -

        Nice blog post, Mike!

        Michael McCandless added a comment -

        OK I committed the prototype impl onto the bulk postings branch.

        hao yan added a comment -

        This patch adds codec support for the PForDelta compression algorithm.

        Changes by Hao Yan (hyan2008@gmail.com)

        In summary, I added five files to support and test the codec.

        In Src,
        1. org.apache.lucene.index.codecs.pfordelta.PForDelta.java
        2. org.apache.lucene.index.codecs.pfordelta.Simple16.java
        3. org.apache.lucene.index.codecs.PForDeltaFixedBlockCodec.java
        4. org.apache.lucene.index.codecs.intblock.FixedIntBlockIndexOutputWithGetElementNum.java

        In Test,
        5. org.apache.lucene.index.codecs.intblock.TestPForDeltaFixedIntBLockCodec.java

        1) The first class, PForDelta, is the core implementation of the PForDelta algorithm; it compresses exceptions using Simple16, which is implemented in the second class, Simple16.
        2) The third class, PForDeltaFixedBlockCodec, is similar to org.apache.lucene.index.codecs.mockintblock.MockFixedIntBlockCodec in test, except that it uses PForDelta to encode the data in the buffer.
        3) The fourth class is almost the same as org.apache.lucene.index.codecs.intblock.FixedIntBlockIndexOutput, except that it provides an additional public function to retrieve the value of the upto field, which is a private field in FixedIntBlockIndexOutput. The reason I added this public function is that the number of elements in the block that have meaningful values is not always equal to the blockSize or the buffer size, since the last block/buffer of a stream usually contains fewer values. In that case, I fill all elements after the meaningful ones with 0s, so we always compress one entire block (a small illustration of this padding follows after this list).

        4) The last class is the unit test to test PForDeltaFixedIntBlockCodec
        which is very similar to
        org.apache.lucene.index.codecs.mintblock.TestIntBlockCodec.

        I also changed the LuceneTestCase class to add the new
        PForDeltaFixeIntBlockCOde.

        The unit tests and all lucence tests have passed.
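
        As an illustration of the padding in point 3), a minimal sketch (a hypothetical helper, not one of the five files above):

            // Copy the meaningful values and leave the rest as 0s, so the codec
            // always sees (and compresses) one full block.
            static int[] padToBlockSize(int[] buffer, int numMeaningful, int blockSize) {
              int[] block = new int[blockSize];             // all slots start at 0 in Java
              System.arraycopy(buffer, 0, block, 0, numMeaningful);
              return block;
            }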

        Michael McCandless added a comment -

        Thanks for the new PForDelta impl! Now we are swimming in int
        encoders...

        I attached a new patch, to get this patch working on the bulkpostings
        branch
        (https://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings).
        I also swapped in assert / throw exception instead of
        System.out.println in some places. All lucene core tests pass w/ this
        codec.

        Some notes/questions:

        • The compress looks somewhat costly? You're allocating several new
          int[], and also the search for the best bit size looks costly
          (though, so does the search for the other prototype pfor we have,
          committed on the branch).
        • Decompress also allocates several new int[]. I think we have to
          reuse these int[] buffers and minimize copying to get good perf...
          (see the sketch after this list)
        • The compressor was assuming that the compressed ints would always
          fit inside the input buffer, but in my case anyway this was not
          true. I'm using a block size of 128 and in one case it wanted 136
          ints in the output, which seems odd since it should just leave the
          ints uncompressed in that case and use only 129 ints (+1 int for
          the header). I changed the code to just return the compressed
          buffer instead of trying to copy over the input buffer.
        • It's nice that this new pfor impl handles all 0s, but, Lucene will
          never hit this (at least today; if we subtracted 1 from our freqs
          then we may hit it). So we can save some CPU by not checking if
          bits==0. We can also avoid null checks, checking for input length
          0, etc. – we can trade this safety for perf. Such checks can be
          done as asserts instead.
        • How come only certain bit sizes are valid? EG 17,18,19 all seem
          to round up to 20?
        • How come bits cannot go up to 31? Or maybe you just use a full
          int if it's over 28? Seems like a good idea...
        • It's neat that you separately encode exc positions & values (high
          bits), leaving low bits in the slot. Does this give better perf
          than linking them together? (I think pfor1 links).
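
        As a sketch of the reuse mentioned in the second bullet (structure only; the decode body is elided and none of these names are in the patch):

            // Keep the scratch arrays for the life of the enum instead of
            // allocating new int[]s on every decompress call.
            final class ReusableBlockBuffers {
              final int[] compressed;   // recycled input buffer
              final int[] decoded;      // recycled output buffer
              final int[] exceptions;   // recycled exception buffer

              ReusableBlockBuffers(int blockSize) {
                compressed = new int[blockSize * 2];   // generous upper bound for a sketch
                decoded = new int[blockSize];
                exceptions = new int[blockSize];
              }
            }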

        Although all tests pass if I run w/ -Dtests.codec=PatchedFrameOfRef2,
        if I try to build a big wikipedia index I hit this:

        java.lang.ArrayIndexOutOfBoundsException: 386
        	at org.apache.lucene.util.pfor2.Simple16.s16Compress(Simple16.java:68)
        	at org.apache.lucene.util.pfor2.PForDelta.compressBlockByS16(PForDelta.java:270)
        	at org.apache.lucene.util.pfor2.PForDelta.compressOneBlockCore(PForDelta.java:221)
        	at org.apache.lucene.util.pfor2.PForDelta.compressOneBlock(PForDelta.java:82)
        	at org.apache.lucene.index.codecs.pfordelta2.PForDeltaFixedIntBlockCodec.encodeOneBlockWithPForDelta(PForDeltaFixedIntBlockCodec.java:87)
        	at org.apache.lucene.index.codecs.pfordelta2.PForDeltaFixedIntBlockCodec$PForDeltaIntFactory$2.flushBlock(PForDeltaFixedIntBlockCodec.java:151)
        	at org.apache.lucene.index.codecs.intblock.FixedIntBlockIndexOutput.write(FixedIntBlockIndexOutput.java:129)
        	at org.apache.lucene.index.codecs.sep.SepPostingsWriterImpl.startDoc(SepPostingsWriterImpl.java:204)
        	at org.apache.lucene.index.codecs.PostingsConsumer.merge(PostingsConsumer.java:79)
        	at org.apache.lucene.index.codecs.TermsConsumer.merge(TermsConsumer.java:97)
        	at org.apache.lucene.index.codecs.FieldsConsumer.merge(FieldsConsumer.java:49)
        	at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:634)
        	at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:147)
        	at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3281)
        	at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2804)
        	at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:339)
        	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:407)
        

        But, then I remembered, I think Simple16 cannot represent values >=
        2^28? Which is a problem here since in general Lucene indexes can
        have such values (though they should be rarish). Maybe I'm hitting
        one here and this is what happens?

        Anyway, I think somehow we need to merge pfor1 and pfor2 to get the
        best of both worlds. I like that pfor2 uses Simple16 for encoding
        exceptions, and how it encodes exceptions (if indeed this is better
        perf), but I like that pfor1 (currently committed on the branch) is
        fast decode (specializes the N bit cases) and allows pure FOR (ie no
        exceptions, you encode to max bit size, which wastes bits but is
        fast).
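
        For reference, the "pure FOR" bit width is simply the width of the largest value in the block, along these lines (a sketch, not the code on the branch):

            // OR-ing all values preserves the highest set bit, which determines
            // the frame width when no exceptions are allowed.
            static int pureForFrameBits(int[] block, int count) {
              int or = 0;
              for (int i = 0; i < count; i++) {
                or |= block[i];
              }
              return Math.max(1, 32 - Integer.numberOfLeadingZeros(or));
            }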

        I'd like to commit this as "pfor2" (on the bulkpostings branch); over
        time we can iterate to merge the two impls, before landing on trunk.

        Michael McCandless added a comment -

        Also VSEncoding
        (http://puma.isti.cnr.it/publichtml/section_cnr_isti/cnr_isti_2010-TR-016.html)
        looks very interesting – faster than PForDelta!

        Robert Muir added a comment -

        How come bits cannot go up to 31? Or maybe you just use a full int if it's over 28? Seems like a good idea...

        Seems like we need to solve this with simple9/simple16 too?

        Although all tests pass if I run w/ -Dtests.codec=PatchedFrameOfRef2,

        if I try to build a big wikipedia index I hit this:

        Mike, I've encountered this problem myself while messing with for/pfor.
        I know for these things we need low-level unit tests, but can we cheat in some way?

        Like a random test that encodes/decodes a ton of integers (including things that would be rare deltas)
        via the codec API?
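
        Something along these lines, for example (a fragment of a JUnit test, sketched only; encodeThenDecode stands in for whatever encoder/codec hook we end up testing):

            import java.util.Random;
            import static org.junit.Assert.assertArrayEquals;

            public void testRandomIntBlockRoundTrip() {
              Random random = new Random(42);
              final int blockSize = 128;
              int[] original = new int[blockSize];
              for (int iter = 0; iter < 10000; iter++) {
                for (int i = 0; i < blockSize; i++) {
                  // mostly small deltas, occasionally a huge one (>= 2^28)
                  original[i] = random.nextInt(100) == 0
                      ? (1 << 28) + random.nextInt(1 << 20)
                      : 1 + random.nextInt(1 << 10);
                }
                // encodeThenDecode is a hypothetical hook into the encoder under test
                assertArrayEquals(original, encodeThenDecode(original));
              }
            }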

        Michael McCandless added a comment -

        Seems like we need to solve this with simple9/simple16 too?

        Yes!

        Like a random test that encodes/decodes a ton of integers (including things that would be rare deltas)
        via the codec API?

        I completely agree: we need heavy low-level tests for the int encoders... I'll stick a nocommit in when I commit!

        Paul Elschot added a comment -

        ... separately encode exc positions & values (high bits), leaving low bits in the slot. Does this give better perf than linking them together? (I think pfor1 links).

        This saves forced exceptions for low numbers of frame bits, so the treatment of exceptions is cleaner.

        Simple16 cannot represent values >= 2^28?

        A simple solution is to treat the high bit of the 28 bit value just the same as in vByte, and allow a vByte to follow in the 28 bit case. The high bit can also be added to the selector easily to avoid testing for it.
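
        For example, the slot/flag arithmetic might look like this (a sketch of the idea only, not an actual Simple16 change):

            static final int LOW27_MASK = (1 << 27) - 1;
            static final int CONTINUE_FLAG = 1 << 27;

            // Encode: the 28-bit slot keeps the low 27 bits; the flag says a vByte
            // with the remaining high bits follows.
            static int slotFor(int value) {
              return (value >>> 27) == 0 ? value : (value & LOW27_MASK) | CONTINUE_FLAG;
            }
            static int vByteRemainder(int value) {  // written only when slotFor set the flag
              return value >>> 27;
            }

            // Decode: recombine the slot with the trailing vByte when the flag is set.
            static int combine(int slot, int vByteValue) {
              return (slot & CONTINUE_FLAG) == 0 ? slot : (slot & LOW27_MASK) | (vByteValue << 27);
            }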

        Michael McCandless added a comment -

        A simple solution is to treat the high bit of the 28 bit value just the same as in vByte, and allow a vByte to follow in the 28 bit case. The high bit can also be added to the selector easily to avoid testing for it.

        That sounds great! Any chance you could fix this up on one of the Simple16 impls? I'd really like to have a Simple9/16 codec to better test our variable int block codec infrastructure...

        Michael McCandless added a comment -

        OK I committed "pfor2" onto the branch.

        I also added a new low-level "encode random ints" test. pfor1 passes it but pfor2 fails (I'm guessing this is the 2^28 limitation of Simple16, but I'm really not sure), so I left the pfor2 random ints test @Ignore for now...

        Paul Elschot added a comment -

        The 1410e patch here has test cases that have not made it into the bulkpostings branch. I'll try and revive these first.

        Michael McCandless added a comment -

        The 1410e patch here has test cases that have not made it into the bulkpostings branch. I'll try and revive these first.

        Ugh, sorry! Thanks!

        Paul Elschot added a comment -

        No need to be sorry. Thanks for taking this on.

        Robert Muir added a comment -

        On LUCENE-2723, I uploaded a "bulk vint" codec that shares most of the same codepath as FOR/PFOR,
        except it writes blocks of 128 vint-encoded integers.

        There are performance numbers there compared to our Standard vint-based codec, as you can see
        it differs dramatically due to other reasons.

        So I thought it would be useful to then compare FOR to this, since it's a good measure of just the compression
        algorithm, but everything else is the same (comparing two 128-block size FixedIntBlock codecs with the same
        index layout, etc etc). This way we compare apples to apples.

        Query                                QPS BulkVInt   QPS FOR   Pct diff
        united~1.0                                   9.43      9.39      -0.5%
        united~2.0                                   2.02      2.02      -0.3%
        unit~1.0                                     6.37      6.36      -0.1%
        unit~2.0                                     6.13      6.21       1.2%
        "unit state"~3                               3.45      3.51       2.0%
        spanNear([unit, state], 10, true)            2.89      2.99       3.3%
        unit*                                       30.04     31.42       4.6%
        unit state                                   8.00      8.40       5.0%
        "unit state"                                 5.97      6.37       6.7%
        spanFirst(unit, 5)                          11.29     12.10       7.2%
        uni*                                        17.36     18.69       7.6%
        +unit +state                                10.99     12.18      10.8%
        +nebraska +state                            65.74     73.06      11.1%
        state                                       28.90     32.37      12.0%
        u*d                                         10.54     12.45      18.1%
        un*d                                        40.06     47.61      18.9%
        Paul Elschot added a comment -

        I had a quick look at the codecs for this, but I couldn't find the answer to this question easily:
        Are the positions here also encoded by the bulk int encoders (VInt and FOR)?

        Michael McCandless added a comment -

        Yes positions are bulk coded too, but we haven't cutover any positional queries yet to use the bulk enum API... we should cutover at least one (I think exact PhraseQuery is probably easiest!).

        Paul Elschot added a comment -

        I'm running into a nocommit for the nio byte buffer allocation in ForDecompress.java.
        Shall I try and move the buffer handling from there into FORIndexInput and PForDeltaIndexInput at the codecs?
        I could leave it as it is, but then the test cases from the 1410e patch would have to be adapted again when the nocommit is fixed.

        Also the package/directory naming o.a.l.util.pfor and o.a.l.index.codecs.pfordelta may be confusing.
        Probably pfordelta could be renamed to pfor, since delta refers to differences (in docids and positions) that are treated elsewhere.
        But I'd rather not change that now.

        Robert Muir added a comment -

        I'm running into a nocommit for the nio byte buffer allocation in ForDecompress.java.
        Shall I try and move the buffer handling from there into FORIndexInput and PForDeltaIndexInput at the codecs?

        I am to blame for this I think! Actually I think the buffer handling could stay and we could just remove the nocommit?
        I've tested everything I can think of and it seems this nio ByteBuffer/IntBuffer approach is always the fastest:
        it's only slower to do it other ways, and it doesn't help to do trickier things like IntBuffer views of MMap even.

        One thing that would be good, is it possible to encode the length in decompressed bytes (or the length in bytes of exceptions)
        into PFOR's int header? this would allow us to remove the wasted per-block int that we currently encode now.

        Then we could "put FOR and PFOR back together" again... sorry I split apart the decompressors to remove the wasted int
        in the FOR case since we can get it from its header already.

        Robert Muir added a comment -

        Sorry, correction (I meant the length in bytes or ints compressed, to tell us how many bytes to read)

        In the FOR case we now do:

            int header = in.readInt();
            final int numFrameBits = ((header >>> 8) & 31) + 1;
            in.readBytes(input, 0, numFrameBits << 4);
        

        But in PFOR we still have "two headers"

            int numBytes = in.readInt(); // nocommit: is it possible to encode # of exception bytes in header?
            in.readBytes(input, 0, numBytes);
            compressedBuffer.rewind();
            int header = compressedBuffer.get();
        
        Paul Elschot added a comment - - edited

        ... is it possible to encode # of exception bytes in header?

        In the first implementation the start index of the exception chain is in the header (5 or 6 bits iirc). In the second implementation (by Hao Yan) there is no exception chain, so the number of exceptions must somehow be encoded in the header.
        That means encoding the # exception bytes in the header would be easier in the second implementation, but it is also possible in the first one.

        I would expect that a few bits for the number of encoded integers would also be added in the header (think 32, 64, 128...).
        The number of frame bits takes 5 bits.
        That means that there are about 2 bytes unused in the header now, and I'd expect 1 byte to be enough to encode the number of bytes for the exceptions. For example a bad case in the first implementation of 10 exceptions of 4 bytes means 40 bytes data, that fits in 6 bits, the same
        bad case in the second implementation would also need to store the indexes of the exceptions in 10*5 bits, for a total of about 48 bytes that can still be encoded in 6 bits. However, I don't know what the worst case # exceptions is. (This gets into vsencoding...)
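
        For concreteness, one possible packing along those lines (purely illustrative; not the header layout of either implementation):

            // 5 bits frame width, 2 bits block-size code (e.g. 32/64/128/256),
            // 8 bits for the number of exception bytes; the rest stays unused.
            static int packHeader(int numFrameBits, int blockSizeCode, int exceptionBytes) {
              assert numFrameBits >= 1 && numFrameBits <= 32;
              assert blockSizeCode >= 0 && blockSizeCode <= 3;
              assert exceptionBytes >= 0 && exceptionBytes <= 255;
              return ((numFrameBits - 1) & 31) | (blockSizeCode << 5) | (exceptionBytes << 7);
            }

            static int frameBits(int header)      { return (header & 31) + 1; }
            static int blockSizeCode(int header)  { return (header >>> 5) & 3; }
            static int exceptionBytes(int header) { return (header >>> 7) & 255; }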

        For the moment I'll just leave this unchanged and get the tests working on the current first implementation.

        Paul Elschot added a comment -

        I've tested everything I can think of and it seems this nio ByteBuffer/IntBuffer approach is always the fastest ...

        Did you also test without a copy (without the readBytes() call) into the underlying byte array for the IntBuffer? That might be even faster,
        and it could be possible when using for example a BufferedIndexInput or an MMapDirectory.
        For decent buffer.get() speed the starting byte would need to be aligned at an int border.

        Robert Muir added a comment -

        Did you also test without a copy (without the readbytes() call) into the underlying byte array for the IntBuffer? That might be even faster,
        and it could be possible when using for example a BufferedIndexInput or an MMapDirectory.
        For decent buffer.get() speed the starting byte would need to be aligned at an int border.

        Yes, for the mmap case I tried the original dangerous hack, exposing an IntBuffer view of its internal mapped byte buffer.
        I also tried MMapIndexInput keeping track of its own IntBuffer view.

        we might be able to have some gains by allowing a directory to return an IntBufferIndexInput of some sort (separate from DataInput/IndexInput)
        that basically just positions an IntBuffer view (the default implementation would fill from an IndexInput into a ByteBuffer like we do now),
        but I haven't tested this across all the directories yet... it might help NIOFS though as it would bypass the double-buffering of BufferedIndexInput.
        For SimpleFS it would be the same, and for MMap I'm not very hopeful it would be better, but maybe not worse.

        If that worked maybe we could do the same with Long, for things like simple-8b (http://onlinelibrary.wiley.com/doi/10.1002/spe.948/abstract)

        Paul Elschot added a comment -

        I tried to revive the tests from the 1410e patch, but it does not make much sense because they test rather short sequences of input to be compressed, and the decompressor is now hardcoded to always decompress 128 values.

        Paul Elschot added a comment -

        .. we might be able to have some gains by allowing a directory to return an IntBufferIndexInput of some sort (separate from DataInput/IndexInput) that basically just positions an IntBuffer view (the default implementation would fill from an indexinput into a bytebuffer like we do now),

        Since things are moving on the nio buffer front (see also LUCENE-2292), how about trying to be independent from the buffer implementation?
        That might be done by allowing an IntBuffer wrapping a byte array, or as a view alongside a ByteBuffer, or a temporary IntBuffer as above.

        To be independent from the buffer implementation we could add some methods to IndexInput:

        void startAlignToInt() // basically a seek to the next multiple of 4 bytes when not already there. Could also start using an IntBuffer somehow.
        int readAlignedInt() // get the next int, default to readInt(), use an IntBuffer when available.
        void endAlignToInt() // switch back to byte reading, set the byte buffer to the position corresponding to the int buffer.

        (Adding this to DataInput seems to be a more natural place, but DataInput cannot seek.)

        Would that work, and could it work fast?
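
        A rough sketch of what the defaults could look like, written here as a wrapper purely for illustration (in the proposal the methods would live on IndexInput itself, and an IntBuffer-backed input would override readAlignedInt() with a plain buffer.get()):

            import java.io.IOException;
            import org.apache.lucene.store.IndexInput;

            class AlignedIntReader {
              private final IndexInput in;

              AlignedIntReader(IndexInput in) { this.in = in; }

              void startAlignToInt() throws IOException {
                long pos = in.getFilePointer();
                long rem = pos & 3;                  // bytes past the previous int boundary
                if (rem != 0) {
                  in.seek(pos + (4 - rem));          // move to the next multiple of 4
                }
              }

              int readAlignedInt() throws IOException {
                return in.readInt();                 // default: plain byte-oriented read
              }

              void endAlignToInt() {
                // nothing to do by default: the byte position already tracks the int reads
              }
            }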

        The Alchemist added a comment -

        Out of curiosity, is the PFOR effort dead? I was thinking about running some newer benchmarks using Java 7 to see if that makes a difference.

        Do you guys think that's worthwhile?

        Michael McCandless added a comment -

        Out of curiosity, is the PFOR effort dead?

        Nothing in open source is ever dead! (Well, rarely...). It's just that nobody has picked this up again and pushed it to a committable state.

        I think now that we have no more bulk API in trunk, it may not be that much work to finish... though there could easily be surprises.

        I opened LUCENE-3892 to do exactly this, as a Google Summer of Code project.

        Paul Elschot added a comment -

        There is no point in continuing this here with LUCENE-3892 well on its way.

        Paul Elschot added a comment -

        See LUCENE-3892

          People

          • Assignee:
            Unassigned
            Reporter:
            Paul Elschot
          • Votes:
            1
            Watchers:
            7

              Time Tracking

              Estimated: 21,840h
              Remaining: 21,840h
              Logged: Not Specified
