Lucene - Core
LUCENE-4609

Write a PackedIntsEncoder/Decoder for facets

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Not a Problem
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/facet
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Today the facets API lets you write IntEncoder/Decoder to encode/decode the category ordinals. We have several such encoders, including VInt (default), and block encoders.

      It would be interesting to implement and benchmark a PackedIntsEncoder/Decoder, with potentially two variants: (1) one that receives bitsPerValue up front, e.g. when you know you have a small taxonomy and the maximum value you can see, and (2) one that decides on the optimal bitsPerValue for each doc and writes it as a header in the byte[] or something.
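      For illustration, here is a minimal standalone sketch of the bit packing behind variant (1), where the caller supplies bitsPerValue up front (the method names are hypothetical; this is not the facet module's IntEncoder/IntDecoder API). Variant (2) would additionally prepend the chosen bitsPerValue as a small per-document header.

        // Pack ordinals MSB-first using a fixed, caller-supplied bitsPerValue.
        static byte[] pack(int[] ords, int bitsPerValue) {
          byte[] out = new byte[(ords.length * bitsPerValue + 7) / 8];
          int bitPos = 0;
          for (int ord : ords) {
            for (int b = bitsPerValue - 1; b >= 0; b--) {
              if (((ord >>> b) & 1) != 0) {
                out[bitPos >>> 3] |= 1 << (7 - (bitPos & 7));
              }
              bitPos++;
            }
          }
          return out;
        }

        // Decode numValues ordinals back out of the packed byte[].
        static int[] unpack(byte[] bytes, int numValues, int bitsPerValue) {
          int[] ords = new int[numValues];
          int bitPos = 0;
          for (int i = 0; i < numValues; i++) {
            int v = 0;
            for (int b = 0; b < bitsPerValue; b++) {
              v = (v << 1) | ((bytes[bitPos >>> 3] >>> (7 - (bitPos & 7))) & 1);
              bitPos++;
            }
            ords[i] = v;
          }
          return ords;
        }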

      1. LUCENE-4609.patch
        9 kB
        Gilad Barkai
      2. LUCENE-4609.patch
        8 kB
        Michael McCandless
      3. LUCENE-4609.patch
        7 kB
        Michael McCandless
      4. LUCENE-4609.patch
        8 kB
        Michael McCandless
      5. LUCENE-4609.patch
        75 kB
        Adrien Grand
      6. SemiPackedEncoder.patch
        12 kB
        Gilad Barkai

          Activity

          Adrien Grand added a comment - edited

          I just looked at IntDecoder/IntEncoder and wrapping a PackedInts ReaderIterator/Writer (or maybe directly a Decoder/Encoder) looks easy to implement. Don't hesitate to let me know if you have questions regarding the PackedInts API!

          Shai Erera added a comment -

          Thanks Adrien. I would like to implement a specialized Encoder/Decoder, rather than wrapping them w/ PackedInts Reader/Writer. I sure would appreciate your review once I have a patch!

          Gilad Barkai added a comment -

          Attached a PackedEncoder, which is based on PackedInts. Currently only the approach of a 'per-document' bits-per-value is implemented.

          I'm not convinced the header can be spared: at the very least, the number of bits to ignore at the end of the stream must be written. E.g. with 2 bits per value and 17 values, 34 bits are needed, but everything is written in whole bytes, so 6 trailing bits must be ignored.

          Updated EncodingTest and EncodingSpeed, and found that the compression factor is not that good, probably due to large values which bump the required bits-per-value higher.

          Started to look into a semi-packed encoder, which could encode most values in a packed manner, but could also add large values as, e.g., vints.
          Example: for 6 bits per value, all values 0-62 are packed, while a packed value of 63 (all 1s) is a marker that the next value is written in a non-packed manner (say vint, Elias delta, whole 32 bits...).
          This should improve the compression factor when most ints are small and only a few are large.
          Impact on encoding/decoding speed remains to be seen.
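          (For illustration, a minimal standalone sketch of the escape-marker idea above, assuming 6 bits per packed value, 63 reserved as the marker, and escaped values written as raw 32-bit ints rather than vints; this is not the attached patch, and a decoder would mirror the same steps:)

            // Values 0-62 are packed in 6 bits; 63 is a marker meaning the next
            // 32 bits hold the value unpacked. Trailing bits are zero-padded.
            static byte[] encodeSemiPacked(int[] values) {
              final int BPV = 6, MARKER = 63;
              java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
              int acc = 0, accBits = 0;                          // small bit accumulator
              for (int v : values) {
                if (v < MARKER) {
                  acc = (acc << BPV) | v;                        // packed path
                  accBits += BPV;
                } else {
                  acc = (acc << BPV) | MARKER;                   // escape marker ...
                  accBits += BPV;
                  for (int shift = 24; shift >= 0; shift -= 8) { // ... then the raw value
                    while (accBits >= 8) { accBits -= 8; out.write((acc >>> accBits) & 0xFF); }
                    acc = (acc << 8) | ((v >>> shift) & 0xFF);
                    accBits += 8;
                  }
                }
                while (accBits >= 8) { accBits -= 8; out.write((acc >>> accBits) & 0xFF); }
              }
              if (accBits > 0) {
                out.write((acc << (8 - accBits)) & 0xFF);        // zero-pad the last byte
              }
              return out.toByteArray();
            }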

          Michael McCandless added a comment -

          Cool! Do you encode the gaps or the straight up ords? (Looks like straight up ords?).

          Started to look into a semi-packed encoder, which could encode most values in a packed manner, but could also add large values as, e.g., vints.

          This is PForDelta compression (the outliers are encoded separately) I think? We can test it and see if it helps ... but we weren't so happy with it for encoding postings (it adds complexity, slows down decode, and didn't seem to help that much in reducing the size).

          How much worse was compression with straight up packed ints vs vInt(gap)? The byte-header-per-doc is annoying but I don't think we can do much about it today ... really we need a DocValues that allows multiple ints per field (int[]). It would write the bpv once in the header (for the entire segment) and then all docs would use that bpv (and we'd separately have to write doc -> offset). But this is a big change (we need multi-valued DV first)...

          Hmm, it seems like you are writing the full header per field (which is very wasteful)? We should be able to write only one byte (the bpv), if you use getWriterNoHeader instead?

          Gilad Barkai added a comment -

          Do you encode the gaps or the straight up ords?

          Well, it's an 'end point' encoder, meaning it encodes whatever values it receives directly to the output.
          One could create an encoder as: new SortingIntEncoder(new UniqueValuesIntEncoder(new DGapIntEncoder(new PackedEncoder()))), so the values the packed encoder would receive are already after sort, unique and dgap.

          This is PForDelta compression (the outliers are encoded separately) I think? We can test it and see if it helps ... but we weren't so happy with it for encoding postings (it adds complexity, slows down decode, and didn't seem to help that much in reducing the size).

          PForDelta is indeed slower. But we've met scenarios in which most dgaps are small - hence the NOnes, and the Four/Eight Flag encoders. If indeed most values are small, say they could fit in 4 bits, but there are also one or two larger values which would require 12 or 14 bits, we could benefit here greatly.
          This is all relevant only where there are large amounts of categories per document.

          it seems like you are writing the full header per field

          That is right. To be frank, I'm not 100% sure what PackedInts does.. nor how large its header is..
          But I think perhaps some header per doc is required anyway? For bits-per-value smaller than the size of a byte, there's a need to know how many bits should be left out from the last read byte.

          I started writing my own version as a first step toward the 'mixed' version, in which a 1-byte header is written that contains both the 'bits per value' in the first 5 bits and the number of extra bits in the last 3 bits. I'm still playing with it, hope to share it soon.
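          (A tiny sketch of such a one-byte header, assuming the layout described above: bits-per-value in the high 5 bits, leftover bits in the low 3 bits. A real implementation might store bitsPerValue-1 so that 32 bpv also fits in 5 bits:)

            static byte writeHeader(int bitsPerValue, int leftoverBits) {
              return (byte) ((bitsPerValue << 3) | leftoverBits);   // bpv: 0..31, leftover: 0..7
            }

            static int readBitsPerValue(byte header) {
              return (header >>> 3) & 0x1F;                         // high 5 bits
            }

            static int readLeftoverBits(byte header) {
              return header & 0x07;                                 // low 3 bits
            }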

          Michael McCandless added a comment -

          Well, it's an 'end point' encoder, meaning it encodes whatever values it receives directly to the output.

          Ahh right, OK.

          PForDelta is indeed slower. But we've met scenarios in which most dgaps are small - hence the NOnes, and the Four/Eight Flag encoders.

          OK makes sense.

          If indeed most values are small, say they could fit in 4 bits, but there are also one or two larger values which would require 12 or 14 bits, we could benefit here greatly.
          This is all relevant only where there are large amounts of categories per document.

          Right ... I'm just wondering how often this happens in "typical" (if there is such a thing) facet apps. Decode speed trumps compression ratios here, I think.

          That is right. To be frank, I'm not 100% sure what PackedInts does.. nor how large its header is..

          The header is very large ... really you should only need 1) bpv, and 2) bytes.length (which I think you already have, via both payloads and DocValues). If the PackedInts API isn't flexible enough for you to feed it bpv and bytes.length then let's fix that!

          For bits-per-value smaller than the size of a byte, there's a need to know how many bits should be left out from the last read byte.

          Hopefully you don't need to separately encode "leftover unused bits" ... ie byte[].length (which is "free" here, since codec already stores this) should suffice.

          Gilad Barkai added a comment -

          Hopefully you don't need to separately encode "leftover unused bits" ... ie byte[].length (which is "free" here, since codec already stores this) should suffice.

          I'm missing something.. if there are 2 bits per value, and the codec knows it's only 1 byte, there could be either 1, 2, 3 or 4 values in that single byte. How could the decoder know when to stop without knowing how many bits at the end should be ignored?

          Michael McCandless added a comment -

          I'm missing something.. if there are 2 bits per value, and the codec knows it's only 1 byte, there could be either 1, 2, 3 or 4 values in that single byte. How could the decoder know when to stop without knowing how many bits at the end should be ignored?

          Ahh you're right! OK. So I think the custom header must include both bpv and "wasted bits". Hmm, but only if bpv is "small enough" to be ambiguous right?

          I guess another option is to leave those bits as 0s so that the decoded ord is 0, which is the "reserved" root ord and so counting that is harmless maybe? Tricky ...
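          (For reference, a small sketch of the "wasted bits" arithmetic, assuming byte[].length comes for free from the codec as discussed above; the ambiguity only arises when bpv < 8, since with 8 or more bits per value the byte length alone pins down the count:)

            // How many padding bits end up in the last byte for a given doc.
            static int leftoverBits(int numValues, int bitsPerValue) {
              int totalBits = numValues * bitsPerValue;
              return (8 - (totalBits & 7)) & 7;   // 0..7 unused bits, e.g. 17 values * 2 bpv -> 6
            }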

          Gilad Barkai added a comment -

          In a unique dgap encoding there's no zero, so in order to save that little extra bit, every gap could be encoded as (gap - 1). Tricky is the word
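          (A trivial sketch of that trick, assuming the values are already sorted and unique; the first ordinal is left absolute:)

            // Consecutive gaps are always >= 1 after sort + unique, so store (gap - 1)
            // and add 1 back at decode time; e.g. gaps of 1..64 then fit in 6 bits.
            static void toZeroBasedGaps(int[] sortedUniqueOrds) {
              for (int i = sortedUniqueOrds.length - 1; i > 0; i--) {
                sortedUniqueOrds[i] = sortedUniqueOrds[i] - sortedUniqueOrds[i - 1] - 1;
              }
            }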

          Adrien Grand added a comment -

          Attached a PackedEncoder, which is based on PackedInts.

          Nice! You could probably improve memory efficiency and speed of the decoder by using a ReaderIterator instead of a Reader:

          • getReader: consumes the packed array stream and returns an in-memory packed array,
          • getDirectReader: does not consume the whole stream and returns an impl that uses IndexInput.seek to look up values,
          • getReaderIterator: returns a sequential iterator which bulk-decodes values (the "mem" parameter allows you to control the speed/memory-efficiency trade-off), so it will be much faster than iterating over the values of getReader.

          For improved speed, getReaderIterator has the next(int count) method, which returns several values in a single call; this proved to be faster. Another option could be to directly use PackedInts.Encoder/Decoder similarly to Lucene41PostingsFormat (packed writers and reader iterators also use them under the hood).

          This is PForDelta compression (the outliers are encoded separately) I think? We can test it and see if it helps ... but we weren't so happy with it for encoding postings

          If the packed stream is very large, another option is to split it into blocks that all have the same number of values (but different number of bits per value). This should prevent the whole stream from growing because of rare extreme values. This is what the stored fields index (with blocks of 1024 values) and Lucene41PostingsFormat (with blocks of 128 values) do. Storing the min value at the beginning of the block and then only encoding deltas could help too.
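          (As a rough, standalone sketch of how the block idea trades header overhead against per-block bpv, assuming a small fixed header per block of a vlong min, at most 5 bytes here, plus one bpv byte; this only estimates the size, it is not an actual encoder:)

            static long estimateBlockedSizeInBytes(int[] values, int blockSize) {
              long totalBits = 0;
              for (int start = 0; start < values.length; start += blockSize) {
                int end = Math.min(start + blockSize, values.length);
                int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE;
                for (int i = start; i < end; i++) {
                  min = Math.min(min, values[i]);
                  max = Math.max(max, values[i]);
                }
                int bpv = 32 - Integer.numberOfLeadingZeros(Math.max(1, max - min));
                totalBits += (5 + 1) * 8L                  // per-block header: vlong(min) + byte(bpv)
                           + (long) (end - start) * bpv;   // (value - min) packed with the block's bpv
              }
              return (totalBits + 7) / 8;
            }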

          The header is very large ... really you should only need 1) bpv, and 2) bytes.length (which I think you already have, via both payloads and DocValues). If the PackedInts API isn't flexible enough for you to feed it bpv and bytes.length then let's fix that!

          Most PackedInts methods have a "*NoHeader" variant that does the exact same job without relying on a header at the beginning of the stream (LUCENE-4161); I think this is what you are looking for. We should probably make this header stuff opt-in rather than opt-out (by replacing getWriter/Reader/ReaderIterator with the NoHeader methods and adding a method dedicated to reading/writing a header).

          Adrien Grand added a comment -

          Gilad, I created LUCENE-4643 which I assume should be better than PackedInts.Writer and PackedInts.ReaderIterator for your use-case? It doesn't write heavyweight headers (meaning that you need to know the PackedInts version and the size of the stream otherwise) and encodes data in fixed-size blocks.

          Gilad Barkai added a comment -

          Thank you Adrien!
          I'll look into it.

          Michael McCandless added a comment -

          Patch, w/ a "custom" (not using our PackedInts APIs) packed ints encoder/decoder. It only uses as many bytes as are necessary, and packs bpv & "leftoverBits" into a single byte header.

          I tested on the first 1M Wikipedia docs ... and performance is much worse than the current default in trunk... admittedly it's not quite fair (trunk has a specialized vInt/dGap decoder, while the patch leaves dGap separate from the packed int decode), and admittedly this decoder will be slower than the optimized oal.util.PackedInts ... but perf is so far off that I find it hard to believe PackedInts can match vInt even after optimizing.

          Trunk gets these results:

                              Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                          PKLookup      203.77      (1.8%)      202.25      (1.8%)   -0.7% (  -4% -    2%)
                          HighTerm       20.43      (1.8%)       20.53      (0.8%)    0.5% (  -2% -    3%)
                           MedTerm       33.12      (1.7%)       33.30      (0.9%)    0.5% (  -2% -    3%)
                           LowTerm       87.55      (3.0%)       88.59      (2.5%)    1.2% (  -4% -    6%)
          

          Patch gets this:

                              Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                          HighTerm       10.82      (3.6%)       10.69      (4.4%)   -1.2% (  -8% -    7%)
                           MedTerm       19.33      (3.2%)       19.10      (4.0%)   -1.2% (  -8% -    6%)
                           LowTerm       67.75      (2.8%)       67.11      (3.0%)   -0.9% (  -6% -    5%)
                          PKLookup      196.49      (1.0%)      196.24      (1.9%)   -0.1% (  -3% -    2%)
          

          (NOTE: base/comp are the same in each run, so ignore the differences w/in each run (it's noise) and compare absolute across the two runs ... ie HighTerm gets ~20.43 QPS with trunk but ~10.82 with patch).

          Also: trunk took ~63 MB for the DV files while patch took ~84 MB. Net/net I think postings compress better with PackedInts than facet ords (at least for these 9 facet fields I'm using in Wikipedia)...

          Shai Erera added a comment -

          How come the order of the lines in the two tables is different? At first I thought you misread the table, because the results don't look that bad.

          You can very easily push DGap into the encoder? But judging from LUCENE-4686, that got us a 6% improvement there, so it's not likely to double the decoding speed here.
          Also, I see that you have a static MASK table. I wonder, given the recent thread about FixedBitSet and static mask tables (people were against it because a couple of bitwise ops would do better), if getting rid of it would improve perf.
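          (For reference, assuming the MASK table holds the usual "k low bits set" masks, the lookup can be replaced by a shift, e.g.:)

            static long mask(int bits) {
              return (1L << bits) - 1;   // bits low bits set, valid for 0 <= bits <= 63
            }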

          I think that the main problem with these ordinal encodings is that we usually encode small groups of integers. So just for fun, I ran EncodingSpeed with 4 encoders/decoders, encoding 2430 values in one go (a single call made to encode). The results are below:

          Encoder                            Bits/Int          Encode Time                Encode Time          Decode Time                Decode Time
                                                             [milliseconds]        [microsecond / int]       [milliseconds]        [microsecond / int]
          -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
          Sorting(Unique(DGap(PackedInts)))  11.0189                 8561                    85.6105                  590                     5.9000
          Sorting(Unique(DGap(Packed)))      11.0058                 8206                    82.0605                  806                     8.0601
          Sorting(Unique(DGap(VInt8)))        8.5597                 7133                    71.3305                  263                     2.6300
          Sorting(Unique(DGapVInt8))          8.5597                 7208                    72.0805                  251                     2.5100
          
          • PackedInts is an encoder similar to what Gilad wrote here, which uses PackedInts.Writer/ReaderNoHeader (writes a 1-byte header itself).
          • Packed is Mike's encoder
          • I wrapped all encoders with DGap (except for the last one which is specialized)

          Notice that:

          • Both Packed versions achieve worse compression than VInt (confirms Mike's DV file results above)
          • Both VInt8 versions decode much faster than the Packed versions.
          • Surprisingly, though this may be an artifact of micro-benchmarking, Packed (Mike's specialized encoder) decodes slower than PackedInts. I wonder if luceneutil would confirm that too.

          I think that if we had some statistics about the resulting Wikipedia index, e.g. min/max ord per document (the final integer, after dgap), we could tell better whether there's any point in continuing these investigations. Since luceneutil search perf is happier w/ VInt and, more so, the resulting index is smaller by ~25%, I tend to think that packed ints won't get us far here.

          With postings, I think that the numbers are smaller? e.g. for dense posting lists, with docID gaps, you'll get a small bpv?
          Mike, is it possible to print the min/max bpv used during the indexing? For instance, I printed that info when running EncodingSpeed, and got 11 (there's a single chunk to encode, so min=max). I wonder what the result will be for the index. I added these as public static members to PackedIntEncoder, so I think you can do the same for the index.

          Also, 11 bpv means that every value is encoded in ~1.5 bytes, plus 1 header byte. So for 25 ords, we're looking at ~39 bytes per document. If most of the values are small though (and this happens with ordinals, even with dgap, because the first value is the largest), we pay a high penalty for each value, because bpv will be large.

          Maybe in order to beat VInt we need a smarter encoder. E.g. the facets package has FourFlags and EightFlags encoders which are very good for really low numbers (Four for 1,2,3 and Eight for 1). Gilad had an idea above to use a static bpv=6 and encode all values that are 1-63 as 0-62 in 6 bits, while values of 64+ are encoded as 63 (all 1's) followed by a VInt. I ran a short analysis using EncodingSpeed (I should say here that these 2430 integers are actual category ordinals of a real document from one application, so these are not just made-up numbers):

          • maxValue=1928 (requires 11 bits)
          • numValuesLessThan64=2169 (out of 2430!)

          So if we had encoded using Gilad's idea, we would end up with

          • 261 "large" values, at most 2 bytes-per-value = 522 bytes (4176 bits)
          • 2169 values encoded with 6 bpv = 135.5625 longs = 1084.5 bytes (8676 bits)
          • avg bits/int = 12852 / 2430 = 5.28

          Which is better than VInt. I omitted the 1 byte header here, because of the large number of values that are encoded at once, but it should add some overhead for documents with very few categories (e.g. OrdinalPolicy.NO_PARENTS). Also, this is the "best" compression that we could get for these numbers, but if we really encoded them I suspect it would be worse. Because we cannot reorder after sorting+dgap, so e.g. if we have a series of 5-1 values (5 small, 1 large), we would waste 2 bits on the 5 small ones. 2 bits here, 2 bits there, they add up.

          Still, I wonder what the speed of such a decoder would be. And what are the statistics for the real Wikipedia index, with its 25 ords per doc? Encoders like the one above benefit greatly when there is a large number of values (because there are more chances for small values).

          Michael McCandless added a comment -

          Here's another attempt (totally prototype / not committable) at using PackedInts to hold the ords ...

          It's hacked up: it visits all byte[] from DocValues in the index and converts to in-RAM PackedInts arrays, and then does all facet counting from those arrays.

          But, the performance is sort of 'meh':

                              Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                           MedTerm      109.40      (1.5%)      102.06      (1.5%)   -6.7% (  -9% -   -3%)
                        AndHighLow      374.95      (3.0%)      361.19      (2.6%)   -3.7% (  -8% -    1%)
                        AndHighMed      172.57      (1.5%)      169.35      (1.1%)   -1.9% (  -4% -    0%)
                           Prefix3      177.54      (6.2%)      174.26      (8.0%)   -1.8% ( -15% -   13%)
                            IntNRQ      116.07      (7.5%)      113.97      (9.3%)   -1.8% ( -17% -   16%)
                            Fuzzy2       86.19      (2.4%)       85.16      (2.8%)   -1.2% (  -6% -    4%)
                       AndHighHigh       46.76      (1.4%)       46.36      (1.1%)   -0.8% (  -3% -    1%)
                           LowTerm      146.56      (1.8%)      145.58      (1.4%)   -0.7% (  -3% -    2%)
                          HighTerm       26.35      (2.0%)       26.20      (2.1%)   -0.6% (  -4% -    3%)
                       MedSpanNear       64.98      (2.3%)       64.62      (2.8%)   -0.5% (  -5% -    4%)
                   LowSloppyPhrase       67.07      (2.3%)       66.80      (3.6%)   -0.4% (  -6% -    5%)
                         OrHighMed       25.18      (1.6%)       25.10      (2.1%)   -0.3% (  -3% -    3%)
                          Wildcard      256.33      (3.1%)      255.56      (3.5%)   -0.3% (  -6% -    6%)
                          PKLookup      305.42      (2.3%)      304.72      (2.1%)   -0.2% (  -4% -    4%)
                         OrHighLow       24.59      (1.3%)       24.54      (2.2%)   -0.2% (  -3% -    3%)
                            Fuzzy1       81.38      (3.0%)       81.60      (2.7%)    0.3% (  -5% -    6%)
                           Respell      141.17      (3.8%)      141.87      (3.9%)    0.5% (  -6% -    8%)
                       LowSpanNear       38.34      (3.2%)       38.78      (3.0%)    1.1% (  -4% -    7%)
                   MedSloppyPhrase       63.80      (2.1%)       64.53      (3.5%)    1.1% (  -4% -    6%)
                      HighSpanNear       10.20      (2.8%)       10.32      (3.1%)    1.2% (  -4% -    7%)
                         MedPhrase      103.16      (4.5%)      104.72      (2.1%)    1.5% (  -4% -    8%)
                        OrHighHigh       17.81      (1.5%)       18.18      (2.7%)    2.1% (  -2% -    6%)
                         LowPhrase       58.77      (5.5%)       60.49      (3.0%)    2.9% (  -5% -   12%)
                        HighPhrase       38.68     (10.0%)       40.46      (5.6%)    4.6% ( -10% -   22%)
                  HighSloppyPhrase        2.97      (7.9%)        3.22     (12.6%)    8.3% ( -11% -   31%)
          
          

          Maybe if I used the bulk read PackedInts APIs instead it would be better...

          Shai Erera added a comment -

          Perhaps this should belong to a different issue? I mean, this one is focused on a PackedInts encoder/decoder (or any other encoder/decoder that's better than VInt).

          Separately, it's interesting that this performs worse than DV Source + decoding. I mean, the results don't factor in reading and populating the cache, right? The test is already "hot" when it's measured?

          I must say that I'm not entirely surprised ... having recently looked at the PackedInts API, it doesn't look so optimized (working w/ DataInput for example), whereas the dgap+vint that we have in CountingFC is very optimized. I think that what we need is a custom cache which encodes things similarly to PackedInts ... or maybe, as you say, the bulk read methods would do better.

          But we should explore that on another issue?

          Michael McCandless added a comment -

          Also, the on-disk size of the vInt(dGap) encoded ords is 63880 KB while the in-RAM packed ints size was 74501 KB. Maybe if we block-coded the packed ints parts we'd get better compression ...

          Michael McCandless added a comment -

          Perhaps this should belong to a different issue?

          I think this is the right issue to explore whether packed ints can be smaller/faster for facets?

          Ie I think we should iterate on this prototype/specialized collector, and if we don't see net gains with it then I don't think we should pursue packed ints encoder/decoder. This serves as the litmus test.

          Michael McCandless added a comment -

          On the DV 2.0 branch the on-disk size is 55288 KB (~16% smaller): cool!

          Shai Erera added a comment -

          and if we don't see net gains with it then I don't think we should pursue packed ints encoder/decoder

          That's right. But even if we see net gains, it doesn't tell us anything about how it will perform on a small set of integers.
          That's why I think this test has nothing to do w/ the Encoder/Decoder.

          But I don't mind if this experiment is done here anyway.

          Michael McCandless added a comment -

          New prototype collector, this time using simple int[] instead of PackedInts.

          Trunk (base) vs prototype collector (comp):

                              Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                            IntNRQ      114.81      (6.2%)      112.35      (8.4%)   -2.1% ( -15% -   13%)
                           Prefix3      176.77      (4.7%)      173.10      (7.4%)   -2.1% ( -13% -   10%)
                          Wildcard      254.90      (3.2%)      250.81      (3.3%)   -1.6% (  -7% -    5%)
                        AndHighLow      371.35      (2.6%)      366.23      (2.3%)   -1.4% (  -6% -    3%)
                          PKLookup      302.90      (1.7%)      299.45      (1.7%)   -1.1% (  -4% -    2%)
                           Respell      143.44      (3.1%)      143.18      (3.4%)   -0.2% (  -6% -    6%)
                            Fuzzy2       86.16      (2.0%)       88.32      (3.1%)    2.5% (  -2% -    7%)
                   LowSloppyPhrase       67.41      (1.8%)       69.45      (2.9%)    3.0% (  -1% -    7%)
                       LowSpanNear       37.85      (2.6%)       39.38      (3.0%)    4.0% (  -1% -    9%)
                      HighSpanNear       10.19      (2.6%)       10.62      (3.2%)    4.2% (  -1% -   10%)
                           MedTerm      111.19      (1.4%)      117.18      (1.6%)    5.4% (   2% -    8%)
                            Fuzzy1       83.60      (2.5%)       88.65      (2.8%)    6.0% (   0% -   11%)
                        AndHighMed      171.63      (1.4%)      182.81      (2.0%)    6.5% (   3% -   10%)
                       MedSpanNear       64.59      (2.0%)       69.13      (2.1%)    7.0% (   2% -   11%)
                         LowPhrase       57.89      (5.3%)       63.54      (4.5%)    9.8% (   0% -   20%)
                        HighPhrase       37.97     (11.0%)       41.79      (8.3%)   10.1% (  -8% -   32%)
                   MedSloppyPhrase       63.51      (2.0%)       70.31      (3.2%)   10.7% (   5% -   16%)
                           LowTerm      145.85      (1.5%)      169.28      (1.6%)   16.1% (  12% -   19%)
                  HighSloppyPhrase        2.97      (8.4%)        3.47     (12.4%)   16.6% (  -3% -   40%)
                       AndHighHigh       46.49      (1.0%)       54.30      (1.2%)   16.8% (  14% -   19%)
                         MedPhrase      101.99      (4.1%)      128.31      (4.7%)   25.8% (  16% -   36%)
                         OrHighMed       24.97      (1.7%)       35.04      (3.6%)   40.3% (  34% -   46%)
                          HighTerm       26.22      (1.2%)       37.55      (3.6%)   43.2% (  38% -   48%)
                         OrHighLow       24.31      (1.5%)       34.89      (3.8%)   43.5% (  37% -   49%)
                        OrHighHigh       17.72      (1.4%)       26.44      (4.5%)   49.3% (  42% -   55%)
          

          So this is at least good news ... it means that if we can speed up decode there are gains to be had ... but RAM usage is now 105231 KB (hmm, not THAT much larger than 63880 KB ... interesting).

          Michael McCandless added a comment -

          That's right. But even if we see net gains, it doesn't tell us anything about how it will perform on a small set of integers.
          That's why I think this test has nothing to do w/ the Encoder/Decoder.

          Ahh, I see: you're right.

          If this prototype collector is faster, it's not clear how we'd "productize" it. Maybe as multi-valued DV (int[] per doc), which could then use a big packed ints array under the hood or something ...

          Michael McCandless added a comment -

          The above results were for the 1M-doc index; here are the results for the full Wikipedia en (6.6M docs) index:

                              Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                      HighSpanNear        2.91      (2.1%)        2.90      (2.4%)   -0.6% (  -5% -    4%)
                           Prefix3       46.35      (4.0%)       46.07      (3.9%)   -0.6% (  -8% -    7%)
                          PKLookup      240.11      (1.4%)      238.95      (1.9%)   -0.5% (  -3% -    2%)
                          Wildcard       73.79      (2.2%)       73.48      (2.3%)   -0.4% (  -4% -    4%)
                            IntNRQ       18.05      (6.1%)       18.01      (5.9%)   -0.2% ( -11% -   12%)
                           Respell       96.78      (3.1%)       98.09      (3.3%)    1.3% (  -4% -    7%)
                   LowSloppyPhrase       17.63      (4.4%)       17.91      (3.8%)    1.6% (  -6% -   10%)
                        AndHighLow      108.80      (2.8%)      110.58      (4.2%)    1.6% (  -5% -    8%)
                       LowSpanNear        7.53      (4.8%)        7.67      (5.6%)    1.8% (  -8% -   12%)
                  HighSloppyPhrase        0.87     (10.1%)        0.90      (9.6%)    3.2% ( -14% -   25%)
                            Fuzzy2       42.22      (2.5%)       43.90      (2.7%)    4.0% (  -1% -    9%)
                        HighPhrase       15.32      (7.5%)       15.93      (5.4%)    4.0% (  -8% -   18%)
                         LowPhrase       17.09      (4.3%)       18.10      (2.9%)    5.9% (  -1% -   13%)
                        AndHighMed       52.60      (1.4%)       55.90      (2.1%)    6.3% (   2% -    9%)
                       MedSpanNear       20.09      (2.0%)       21.44      (1.8%)    6.7% (   2% -   10%)
                   MedSloppyPhrase       18.69      (3.0%)       20.00      (2.7%)    7.0% (   1% -   13%)
                            Fuzzy1       33.68      (2.0%)       37.26      (2.2%)   10.6% (   6% -   15%)
                         MedPhrase       57.00      (2.9%)       63.56      (3.3%)   11.5% (   5% -   18%)
                           MedTerm       19.22      (1.2%)       21.70      (1.1%)   12.9% (  10% -   15%)
                           LowTerm       41.98      (1.2%)       48.26      (1.8%)   15.0% (  11% -   18%)
                       AndHighHigh       12.09      (1.0%)       13.98      (1.2%)   15.7% (  13% -   18%)
                          HighTerm        7.11      (2.1%)        9.11      (2.0%)   28.1% (  23% -   32%)
                         OrHighMed        6.67      (2.4%)        8.55      (2.1%)   28.2% (  23% -   33%)
                         OrHighLow        6.76      (2.1%)        8.70      (2.3%)   28.6% (  23% -   33%)
                        OrHighHigh        3.84      (2.5%)        5.33      (2.7%)   38.7% (  32% -   45%)
          

          On-disk size of _dv* is 464768 KB and in memory int[] is 669428 KB (44% more).

          Next I'll try NO_PARENTS ord policy...

          Michael McCandless added a comment -

          Ugh! My DV total bytes numbers were too high: luceneutil also indexes the title field as DV. So ignore past byte sizes ... here are the [correct, I hope!] byte sizes for the NO_PARENTS case, full 6.6M Wikipedia en index: DV (index) 151208 KB, int[] (in RAM): 305889 KB. And NO_PARENTS perf (base = trunk, comp = int[] collector):

                              Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                          Wildcard       74.70      (3.3%)       74.32      (1.9%)   -0.5% (  -5% -    4%)
                          PKLookup      245.87      (1.8%)      244.80      (2.0%)   -0.4% (  -4% -    3%)
                        HighPhrase       15.68      (5.7%)       15.72      (6.4%)    0.2% ( -11% -   12%)
                           Respell      111.09      (3.5%)      111.33      (3.7%)    0.2% (  -6% -    7%)
                        AndHighLow       97.90      (1.6%)       98.16      (1.4%)    0.3% (  -2% -    3%)
                       LowSpanNear        7.62      (3.8%)        7.67      (3.5%)    0.7% (  -6% -    8%)
                           Prefix3       45.94      (5.6%)       46.34      (2.7%)    0.9% (  -6% -    9%)
                            IntNRQ       18.04      (8.2%)       18.20      (4.6%)    0.9% ( -11% -   14%)
                   LowSloppyPhrase       17.77      (2.9%)       17.94      (4.8%)    1.0% (  -6% -    8%)
                            Fuzzy2       41.36      (2.4%)       42.68      (2.3%)    3.2% (  -1% -    8%)
                         LowPhrase       16.94      (2.4%)       17.65      (3.5%)    4.1% (  -1% -   10%)
                      HighSpanNear        2.98      (2.8%)        3.14      (2.1%)    5.3% (   0% -   10%)
                        AndHighMed       49.18      (1.0%)       51.97      (0.7%)    5.7% (   3% -    7%)
                  HighSloppyPhrase        0.90      (6.7%)        0.97     (12.6%)    6.8% ( -11% -   27%)
                   MedSloppyPhrase       18.54      (1.8%)       19.91      (3.0%)    7.4% (   2% -   12%)
                       MedSpanNear       19.86      (1.6%)       21.36      (2.0%)    7.5% (   3% -   11%)
                         MedPhrase       55.57      (2.2%)       60.31      (2.3%)    8.5% (   3% -   13%)
                            Fuzzy1       33.38      (1.4%)       37.19      (1.9%)   11.4% (   8% -   14%)
                       AndHighHigh       12.58      (1.2%)       14.66      (0.9%)   16.6% (  14% -   18%)
                           LowTerm       40.41      (1.2%)       47.14      (1.4%)   16.6% (  13% -   19%)
                           MedTerm       23.00      (1.4%)       27.14      (3.0%)   18.0% (  13% -   22%)
                         OrHighMed        7.50      (2.2%)       10.16      (2.3%)   35.6% (  30% -   40%)
                         OrHighLow        7.55      (2.0%)       10.30      (2.8%)   36.3% (  30% -   41%)
                          HighTerm        7.92      (1.9%)       10.98      (2.8%)   38.6% (  33% -   44%)
                        OrHighHigh        4.30      (2.7%)        6.39      (3.0%)   48.6% (  41% -   55%)
          
          Adrien Grand added a comment -

          I tried to hack up a patch to make packed ints behave better for this use case. I made byte[] -> int[] encoding and decoding byte-aligned (it was still long-aligned) and wrote a quick class which can decode an arbitrary number of packed ints at any offset. I tried to run the benchmark, but it looks like it didn't use this CountingFacetCollector (I added doFacets=True to my localrun.py); do I need to modify some code in the perf package?
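
          For illustration only, here is a minimal sketch of reading one value at an arbitrary index from a byte-aligned packed byte[]. The class name, method signature and MSB-first bit layout are assumptions made for this sketch; it is not the class from the patch.

            // Hypothetical sketch, not the patch code: read the index-th value from a
            // byte[] of byte-aligned packed ints, assuming MSB-first packing and
            // bitsPerValue <= 32.
            class PackedByteArrayReader {
              static int get(byte[] blob, int byteOffset, int bitsPerValue, int index) {
                long bitPos = (long) index * bitsPerValue;
                int bytePos = byteOffset + (int) (bitPos >>> 3);
                int bitInByte = (int) (bitPos & 7);
                // Gather enough bytes to cover the value (at most 5 bytes for 32 bits).
                int needed = (bitInByte + bitsPerValue + 7) >>> 3;
                long bits = 0;
                for (int b = 0; b < needed; b++) {
                  bits = (bits << 8) | (blob[bytePos + b] & 0xFFL);
                }
                int trailing = needed * 8 - bitInByte - bitsPerValue;
                return (int) ((bits >>> trailing) & ((1L << bitsPerValue) - 1));
              }
            }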

          Michael McCandless added a comment -

          Ooh that patch looks good! Thanks Adrien. I'll test it.

          Did you add doFacets=True to both the Index and the Competitor? Also, you need to use a tasks file that adds +dateFacets or +allFacets to each task ... and if you want to use +allFacets you need to pull down the latest line file docs that have the added facet dimensions ...

          Michael McCandless added a comment -

          I had to change the PackedBytes.get to take a [reused] IntsRef in, else I was hitting thread-safety issues (AIOOBE)...
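
          The reuse pattern looks roughly like the sketch below; the interface and method names here are hypothetical, the actual change in the patch is to PackedBytes.get.

            import org.apache.lucene.util.IntsRef;

            // Hypothetical sketch of the reuse pattern: instead of returning a shared
            // internal buffer (not safe across searcher threads), the caller passes in
            // an IntsRef that is filled; each thread owns its own instance.
            interface OrdinalSource {
              void get(int docID, IntsRef reuse);
            }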

          base = trunk, comp = patch, index = full 6.6M Wikipedia English with 9 facet dims counted:

                              Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                          HighTerm        7.30      (1.5%)        7.60      (1.6%)    4.1% (   0% -    7%)
                           MedTerm       16.22      (0.7%)       17.25      (0.8%)    6.4% (   4% -    7%)
                           LowTerm       37.87      (0.8%)       41.08      (0.4%)    8.5% (   7% -    9%)
          

          So we finally have something faster than dGap(vInt)!

          But PackedInts takes ~2X the storage (base is 151208 KB, by summing DV on disk; comp is 305889 KB by measuring RAM of the PackedInts structures).

          Adrien Grand added a comment -

          Also, you need to use a tasks file that adds +dateFacets or +allFacets to each task ... and if you want to use +allFacets you need to pull down the latest line file docs that has the added facet dimensions ...

          OK, this is what I was missing.

          I had to change the PackedBytes.get to take a [reused] IntsRef in, else I was hitting thread-safety issues (AIOOBE)...

          Oops, I didn't know it would be called from multiple threads.

          So we finally have something faster than dGap(vInt)!

          Good news! However, the patch has a nocommit because it uses a byte[] to store data, so it cannot grow beyond 2 GB. I hope that the paging won't make it too much slower. (But maybe it could help reduce memory usage, if each page can have a different number of bits per value?)

          I think I'll open a separate issue to make encoding to byte[] and decoding from byte[] byte-aligned (it is long-aligned today). I ran a benchmark with the Lucene41 PF and the deltas were small (probably noise).

          Shai Erera added a comment -

          Thanks Adrien. From what I can tell, you implemented a PackedInts version of what used to be CategoryListCache (per-document ordinals). The patch has many changes to code that I think is unrelated to facets, but I get the basic idea. So if we had a CategoryListCache interface, we'd have several implementations so far: StraightIntsCache (Mike's int[] version with offsets), PackedIntsCache (regular PackedInts) and AdriensEfficientPackedIntsCache (your version). Right?

          Also, this issue started in order to explore an alternative encoder/decoder for per-document category ordinals. Is it ok to conclude that none seemed more efficient than dgap+vint?

          Regarding caching, it seems that the Straight impl beats everything so far, even packed ints? I mean, it achieved 50% gains for some queries, and comparing HighTerm of the two versions, straight achieves 38% while packed achieves only 4%. Yet both the straight and packed versions consume exactly the same amount of RAM, so there's no real tradeoff here. It doesn't look like there's any advantage to using the packed version?

          Gilad Barkai added a comment - - edited

          Finally figured out I was doing things completely wrong... instead of super smart optimizing code for a semi-packed encoder, there's now a straightforward semi-packed encoder:
          Values smaller than 256 use only one byte (the value itself), and larger values are encoded as a VInt plus a leading zero byte. Worst case this is 6 bytes per value (zero marker + up to 5 VInt bytes).

          The idea is to pay the variable-length-encoding penalty only for large values, which should be less common in a sort-uniq-dgap scenario.
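
          The sketch below is illustrative only: hypothetical class and method names, and it assumes a zero byte can serve as the escape marker (dgaps after sort+unique are >= 1). It is not the attached SemiPackedEncoder/DGapSemiPackedEncoder patch.

            class SemiPackedSketch {

              // 1..255: a single byte holding the value; otherwise a zero escape byte
              // followed by a plain VInt.
              static void encode(int v, java.io.ByteArrayOutputStream out) {
                if (v > 0 && v < 256) {
                  out.write(v);
                } else {
                  out.write(0);                 // escape marker
                  while ((v & ~0x7F) != 0) {    // standard VInt loop, 7 bits per byte
                    out.write((v & 0x7F) | 0x80);
                    v >>>= 7;
                  }
                  out.write(v);
                }
              }

              // Decode one value starting at pos[0]; advances pos[0] past the value.
              static int decode(byte[] buf, int[] pos) {
                int b = buf[pos[0]++] & 0xFF;
                if (b != 0) {
                  return b;                     // small value, stored as-is
                }
                int v = 0, shift = 0;
                while (true) {                  // VInt after the escape byte
                  b = buf[pos[0]++] & 0xFF;
                  v |= (b & 0x7F) << shift;
                  if ((b & 0x80) == 0) {
                    return v;
                  }
                  shift += 7;
                }
              }
            }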

          Wrote two versions, with and without dgap specialization, though I'm not sure how useful the non-specialized code is.

          I do not currently have the means to run luceneutil (nor the Wikipedia index with the categories), but I ran the EncodingSpeed test and was surprised.
          While the encoding is a little worse than (or on par with) dgap-vint, the decoding speed is significantly faster. The new encoder is the only (?!) encoder to beat SimpleIntEncoder (which writes plain 32 bits per value) in decoding time.

          The values used in EncodingSpeed come from a real scenario, but I'm not sure how well they represent a "common" case (e.g. Wikipedia).

          Mike - could you please try this encoder? I guess it only makes sense to run the specialized DGapSemiPackedEncoder.
          Also, I'm not sure SimpleIntEncoder was ever used (without any sorting or unique). It would be interesting to test it as well. We would pay in more I/O and a much larger file size (~4 times larger), but that doesn't mean it will be any slower.

          Here are the results of the EncodingSpeed test:

          Estimating ~100000000 Integers compression time by
          Encoding/decoding facets' ID payload of docID = 3630 (unsorted, length of: 2430) 41152 times.
          
          Encoder                                                        Bits/Int          Encode Time                Encode Time          Decode Time                Decode Time
                                                                                        [milliseconds]        [microsecond / int]       [milliseconds]        [microsecond / int]
          -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
          Simple                                                          32.0000                  190                     1.9000                  165                     1.6500
          VInt8                                                           18.4955                  436                     4.3600                  359                     3.5900
          Sorting(Unique(VInt8))                                          18.4955                 3557                    35.5702                  314                     3.1400
          Sorting(Unique(DGap(VInt8)))                                     8.5597                 3485                    34.8502                  270                     2.7000
          Sorting(Unique(DGapVInt8))                                       8.5597                 3434                    34.3402                  192                     1.9200
          Sorting(Unique(DGap(SemiPacked)))                                8.6453                 3386                    33.8602                  156                     1.5600
          Sorting(Unique(DGapSemiPacked))                                  8.6453                 3397                    33.9702                   99                     0.9900
          Sorting(Unique(DGap(EightFlags(VInt))))                          4.9679                 4002                    40.0203                  381                     3.8100
          Sorting(Unique(DGap(FourFlags(VInt))))                           4.8198                 3972                    39.7203                  399                     3.9900
          Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt)))))                4.5794                 4448                    44.4803                  645                     6.4500
          Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt)))))                4.5794                 4461                    44.6103                  641                     6.4100
          
          
          Estimating ~100000000 Integers compression time by
          Encoding/decoding facets' ID payload of docID = 9910 (unsorted, length of: 1489) 67159 times.
          
          Encoder                                                        Bits/Int          Encode Time                Encode Time          Decode Time                Decode Time
                                                                                        [milliseconds]        [microsecond / int]       [milliseconds]        [microsecond / int]
          -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
          Simple                                                          32.0000                  169                     1.6900                  163                     1.6300
          VInt8                                                           18.2673                  421                     4.2100                  374                     3.7400
          Sorting(Unique(VInt8))                                          18.2673                 2230                    22.3001                  337                     3.3700
          Sorting(Unique(DGap(VInt8)))                                     8.9456                 2257                    22.5701                  273                     2.7300
          Sorting(Unique(DGapVInt8))                                       8.9456                 2214                    22.1401                  192                     1.9200
          Sorting(Unique(DGap(SemiPacked)))                                9.2787                 2162                    21.6201                  180                     1.8000
          Sorting(Unique(DGapSemiPacked))                                  9.2787                 2148                    21.4801                  120                     1.2000
          Sorting(Unique(DGap(EightFlags(VInt))))                          5.7542                 2937                    29.3701                  395                     3.9500
          Sorting(Unique(DGap(FourFlags(VInt))))                           5.5447                 2768                    27.6801                  407                     4.0700
          Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt)))))                5.3566                 3294                    32.9401                  651                     6.5100
          Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt)))))                5.3996                 3318                    33.1801                  662                     6.6200
          
          
          Estimating ~100000000 Integers compression time by
          Encoding/decoding facets' ID payload of docID = 10000 (unsorted, length of: 18) 5555555 times.
          
          Encoder                                                        Bits/Int          Encode Time                Encode Time          Decode Time                Decode Time
                                                                                        [milliseconds]        [microsecond / int]       [milliseconds]        [microsecond / int]
          -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
          Simple                                                          32.0000                  316                     3.1600                  318                     3.1800
          VInt8                                                           20.8889                  569                     5.6900                  496                     4.9600
          Sorting(Unique(VInt8))                                          20.8889                 1196                    11.9600                  488                     4.8800
          Sorting(Unique(DGap(VInt8)))                                    12.0000                 1151                    11.5100                  481                     4.8100
          Sorting(Unique(DGapVInt8))                                      12.0000                 1100                    11.0000                  352                     3.5200
          Sorting(Unique(DGap(SemiPacked)))                               14.6667                 1107                    11.0700                  507                     5.0700
          Sorting(Unique(DGapSemiPacked))                                 14.6667                 1037                    10.3700                  439                     4.3900
          Sorting(Unique(DGap(EightFlags(VInt))))                         10.2222                 1315                    13.1500                  656                     6.5600
          Sorting(Unique(DGap(FourFlags(VInt))))                          10.2222                 1408                    14.0800                  675                     6.7500
          Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt)))))                9.7778                 1617                    16.1700                  990                     9.9000
          Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt)))))               10.2222                 1708                    17.0800                  992                     9.9200
          
          
          Estimating ~100000000 Integers compression time by
          Encoding/decoding facets' ID payload of docID = 501871 (unsorted, length of: 957) 104493 times.
          
          Encoder                                                        Bits/Int          Encode Time                Encode Time          Decode Time                Decode Time
                                                                                        [milliseconds]        [microsecond / int]       [milliseconds]        [microsecond / int]
          -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
          Simple                                                          32.0000                  181                     1.8100                  168                     1.6800
          VInt8                                                           16.5768                  428                     4.2800                  351                     3.5100
          Sorting(Unique(VInt8))                                          16.5768                 1586                    15.8600                  330                     3.3000
          Sorting(Unique(DGap(VInt8)))                                     8.4848                 1477                    14.7700                  267                     2.6700
          Sorting(Unique(DGapVInt8))                                       8.4848                 1435                    14.3500                  193                     1.9300
          Sorting(Unique(DGap(SemiPacked)))                                8.6771                 1424                    14.2400                  159                     1.5900
          Sorting(Unique(DGapSemiPacked))                                  8.6771                 1402                    14.0200                  100                     1.0000
          Sorting(Unique(DGap(EightFlags(VInt))))                          4.4138                 1603                    16.0300                  376                     3.7600
          Sorting(Unique(DGap(FourFlags(VInt))))                           4.1797                 1700                    17.0000                  382                     3.8200
          Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt)))))                3.8955                 2011                    20.1100                  602                     6.0200
          Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt)))))                3.8871                 2024                    20.2400                  594                     5.9400
          
          
          

          I have another version which uses two bytes before using the variable length (e.g values smaller than 0x10000 are encoded as is in two bytes, other values are encoded as two leading zero bytes and the VInt) - but I did not optimize it yet, nor I'm sure it's very useful.

          Shai Erera added a comment -

          That will be interesting to test. In order to test it "fairly", we should either test the decoder through the abstracted code (i.e. via CategoryListIterator), which is what we usually test, or, Gilad, if you can, copy CountingFacetsCollector and inline the new decoder code in place of the dgap+vint code. The latter will be simpler to test, with the least noise.
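
          To make that concrete, a rough sketch of what an inlined dgap+VInt decode-and-count loop could look like inside a collector follows; the class, method and buffer handling are hypothetical, not the actual CountingFacetsCollector code.

            // Hypothetical sketch: decode dgap+VInt ordinals straight out of the
            // per-document byte[] and increment the counts, with no intermediate buffer.
            class CountingSketch {
              static void countDoc(byte[] buf, int offset, int length, int[] counts) {
                int ord = 0;
                int pos = offset;
                final int upto = offset + length;
                while (pos < upto) {
                  int b = buf[pos++];
                  int value = b & 0x7F;
                  int shift = 7;
                  while ((b & 0x80) != 0) {     // continuation bit set: more VInt bytes
                    b = buf[pos++];
                    value |= (b & 0x7F) << shift;
                    shift += 7;
                  }
                  ord += value;                 // dgap: each value is a delta from the previous ordinal
                  counts[ord]++;
                }
              }
            }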

          I'm not sure SimpleIntEncoder was ever used

          Mike and I tested it ... at some point. I don't remember where we posted the results, though, whether it was in email, GTalk or some issue. But I do remember that the results were worse than DGapVInt. We were always surprised by how fast DGapVInt is ... all along we thought VInt is expensive, but it may not be so expensive ... at least not on the Wikipedia collection.

          the decoding speed is significantly faster

          That's good, but Mike and I have already concluded that EncodingSpeed just ... lies. It's a micro-benchmark, and while it showed significant improvements after I moved the encoders to the bulk API, in the real-world scenario it performed worse. I had to inline stuff and specialize it even more for it to beat the previous way things worked.

          I will be glad if SemiPacked is faster ... but judging from past experience, I don't get my hopes up too high.

          As for this encoding algorithm, it all depends on how many values actually fall into the under-256 range. That's another problem w/ EncodingSpeed - it uses a real-world scenario from a crazy application which encoded 2430 ordinals for a single document! You can see that the encoded values are small, e.g. by looking at the NOnes bits/int. I suspect that in real life there won't be many values that fall into that range, at least after some documents have been indexed, because when you have a single category per dimension in a document, there's not much chance that their values will be "close".

          But ... we should let luceneutil be the judge of that. So Gilad, can you make a patch with a SemiPackedCountingCollector? And also modify the default that FacetCollector.create returns, so that it's easy to compare base (CountingFC) to comp (SemiPackedCFC). If you want to test the collector, then run TestDemoFacets (as-is) and CountingFCTest (modify the collector, though) to make sure the collector works.

          Shai Erera added a comment -

          A few comments about the patch:

          • Instead of iterating from 0 to length and doing values.ints[values.offset + i], you can compute 'upto' once and iterate from values.offset to upto? (See the sketch after this list.)
          • Is if ((v & 0xFF) == 0) better than if (v <= 256)? We could test, of course ...
          • I think that doing v - 256 is a good idea? You anyway do dgap, so what's another '-' (and '+' in the decoder)?
          • Maybe add a test to EncodingTest? Just to be sure it passes the random testing?
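
          A tiny illustration of the first bullet (hypothetical surrounding method and class, not the patch code):

            import org.apache.lucene.util.IntsRef;

            class UptoLoopSketch {
              // Hoist the bound once instead of adding values.offset on every iteration.
              static void encodeAll(IntsRef values) {
                final int upto = values.offset + values.length;
                for (int i = values.offset; i < upto; i++) {
                  final int value = values.ints[i];
                  // ... write 'value' to the output here ...
                }
              }
            }
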
          Michael McCandless added a comment -

          OK, the new format doesn't do very well. This is all of Wikipedia (6.6M "big" docs), 7 facet dims:

                              Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                           MedTerm       46.85      (2.4%)       28.22      (0.7%)  -39.8% ( -41% -  -37%)
                          HighTerm       19.09      (2.5%)       12.27      (0.9%)  -35.7% ( -38% -  -33%)
                         OrHighLow       16.83      (2.8%)       11.21      (1.0%)  -33.4% ( -36% -  -30%)
                         OrHighMed       16.35      (2.8%)       11.00      (1.0%)  -32.7% ( -35% -  -29%)
                           Prefix3       12.87      (2.8%)        8.81      (0.9%)  -31.5% ( -34% -  -28%)
                          Wildcard       27.22      (2.2%)       18.68      (0.7%)  -31.4% ( -33% -  -29%)
                           LowTerm      110.58      (1.8%)       79.25      (0.6%)  -28.3% ( -30% -  -26%)
                        OrHighHigh        8.61      (2.9%)        6.19      (1.3%)  -28.1% ( -31% -  -24%)
                            IntNRQ        3.54      (2.9%)        2.55      (1.2%)  -27.9% ( -31% -  -24%)
                       AndHighHigh       23.19      (1.4%)       17.67      (0.7%)  -23.8% ( -25% -  -22%)
                            Fuzzy1       46.94      (1.7%)       40.34      (1.6%)  -14.1% ( -17% -  -10%)
                         MedPhrase      110.00      (5.6%)       98.08      (4.2%)  -10.8% ( -19% -   -1%)
                   MedSloppyPhrase       25.93      (2.5%)       23.37      (1.6%)   -9.9% ( -13% -   -5%)
                       MedSpanNear       28.43      (2.5%)       25.68      (1.2%)   -9.7% ( -13% -   -6%)
                        AndHighMed      105.06      (0.9%)       95.74      (1.0%)   -8.9% ( -10% -   -7%)
                         LowPhrase       21.26      (6.2%)       19.86      (5.3%)   -6.6% ( -16% -    5%)
                      HighSpanNear        3.53      (2.0%)        3.30      (1.2%)   -6.5% (  -9% -   -3%)
                            Fuzzy2       52.61      (2.6%)       49.64      (2.5%)   -5.6% ( -10% -    0%)
                        HighPhrase       17.44     (10.2%)       16.66      (9.5%)   -4.5% ( -21% -   16%)
                  HighSloppyPhrase        0.92      (7.3%)        0.88      (5.7%)   -4.5% ( -16% -    9%)
                   LowSloppyPhrase       20.28      (3.1%)       19.59      (2.0%)   -3.4% (  -8% -    1%)
                           Respell       46.30      (3.2%)       45.27      (3.4%)   -2.2% (  -8% -    4%)
                       LowSpanNear        8.36      (2.8%)        8.20      (1.9%)   -1.9% (  -6% -    2%)
                        AndHighLow      578.66      (3.0%)      569.71      (3.1%)   -1.5% (  -7% -    4%)
          

          Also, it's quite a bit more RAM / disk consuming: 306 MB of .dvm/d files on disk vs 178 MB for trunk (and remember that part of this is the title SortedDV field).

          Shai Erera added a comment -

          Well ... this encoding may be better if you really had many more values in the range 128-256 (which w/ vint are encoded w/ 2 bytes). But that seems arbitrary ... why those values?

          I am not sure if we should keep this encoder, b/c it consumes more space and even in EncodingSpeed, which indexes categories that are very close to each other, vint achieved slightly better compression.

          I am not sure anymore that, for an arbitrary set of integers, we can do better than VInt (well, just b/c of all the attempts we've made).

          Shai Erera added a comment -

          Closing as "Not A Problem". All the attempts didn't show that there's a better encoding than VInt for categories, because their indexing nature (including range of values) is unexpected. VInt is fast and achieves good compression. We can re-open it in the future if we want.


             People

             • Assignee: Unassigned
             • Reporter: Shai Erera
             • Votes: 0
             • Watchers: 3
