Lucene - Core
LUCENE-4225

New FixedPostingsFormat for less overhead than SepPostingsFormat

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      I've worked out the start of a new postings format that should have
      less overhead for fixed-int[] encoders (For, PFor), using ideas from
      the old bulk branch and new ideas from Robert.

      It's only a start: there's no payloads support yet, and I haven't run
      Lucene's tests with it, except for one new test I added that tries to
      be a thorough PostingsFormat tester (to make it easier to create new
      postings formats). It does pass luceneutil's performance test, so
      it's at least able to run those queries correctly...

      Like Lucene40, it uses two files (though once we add payloads it may
      be three). The .doc file interleaves doc-delta and freq blocks, and
      .pos has position-delta blocks. Unlike Sep, blocks are NOT shared
      across terms; instead, it uses block encoding when there are enough
      ints to encode, and otherwise falls back to the same vInt format as
      Lucene40. This means low-freq terms (docFreq < 128, the current
      default block size) are always vInts, while high-freq terms get some
      number of blocks followed by a vInt final block.

      Skip points are only recorded at block starts.
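
      In sketch form, the block-vs-vInt rule above might look like the
      following (this is just the shape of the decision, not the patch
      code; writeForBlock/writeVInt are hypothetical stand-ins for the
      real For and vInt encoders):

      import java.io.IOException;

      class BlockOrVIntWriter {
        static final int BLOCK_SIZE = 128; // current default block size

        private final int[] buffer = new int[BLOCK_SIZE];
        private int upto;

        void addDocDelta(int delta) throws IOException {
          buffer[upto++] = delta;
          if (upto == BLOCK_SIZE) {   // enough ints buffered: emit a packed block
            writeForBlock(buffer);
            upto = 0;
          }
        }

        void finishTerm() throws IOException {
          // The tail (docFreq % 128 ints) never fills a block, so it is written
          // as vInts; a term with docFreq < 128 therefore has no blocks at all.
          for (int i = 0; i < upto; i++) {
            writeVInt(buffer[i]);
          }
          upto = 0;
        }

        private void writeForBlock(int[] block) throws IOException { /* For-encode 128 ints */ }
        private void writeVInt(int v) throws IOException { /* Lucene40-style vInt */ }
      }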

      1. LUCENE-4225.patch
        126 kB
        Michael McCandless
      2. LUCENE-4225.patch
        91 kB
        Michael McCandless
      3. LUCENE-4225.patch
        90 kB
        Michael McCandless
      4. LUCENE-4225-on-rev-1362013.patch
        93 kB
        Han Jiang
      5. LUCENE-4225.patch
        95 kB
        Michael McCandless

          Activity

          Michael McCandless added a comment -

          Initial patch, lots of nocommits ...

          Patch is against the LUCENE-3892 branch.

          Michael McCandless added a comment -

          Initial results are compelling! On the 10M doc Wikipedia test,
          Sep(For) vs Fixed(For):

                          Task    QPS base StdDev base     QPS for  StdDev for      Pct diff
                        IntNRQ        8.40        0.83        8.33        0.38  -13% -   15%
                   TermGroup1M       46.67        1.51       48.95        0.21    1% -    8%
                  TermBGroup1M       79.97        1.96       85.05        0.52    3% -    9%
                       Prefix3       68.82        2.62       73.96        2.27    0% -   15%
                        Fuzzy2       69.54        2.69       75.55        2.29    1% -   16%
                TermBGroup1M1P       42.67        1.07       46.38        0.86    4% -   13%
                        Fuzzy1       85.07        3.34       93.16        2.20    2% -   16%
                       Respell       67.30        2.20       74.69        3.87    1% -   20%
                          Term      156.81        8.62      180.38        6.83    4% -   26%
                      Wildcard       42.55        1.13       50.97        0.87   14% -   25%
                    OrHighHigh        8.66        0.77       10.46        0.59    4% -   40%
                     OrHighMed       15.62        1.54       18.93        1.05    4% -   41%
                    AndHighMed       45.80        1.69       57.18        0.80   18% -   31%
                      SpanNear        7.59        0.32        9.95        0.14   23% -   38%
                   AndHighHigh       11.09        0.32       14.68        0.15   27% -   37%
                      PKLookup      143.83        2.80      195.40        4.13   30% -   41%
                        Phrase       15.53        1.15       21.34        0.18   26% -   49%
                  SloppyPhrase        5.94        0.49        8.74        0.24   32% -   64%
          

          And Fixed(For) vs Lucene40:

                          Task    QPS base StdDev base     QPS for  StdDev for      Pct diff
                    AndHighMed       60.07        1.69       44.20        1.17  -30% -  -22%
                        Phrase       11.97        0.60        9.61        0.20  -25% -  -13%
                        IntNRQ        9.77        0.46        8.93        0.38  -16% -    0%
                        Fuzzy2       49.08        1.33       48.72        1.08   -5% -    4%
                       Respell       61.33        1.52       60.90        1.41   -5% -    4%
                      SpanNear        7.72        0.20        7.74        0.07   -3% -    3%
                      PKLookup      194.64        3.03      197.83        3.27   -1% -    4%
                  SloppyPhrase        4.76        0.19        4.93        0.11   -2% -   10%
                        Fuzzy1       63.49        1.07       66.57        1.53    0% -    9%
                   TermGroup1M       53.91        1.40       58.24        1.27    3% -   13%
                       Prefix3       61.02        1.72       66.14        2.11    2% -   15%
                      Wildcard       51.27        1.40       56.26        1.78    3% -   16%
                TermBGroup1M1P       29.65        0.98       32.77        0.79    4% -   17%
                  TermBGroup1M       34.37        1.16       38.07        1.14    3% -   18%
                          Term       24.98        1.32       28.13        3.31   -5% -   32%
                   AndHighHigh       17.08        0.69       19.42        0.52    6% -   21%
                    OrHighHigh       10.68        0.40       12.52        0.94    4% -   30%
                     OrHighMed       13.66        0.52       16.65        1.34    7% -   36%
          

          So we are still slower than Lucene40 in some cases, but a lot closer
          than with Sep!

          But these are early results ... and the PF doesn't pass tests yet ... so!

          Robert Muir added a comment -

           Looks good Mike. I think the slower cases are all explained: the skip interval is crazy, and lazy-loading the freq blocks should fix IntNRQ. (Though I don't know how you get away with AndHighHigh currently.)

           Still, the second benchmark could be confusing: we are mixing concerns, benchmarking For vs vInt as well as two different index layouts.
           Maybe we can benchmark this layout with BulkVInt vs Lucene40 to get a better idea of how just the index layout is doing?

           I like how clean it is without the payloads crap: I still think we probably need to know up-front whether the consumer is going to consume a payload off the enum for positional queries; without that, things like this get really hairy and messy.

           Do you think it's worth it that even for "big terms" we write the last partial block as vInts the way we do?
           Since these terms are going to be biggish anyway (at least enough to fill a block), this seems not worth the trouble?

           Instead, if we only did this for low-freq terms, the code might even be clearer/faster, but I guess the downside would be
           not being able to reuse these enums as much, which would hurt e.g. NIOFSDirectory?

          Thanks for bringing all this back to life... and the new test looks awesome! I think it will really make our lives a lot easier...

          Robert Muir added a comment -

           By the way: I also really like how clean the code is. Let's see if we can keep it that way; it's really nice!

           We should seriously weigh any little optimizations we can do against keeping that cleanliness.

          Robert Muir added a comment -

           Some more ideas for payloads:

           I don't like how, in the payloads case, we double every position to record whether one is there, and we also shouldn't
           need a separate condition to indicate whether the length changed. In practice it's typically "all or none": the analysis
           process marks a payload like POS or it doesn't, and uses a fixed length across the whole term or not. So I don't think we
           should waste time on this for block encoders, nor should we put it in the skip data. I think we should just do something
           simpler: if payloads are present, we write a block of lengths, where 0 means there is no payload at that position. If all
           the payloads for the entire term have the same length, mark that length in the term dictionary and omit the length blocks.

           We could consider the same approach for offset lengths.
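
           A rough sketch of that length-block idea, with invented names (not
           anything in the patch): buffer one length per position, and at
           finishTerm either record the single shared length in the term
           dictionary or fall back to writing the length blocks:

           import java.util.ArrayList;
           import java.util.List;

           class PayloadLengths {
             private final List<Integer> lengths = new ArrayList<>();

             void addPosition(byte[] payload) {
               // 0 means "no payload at this position"
               lengths.add(payload == null ? 0 : payload.length);
             }

             // Returns the single shared length if every position agrees
             // (record it in the term dictionary and omit the length blocks),
             // else -1 (write the buffered lengths as ordinary blocks).
             int sharedLengthOrMinusOne() {
               int shared = -1;
               for (int len : lengths) {
                 if (shared == -1) {
                   shared = len;
                 } else if (len != shared) {
                   return -1;
                 }
               }
               return shared;
             }
           }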

          Michael McCandless added a comment -

           I think the slower cases are all explained: the skip interval is crazy, and lazy-loading the freq blocks should fix IntNRQ. (Though I don't know how you get away with AndHighHigh currently.)

           Maybe AndHighHigh isn't doing much actual skipping... i.e. the distance
           between each doc is probably around the blockSize?

           I wonder how much skipping even AndMedHigh queries are really
           doing... but I agree we need a smaller skipInterval, since our
           "base" skipInterval is so high.

           And we should try a smaller block size...

           Still, the second benchmark could be confusing: we are mixing concerns, benchmarking For vs vInt as well as two different index layouts.
           Maybe we can benchmark this layout with BulkVInt vs Lucene40 to get a better idea of how just the index layout is doing?

           Oh yeah! OK, I cut over BulkVInt to the fixed postings format and
           compared it (base) to For:

                          Task    QPS base StdDev base     QPS for  StdDev for      Pct diff
                  SloppyPhrase        6.90        0.18        6.88        0.17   -5% -    4%
                      PKLookup      196.92        4.41      197.38        4.55   -4% -    4%
                       Respell       65.25        2.09       65.55        0.80   -3% -    5%
                   TermGroup1M       39.07        0.78       39.34        0.94   -3% -    5%
                      SpanNear        5.42        0.14        5.48        0.12   -3% -    6%
                  TermBGroup1M       44.91        0.44       45.45        0.51    0% -    3%
                TermBGroup1M1P       40.42        0.68       40.95        0.76   -2% -    4%
                        Fuzzy2       63.85        1.14       65.01        0.66    0% -    4%
                        Phrase       10.23        0.27       10.46        0.33   -3% -    8%
                        Fuzzy1       61.89        1.06       63.60        0.61    0% -    5%
                        IntNRQ        8.77        0.23        9.02        0.36   -3% -    9%
                      Wildcard       29.22        0.40       30.18        0.84    0% -    7%
                   AndHighHigh        9.13        0.15        9.49        0.18    0% -    7%
                          Term      126.40        0.41      132.48        5.62    0% -    9%
                       Prefix3       30.54        0.69       32.21        1.06    0% -   11%
                    OrHighHigh        8.69        0.38        9.21        0.37   -2% -   15%
                     OrHighMed       28.00        1.15       29.67        1.05   -1% -   14%
                    AndHighMed       32.28        0.67       34.29        0.56    2% -   10%
          

          Looks like some small gain over BulkVInt but not much...

           I like how clean it is without the payloads crap: I still think we probably need to know up-front whether the consumer is going to consume a payload off the enum for positional queries; without that, things like this get really hairy and messy.

           I agree! Not looking forward to getting payloads working...

           Do you think it's worth it that even for "big terms" we write the last partial block as vInts the way we do?
           Since these terms are going to be biggish anyway (at least enough to fill a block), this seems not worth the trouble?

           We could try just leaving partial blocks at the end ... that made me
           nervous: I think there are a lot of terms in the 128-256 docFreq
           range! But we should try it.

           Instead, if we only did this for low-freq terms, the code might even be clearer/faster, but I guess the downside would be
           not being able to reuse these enums as much, which would hurt e.g. NIOFSDirectory?

           Hmm, true. We'd need to pair up low- and high-freq enums? (Like Pulsing.)
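
           A hypothetical sketch of that pairing (all names invented, nothing
           from the patch): one reusable wrapper owns both enum flavors and
           dispatches per term on docFreq, so the enum object can still be
           reused across terms:

           class PairedDocsEnumSketch {
             interface SubEnum {
               void reset(long postingsFP, int docFreq);
             }

             private final SubEnum lowFreq;  // decodes pure-vInt postings (docFreq < 128)
             private final SubEnum highFreq; // decodes For blocks plus the vInt tail

             PairedDocsEnumSketch(SubEnum lowFreq, SubEnum highFreq) {
               this.lowFreq = lowFreq;
               this.highFreq = highFreq;
             }

             // Dispatch per term; the wrapper and both sub-enums stay reusable,
             // which is what matters for e.g. NIOFSDirectory buffer reuse.
             SubEnum reset(long postingsFP, int docFreq) {
               SubEnum sub = docFreq < 128 ? lowFreq : highFreq;
               sub.reset(postingsFP, docFreq);
               return sub;
             }
           }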

          Thanks for bringing all this back to life... and the new test looks awesome! I think it will really make our lives a lot easier...

          I really want this test to be thorough, so that if it passes on your
          new PF, all other tests should too! I know that's overly ambitious
          ... but when it misses something we should go back and add it.
          Because debugging a PF bug when you're in a deep scary stack trace
          involving Span*Query is a slow process ... it's too hard to make a new
          PF now.

           I don't like how, in the payloads case, we double every position to record whether one is there, and we also shouldn't
           need a separate condition to indicate whether the length changed. In practice it's typically "all or none": the analysis
           process marks a payload like POS or it doesn't, and uses a fixed length across the whole term or not. So I don't think we
           should waste time on this for block encoders, nor should we put it in the skip data. I think we should just do something
           simpler: if payloads are present, we write a block of lengths, where 0 means there is no payload at that position. If all
           the payloads for the entire term have the same length, mark that length in the term dictionary and omit the length blocks.

           We could consider the same approach for offset lengths.

          That sounds good!

          Robert Muir added a comment -

          Looks like some small gain over BulkVInt but not much...

           Well, that's good enough to see which one is faster. Now let's nuke the abstraction and just
           make this FORPostingsFormat that uses PackedInts?

           I think it's finally clear that it's the abstractions here in the codec, not in the search API, that
           are slowing down bulk decompression.

          Han Jiang added a comment -

           The initial patch doesn't pass compilation on the branch code after svn up -r 1362013.

           So I made some changes to get ant compile passing; however, some tests still fail: http://pastebin.com/jdFecZm5

          Han Jiang added a comment -

           Oh... OK, those failures are all related to payloads.

          Michael McCandless added a comment -

           Whoops, thanks Billy, I'll merge that with my patch.

           Sorry, test failures are expected until I get payloads working ... I'll do that next, but it will take some time. Payloads always get tricky...

          Michael McCandless added a comment -

           Well, that's good enough to see which one is faster. Now let's nuke the abstraction and just
           make this FORPostingsFormat that uses PackedInts?

          OK I'll do that!

          Michael McCandless added a comment -

          New patch, moving fixed -> block and hardwiring it to For int block encoding.

          Still need to do payloads...

          Michael McCandless added a comment -

          New patch, adding Block PF to META-INF services, and fixing a bug in skipping.

          Han Jiang added a comment -

           I'm quite curious why the index size is reduced in Block PF. Here is a comparison based on the 1M-doc Wikipedia data:

                            SepPF+For  BlockPF     
          skip_data_size    36M        n/a
          total_index_size  598M       540M
          

           Since in BlockPF the skip data is inlined into the .doc file, it is interesting that, even counting that data, BlockPF still gets a better compression rate.

           Also, since BlockPF uses different formats to store each term's postings, we tried to see how the data is actually stored. Here, we sum docFreq % 128 over all terms to get the vInt-encoded ints; the remaining ints are all block-encoded:

          Block encoded 88,326,528 ints 
          VInt encoded  39,929,349 ints
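
           A quick sketch of that counting, assuming a hypothetical input array
           with one docFreq per term (just to make the arithmetic concrete):

           static long[] countBlockVsVInt(int[] docFreqs) {
             final int blockSize = 128;
             long blockInts = 0, vIntInts = 0;
             for (int docFreq : docFreqs) {
               vIntInts += docFreq % blockSize;                       // trailing vInt part
               blockInts += (long) (docFreq / blockSize) * blockSize; // full For blocks
             }
             // On the 1M-doc index above this comes out to ~88.3M block-encoded
             // vs ~39.9M vInt-encoded ints.
             return new long[] { blockInts, vIntInts };
           }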
          
          Michael McCandless added a comment -

          New patch w/ payloads & offsets working ... I think it's ready!

          Michael McCandless added a comment -

           Those are interesting numbers: I'm surprised so many postings end up block-encoded.

           Block PF has far, far less skip data (skipInterval=128 vs 16 for Sep), and since it only skips to doc/freq block starts, it saves two bytes per skip point.
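
           Back of the envelope, just to make the saving concrete (not patch
           code): first-level skip entries are roughly docFreq / skipInterval,
           so 128 vs 16 is an 8x cut in entry count, before even counting the
           two bytes saved per entry:

           static long firstLevelSkipEntries(long docFreq, int skipInterval) {
             return docFreq / skipInterval;
           }
           // e.g. docFreq = 1,000,000: 62,500 entries at skipInterval=16 (Sep)
           // vs 7,812 at skipInterval=128 (Block), each entry also 2 bytes smaller.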

          Michael McCandless added a comment -

           OK, I committed this ... it still has lots of nocommits, but we can iterate on the branch.

          Han Jiang added a comment -

           Just hit an error on BlockPostingsFormat; this should reproduce on the latest branch:

          ant test-core -Dtestcase=TestGraphTokenizers -Dtests.method=testDoubleMockGraphTokenFilterRandom -Dtests.seed=1FD78436D5E26B9A -Dtests.postingsformat=Block
          
          Michael McCandless added a comment -

          Thanks Billy, I'll dig...

          Michael McCandless added a comment -

          OK I committed the fix: Block/PackedPF was incorrectly encoding offsets as startOffset - lastEndOffset. It must instead be startOffset - lastStartOffset because it is possible (though rare) for startOffset - lastEndOffset to be negative.

           I also separately committed a fix for NPEs that tests were hitting when the index indexed neither payloads nor offsets. Tests should now pass for BlockPF and BlockPackedPF...
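
           A minimal illustration of the fix (hypothetical names): delta
           against the previous startOffset, since with overlapping tokens
           (e.g. stacked synonyms) startOffset - lastEndOffset can go
           negative, which an unsigned vInt cannot represent:

           class OffsetDeltaSketch {
             private int lastStartOffset;

             int startOffsetDelta(int startOffset) {
               // Always >= 0, because startOffsets never decrease within a doc.
               int delta = startOffset - lastStartOffset;
               lastStartOffset = startOffset;
               return delta;
             }
           }
           // Example: token A = [0,10), stacked token B = [5,9):
           //   B.start - A.end   = 5 - 10 = -5   (the old, broken delta)
           //   B.start - A.start = 5 -  0 =  5   (the fix)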

          Han Jiang added a comment -

          OK, thanks Mike!


            People

            • Assignee: Michael McCandless
            • Reporter: Michael McCandless
            • Votes: 0
            • Watchers: 5
