Lucene - Core
  1. Lucene - Core
  2. LUCENE-4678

FST should use paged byte[] instead of single contiguous byte[]

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.2, 6.0
    • Component/s: core/FSTs
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The single byte[] we use today has several limitations, eg it limits us to < 2.1 GB FSTs (and suggesters in the wild are getting close to this limit), and it causes big RAM spikes during building when a the array has to grow.

      I took basically the same approach as LUCENE-3298, but I want to break out this patch separately from changing all int -> long for > 2.1 GB support.

      1. LUCENE-4678.patch
        12 kB
        Michael McCandless
      2. LUCENE-4678.patch
        70 kB
        Michael McCandless
      3. LUCENE-4678.patch
        93 kB
        Michael McCandless
      4. LUCENE-4678.patch
        69 kB
        Michael McCandless
      5. LUCENE-4678.patch
        3 kB
        Michael McCandless

        Issue Links

          Activity

          Hide
          Michael McCandless added a comment -

          Patch, I think it's close to ready (no format change for the FST so no back compat).

          Show
          Michael McCandless added a comment - Patch, I think it's close to ready (no format change for the FST so no back compat).
          Hide
          Michael McCandless added a comment -

          Duh, wrong patch ... this one should be right.

          Show
          Michael McCandless added a comment - Duh, wrong patch ... this one should be right.
          Hide
          Dawid Weiss added a comment -

          This looks very cool! I looked at the patch briefly but I need to apply it to make sense of the whole picture.

          +              while(skip > 0) {
          +                buffer.writeByte((byte) 0);
          +                skip--;
          +              }
          

          this doesn't look particularly efficient but I didn't get the context where it's actually used from the patch so maybe it's all right.

          Show
          Dawid Weiss added a comment - This looks very cool! I looked at the patch briefly but I need to apply it to make sense of the whole picture. + while (skip > 0) { + buffer.writeByte(( byte ) 0); + skip--; + } this doesn't look particularly efficient but I didn't get the context where it's actually used from the patch so maybe it's all right.
          Hide
          Michael McCandless added a comment -

          this doesn't look particularly efficient

          I agree ... I have a new patch shortly that fixes this (goes back to writing directly into the BytesStore instead of buffering first in RAMOutputStream)...

          Show
          Michael McCandless added a comment - this doesn't look particularly efficient I agree ... I have a new patch shortly that fixes this (goes back to writing directly into the BytesStore instead of buffering first in RAMOutputStream)...
          Hide
          Michael McCandless added a comment -

          New patch, fixing the double-buffering the old patch had added, and adding a test for the new BytesStore ...

          I think it's ready!

          I'll commit only to trunk for now ... and backport to 4.2 once 4.1 branches and once this has baked some in trunk ...

          Show
          Michael McCandless added a comment - New patch, fixing the double-buffering the old patch had added, and adding a test for the new BytesStore ... I think it's ready! I'll commit only to trunk for now ... and backport to 4.2 once 4.1 branches and once this has baked some in trunk ...
          Hide
          Robert Muir added a comment -

          I'll commit only to trunk for now ... and backport to 4.2 once 4.1 branches and once this has baked some in trunk ...

          +1... the copyBytes is frightening though!

          What do you think of the FST.BytesReader -> FSTBytesReader? I'm just thinking it causes a lot of "api noise" (you can see it in the patch).
          Unfortunately lots of users have to create this thing to pass to methods on FST (e.g. findTargetArc).

          So if we kept it as FST.BytesReader they would be largely unaffected?

          Show
          Robert Muir added a comment - I'll commit only to trunk for now ... and backport to 4.2 once 4.1 branches and once this has baked some in trunk ... +1... the copyBytes is frightening though! What do you think of the FST.BytesReader -> FSTBytesReader? I'm just thinking it causes a lot of "api noise" (you can see it in the patch). Unfortunately lots of users have to create this thing to pass to methods on FST (e.g. findTargetArc). So if we kept it as FST.BytesReader they would be largely unaffected?
          Hide
          Michael McCandless added a comment -

          the copyBytes is frightening though!

          I know! But hopefully the random test catches any problems w/ it ... jenkins will tell us.

          So if we kept it as FST.BytesReader they would be largely unaffected?

          +1, I moved back to that ... no more noise ... I'll attach new patch shortly.

          Show
          Michael McCandless added a comment - the copyBytes is frightening though! I know! But hopefully the random test catches any problems w/ it ... jenkins will tell us. So if we kept it as FST.BytesReader they would be largely unaffected? +1, I moved back to that ... no more noise ... I'll attach new patch shortly.
          Hide
          Michael McCandless added a comment -

          New patch, move BytesReader back under FST. I think it's ready.

          Show
          Michael McCandless added a comment - New patch, move BytesReader back under FST. I think it's ready.
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] Michael McCandless
          http://svn.apache.org/viewvc?view=revision&revision=1432459

          LUCENE-4678: use paged byte[] under the hood for FST

          Show
          Commit Tag Bot added a comment - [trunk commit] Michael McCandless http://svn.apache.org/viewvc?view=revision&revision=1432459 LUCENE-4678 : use paged byte[] under the hood for FST
          Hide
          Michael McCandless added a comment -

          Patch, fixing FST.pack to not double-buffer again, using the new BytesStore.truncate method to roll back the last N bytes ...

          Show
          Michael McCandless added a comment - Patch, fixing FST.pack to not double-buffer again, using the new BytesStore.truncate method to roll back the last N bytes ...
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] Michael McCandless
          http://svn.apache.org/viewvc?view=revision&revision=1432646

          LUCENE-4678: add BytesStore.truncate and use it to not double-buffer during pack

          Show
          Commit Tag Bot added a comment - [trunk commit] Michael McCandless http://svn.apache.org/viewvc?view=revision&revision=1432646 LUCENE-4678 : add BytesStore.truncate and use it to not double-buffer during pack
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Robert Muir
          http://svn.apache.org/viewvc?view=revision&revision=1435141

          LUCENE-4677, LUCENE-4682, LUCENE-4678, LUCENE-3298: Merged /lucene/dev/trunk:r1432459,1432466,1432472,1432474,1432522,1432646,1433026,1433109

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Robert Muir http://svn.apache.org/viewvc?view=revision&revision=1435141 LUCENE-4677 , LUCENE-4682 , LUCENE-4678 , LUCENE-3298 : Merged /lucene/dev/trunk:r1432459,1432466,1432472,1432474,1432522,1432646,1433026,1433109
          Hide
          Steve Rowe added a comment -

          Looks like this can be resolved?

          Show
          Steve Rowe added a comment - Looks like this can be resolved?
          Hide
          Uwe Schindler added a comment -

          Closed after release.

          Show
          Uwe Schindler added a comment - Closed after release.

            People

            • Assignee:
              Michael McCandless
              Reporter:
              Michael McCandless
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development