Solr
  1. Solr
  2. SOLR-2888

FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: spellchecker
    • Labels:
      None

      Description

      This issue incorporates several problems:

      • utf16 was used previously to store and lookup terms, now it is utf8
      • the construction would OOM with large number of terms because of the need to sort entries. Sorter APIs have been added and an implementation of external (on-disk) sorting is also added (Robert Muir).
      • the FSTLookup class has been split and refactored into FSTCompletion and FSTCompletionBuilder, FSTCompletionLookup remains a facade connecting all the pieces and implements Lookup interface. For large inputs use FSTCompletionBuilder directly (and pre-bucket your input weights).
      • Automatic bucketing in FSTCompletionLookup has been changed from linear min/max discretization into dividing into ranges after all values have been sorted. This empirically handles all potential distributions quite well. If somebody needs something very specific, use FSTCompletionBuilder directly (providing buckets), construct the automaton and then load it with FSTCompletionLookup.
      1. SOLR-2888.patch
        103 kB
        Dawid Weiss
      2. SOLR-2888.patch
        101 kB
        Dawid Weiss
      3. SOLR-2888.patch
        107 kB
        Dawid Weiss
      4. SOLR-2888_backport.patch
        12 kB
        Robert Muir

        Issue Links

          Activity

          Dawid Weiss created issue -
          Dawid Weiss made changes -
          Field Original Value New Value
          Link This issue is blocked by SOLR-2887 [ SOLR-2887 ]
          Hide
          Dawid Weiss added a comment -

          Depends on external sort because that will imply the order of terms used for fst construction/ traversals

          Show
          Dawid Weiss added a comment - Depends on external sort because that will imply the order of terms used for fst construction/ traversals
          Hide
          Robert Muir added a comment -

          or is external sort dependent on this?

          I would say one advantage of utf8 representation would be easier integration
          with our other automaton stuff... its optimized for utf-8 and there is code
          here and there that intertwines the two. this could be useful for potential
          fuzzy matching (edit distances)

          One disadvantage of utf-8 would be that if you use more than 128 buckets
          it would cause the prefix byte to be plural... I think thats an ok limitation though?

          Show
          Robert Muir added a comment - or is external sort dependent on this? I would say one advantage of utf8 representation would be easier integration with our other automaton stuff... its optimized for utf-8 and there is code here and there that intertwines the two. this could be useful for potential fuzzy matching (edit distances) One disadvantage of utf-8 would be that if you use more than 128 buckets it would cause the prefix byte to be plural... I think thats an ok limitation though?
          Hide
          Robert Muir added a comment -

          actually scratch that limitation: since utf8/utf32 order is also unsigned byte order,
          we can still encode a full range byte for the bucket prefix...

          Show
          Robert Muir added a comment - actually scratch that limitation: since utf8/utf32 order is also unsigned byte order, we can still encode a full range byte for the bucket prefix...
          Dawid Weiss made changes -
          Link This issue blocks SOLR-2887 [ SOLR-2887 ]
          Dawid Weiss made changes -
          Link This issue is blocked by SOLR-2887 [ SOLR-2887 ]
          Dawid Weiss made changes -
          Status Open [ 1 ] In Progress [ 3 ]
          Dawid Weiss made changes -
          Link This issue incorporates SOLR-2761 [ SOLR-2761 ]
          Hide
          Dawid Weiss added a comment -

          This replaces utf16 with utf8, Strings with ByteRefs and does some initial API tweaks to move away from the Lookup API.

          Show
          Dawid Weiss added a comment - This replaces utf16 with utf8, Strings with ByteRefs and does some initial API tweaks to move away from the Lookup API.
          Dawid Weiss made changes -
          Attachment SOLR-3888.patch [ 12503905 ]
          Dawid Weiss made changes -
          Attachment SOLR-3888.patch [ 12503905 ]
          Hide
          Dawid Weiss added a comment -

          Proposed patch that fixes SOLR-2888 and SOLR-2887. This is also non-backwards compatible API refactoring – FSTLookup has been split into FSTCompletion (not a Lookup subclass), there is an adapter for Lookup called FSTCompletionLookup.

          These changes try to separate FSTCompletion from strings/floats used in Lookup. An external sorting on disk has been added. I tested it locally with 40m terms from Wikipedia – FST construction was a memory bottleneck, everything else nicely spills to disk. Increasing RAM to ~1.5G did construct the suggester automaton for those 40m terms though.

          Not everything done – still some TODOs and ideas. Feel free to reiterate/ provide feedback.

          Show
          Dawid Weiss added a comment - Proposed patch that fixes SOLR-2888 and SOLR-2887 . This is also non-backwards compatible API refactoring – FSTLookup has been split into FSTCompletion (not a Lookup subclass), there is an adapter for Lookup called FSTCompletionLookup. These changes try to separate FSTCompletion from strings/floats used in Lookup. An external sorting on disk has been added. I tested it locally with 40m terms from Wikipedia – FST construction was a memory bottleneck, everything else nicely spills to disk. Increasing RAM to ~1.5G did construct the suggester automaton for those 40m terms though. Not everything done – still some TODOs and ideas. Feel free to reiterate/ provide feedback.
          Dawid Weiss made changes -
          Attachment SOLR-2888.patch [ 12504069 ]
          Hide
          Dawid Weiss added a comment -

          The patch contains a logical error in FSTCompletionLookup – weight discretization must assign to buckets based on "score-stickiness"; that is once a score has been assigned to a given bucket all entries need to be assigned to the same bucket. This is needed to make sure the same score is not distributed within different buckets.

          Will fix later.

          Show
          Dawid Weiss added a comment - The patch contains a logical error in FSTCompletionLookup – weight discretization must assign to buckets based on "score-stickiness"; that is once a score has been assigned to a given bucket all entries need to be assigned to the same bucket. This is needed to make sure the same score is not distributed within different buckets. Will fix later.
          Hide
          Dawid Weiss added a comment -

          Cleaner separation of concerns between FSTCompletionBuilder and FSTCompletion. Cleaned up how lookup works (variations passed in the constructor, not in the lookup methods). Added methods for writing and reading automata to FST. Added full sorting of floats based on their int4 order.

          Show
          Dawid Weiss added a comment - Cleaner separation of concerns between FSTCompletionBuilder and FSTCompletion. Cleaned up how lookup works (variations passed in the constructor, not in the lookup methods). Added methods for writing and reading automata to FST. Added full sorting of floats based on their int4 order.
          Dawid Weiss made changes -
          Attachment SOLR-2888.patch [ 12504201 ]
          Hide
          Dawid Weiss added a comment -

          Fixed that bug with wrong bucketing but also handled what annoyed me most – separated the completion data structure builder and the runtime logic. I think it's much cleaner now, Robert, so if you want to try intersections go ahead. Share patches, I'm curious to look at how hairy it'll be

          Also, I did add full float range to that binary sorting. Like I said – requires some shifts, but not as difficult as I thought it'd be.

          Show
          Dawid Weiss added a comment - Fixed that bug with wrong bucketing but also handled what annoyed me most – separated the completion data structure builder and the runtime logic. I think it's much cleaner now, Robert, so if you want to try intersections go ahead. Share patches, I'm curious to look at how hairy it'll be Also, I did add full float range to that binary sorting. Like I said – requires some shifts, but not as difficult as I thought it'd be.
          Dawid Weiss made changes -
          Link This issue blocks SOLR-2887 [ SOLR-2887 ]
          Dawid Weiss made changes -
          Link This issue incorporates SOLR-2887 [ SOLR-2887 ]
          Dawid Weiss made changes -
          Summary FSTSuggester should use utf8/utf32 order FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups
          Description For some reason it uses utf16 internally. Shouldn't make much of a difference, really. This issue incorporates several problems:
          - utf16 was used previously to store and lookup terms, now it is utf8
          - the construction would OOM with large number of terms because of the need to sort entries. Sorter APIs have been added and an implementation of external (on-disk) sorting is also added (Robert Muir).
          - the FSTLookup class has been split and refactored into FSTCompletion and FSTCompletionBuilder, FSTCompletionLookup remains a facade connecting all the pieces and implements Lookup interface. For large inputs use FSTCompletionBuilder directly (and pre-bucket your input weights).
          - Automatic bucketing in FSTCompletionLookup has been changed from linear min/max discretization into dividing into ranges after all values have been sorted. This empirically handles all potential distributions quite well. If somebody needs something very specific, use FSTCompletionBuilder directly (providing buckets), construct the automaton and then load it with FSTCompletionLookup.
          Dawid Weiss made changes -
          Status In Progress [ 3 ] Open [ 1 ]
          Hide
          Dawid Weiss added a comment -

          Same as before, but with fixed normalization of NaNs in float->sortable int4 conversion.

          Show
          Dawid Weiss added a comment - Same as before, but with fixed normalization of NaNs in float->sortable int4 conversion.
          Dawid Weiss made changes -
          Attachment SOLR-2888.patch [ 12504265 ]
          Hide
          Dawid Weiss added a comment -

          Same as before with debugging code removed.

          Show
          Dawid Weiss added a comment - Same as before with debugging code removed.
          Dawid Weiss made changes -
          Attachment SOLR-2888.patch [ 12504295 ]
          Dawid Weiss made changes -
          Attachment SOLR-2888.patch [ 12504265 ]
          Dawid Weiss made changes -
          Attachment SOLR-2888.patch [ 12504201 ]
          Dawid Weiss made changes -
          Attachment SOLR-2888.patch [ 12504069 ]
          Hide
          Dawid Weiss added a comment -

          I would like to commit this in if there are no objections.

          Show
          Dawid Weiss added a comment - I would like to commit this in if there are no objections.
          Hide
          Dawid Weiss added a comment -

          Updated patch:

          • updated to recent API refactorings in BytesRef,
          • FSTCompletion doesn't use LookupResult directly (no intermediate Strings).

          This is ready to be committed, two remaining TODOs (infix suggestions, use of Analyzers for synonym suggestions, full support for float weights) will be split into separate issues.

          Show
          Dawid Weiss added a comment - Updated patch: updated to recent API refactorings in BytesRef, FSTCompletion doesn't use LookupResult directly (no intermediate Strings). This is ready to be committed, two remaining TODOs (infix suggestions, use of Analyzers for synonym suggestions, full support for float weights) will be split into separate issues.
          Dawid Weiss made changes -
          Attachment SOLR-2888.patch [ 12505747 ]
          Hide
          Robert Muir added a comment -

          looks good, a few nits:

          • bytesequencesreader is complementary to itself
          • externalrefsorter.close shouldn't mask exceptions i dont think? caller can do this in a try/catch
          • same with the new save()/read() methods added to FST
          Show
          Robert Muir added a comment - looks good, a few nits: bytesequencesreader is complementary to itself externalrefsorter.close shouldn't mask exceptions i dont think? caller can do this in a try/catch same with the new save()/read() methods added to FST
          Hide
          Dawid Weiss added a comment -

          What do you mean by "complementary to itself"? As for closing, sure I can propoagate up the stack.

          Show
          Dawid Weiss added a comment - What do you mean by "complementary to itself"? As for closing, sure I can propoagate up the stack.
          Hide
          Robert Muir added a comment -

          'complementary to itself': recursive javadocs @link for ByteSequencesReader

          Show
          Robert Muir added a comment - 'complementary to itself': recursive javadocs @link for ByteSequencesReader
          Hide
          Dawid Weiss added a comment -

          Ah... right. Sorry, will fix. I thought we're falling into Douglas Adams kind of narrative.

          Show
          Dawid Weiss added a comment - Ah... right. Sorry, will fix. I thought we're falling into Douglas Adams kind of narrative.
          Hide
          Dawid Weiss added a comment -

          Corrected JavaDocs. Corrected handling of IOException on close(), hope I caught all the cases right.

          Show
          Dawid Weiss added a comment - Corrected JavaDocs. Corrected handling of IOException on close(), hope I caught all the cases right.
          Dawid Weiss made changes -
          Attachment SOLR-2888.patch [ 12505774 ]
          Hide
          Dawid Weiss added a comment -

          In trunk.

          Show
          Dawid Weiss added a comment - In trunk.
          Dawid Weiss made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Hide
          Robert Muir added a comment -

          I'm gonna try to work on a backport for 3.x here.

          Show
          Robert Muir added a comment - I'm gonna try to work on a backport for 3.x here.
          Robert Muir made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          Assignee Dawid Weiss [ dweiss ] Robert Muir [ rcmuir ]
          Hide
          Robert Muir added a comment -

          Here's the backport: i svn copied all the code and tests from trunk.

          patch shows the differences from the merge only, mostly just java 5 stuff.

          I kept the old FSTLookup to support the old API but deprecated it and its test. I don't think any other backwards compatibility is useful since we changed FST format anyway.

          Show
          Robert Muir added a comment - Here's the backport: i svn copied all the code and tests from trunk. patch shows the differences from the merge only, mostly just java 5 stuff. I kept the old FSTLookup to support the old API but deprecated it and its test. I don't think any other backwards compatibility is useful since we changed FST format anyway.
          Robert Muir made changes -
          Attachment SOLR-2888_backport.patch [ 12510804 ]
          Robert Muir made changes -
          Status Reopened [ 4 ] Resolved [ 5 ]
          Fix Version/s 3.6 [ 12319065 ]
          Resolution Fixed [ 1 ]
          Hide
          Dawid Weiss added a comment -

          Thanks for doing the dirty work, Robert.

          Show
          Dawid Weiss added a comment - Thanks for doing the dirty work, Robert.
          Uwe Schindler made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              Robert Muir
              Reporter:
              Dawid Weiss
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development