Solr / SOLR-2628

Use of FST for SynonymsFilterFactory and synonyms.txt

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 3.4, 4.0-ALPHA
    • Fix Version/s: None
    • Component/s: Schema and Analysis
    • Labels:
    • Environment: Linux

      Description

      Currently, the SynonymsFilterFactory builds an in-memory SynonymsMap.
      This map can become huge because of the permutations generated for the synonyms.
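
To illustrate why the permutations matter, here is a minimal sketch that assumes a simplified expansion model in which every term of a comma-separated synonym group maps to every term of that group. The function names `expand_group` and `stored_terms` are hypothetical, and this is not the actual Solr implementation.

```python
def expand_group(line):
    """Map every term in a comma-separated synonym group to the whole group.

    Assumption: a simplified model of expand-style synonym handling,
    written for illustration only.
    """
    terms = [t.strip() for t in line.split(",") if t.strip()]
    return {term: list(terms) for term in terms}


def stored_terms(mapping):
    """Count how many term references the expanded map holds."""
    return sum(len(outputs) for outputs in mapping.values())


group = "tv, television, telly, tube"
mapping = expand_group(group)

# A group of n terms yields n keys with n outputs each, so n * n references.
print(len(mapping), stored_terms(mapping))
```

Under this model the number of stored references grows quadratically with the size of each group, which is one way a modest synonyms.txt can turn into a large in-memory map.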

      Now that FSTs (finite state transducers) have been introduced to Lucene, they could also be used for synonyms.
      A tool could compile the synonyms.txt file into a binary automaton file, which could then be used
      with SynonymsFilterFactory.

      Advantages:

      • faster Solr startup, since the SynonymsMap no longer has to be generated
      • faster lookup
      • memory savings

          Activity

          Dawid Weiss added a comment -

          I've talked about it a little bit with Bernd and indeed, it seems possible to reduce the size of in-memory data structures by an order of magnitude (or even two orders of magnitude, we shall see). I'm on vacation for the next week and on a business trip for another one after that, but I'll be on it once I come back home.

          Michael McCandless added a comment -

          Dawid, have a look at LUCENE-3233 – we have a [very very rough] start at this.

          Dawid Weiss added a comment -

          Duplicate of LUCENE-3233

          Dawid Weiss added a comment -

          Yep, this is a duplicate. Thanks, Mike. Like I said, I won't be able to work on this for the next two weeks (I also have that FST refactoring open in the background... it's progressing slowly), but it's definitely low-hanging fruit: it shouldn't be very difficult and the gains should be huge.

          Michael McCandless added a comment -

          I think the reduction in RAM should be huge, but lookup speed might be slower (i.e. the usual tradeoff of FSTs), since we are going char by char in the FST. If we go word by word (i.e. the FST's labels are word ords and we separately resolve word -> ord via a "normal" hash lookup), then that might be a good middle ground... but this is all speculation for now!
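
The two labeling schemes discussed here can be sketched as follows. This is a rough illustration that uses a plain dictionary-based trie as a stand-in for an FST (no minimization, no suffix sharing); the `Trie` class, `to_ords` helper, and `word_ords` table are hypothetical names, not Lucene APIs.

```python
class Trie:
    """Toy trie standing in for an FST: maps a label sequence to an output."""

    def __init__(self):
        self.root = {}

    def add(self, labels, output):
        node = self.root
        for label in labels:
            node = node.setdefault(label, {})
        node[None] = output  # the None key marks a final state and holds its output

    def lookup(self, labels):
        """Return (output, arcs_followed); output is None when there is no match."""
        node = self.root
        steps = 0
        for label in labels:
            node = node.get(label)
            steps += 1
            if node is None:
                return None, steps
        return node.get(None), steps


word_ords = {}  # hash table resolving word -> ord (assumed side structure)


def to_ords(text, create=False):
    """Resolve each word to its ord; return None if a word is unknown."""
    ords = []
    for word in text.split():
        if word not in word_ords:
            if not create:
                return None
            word_ords[word] = len(word_ords)
        ords.append(word_ords[word])
    return ords


phrase = "domain name service"

char_trie = Trie()
char_trie.add(phrase, "dns")  # iterating a str yields one label per character

word_trie = Trie()
word_trie.add(to_ords(phrase, create=True), "dns")  # one label per word

char_result = char_trie.lookup(phrase)
word_result = word_trie.lookup(to_ords(phrase))
print(char_result, word_result)
```

In this toy model the character-labeled structure follows one arc per character, while the word-labeled one follows one arc per word plus a hash lookup per word. Whether that translates into a real speedup for an actual FST is exactly the open question raised in the comment.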

          Dawid Weiss added a comment -

          Yes, this may be the case. It'd need to be investigated because storing words in a hashtable will also bump memory requirements, whereas an FST can at least reuse some prefixes and suffixes.
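
As a small illustration of the prefix reuse mentioned here, the sketch below compares the total number of characters a hash table would have to store for a few words against the number of arcs in a prefix-sharing trie. It models prefix sharing only (a real FST can also share suffixes), and arc counts are not byte counts; the function name `count_trie_arcs` is hypothetical.

```python
def count_trie_arcs(words):
    """Build a character trie and count the arcs created, sharing common prefixes."""
    root = {}
    arcs = 0
    for word in words:
        node = root
        for ch in word:
            if ch not in node:
                node[ch] = {}
                arcs += 1
            node = node[ch]
    return arcs


words = ["synonym", "synonyms", "synonymous"]

total_chars = sum(len(w) for w in words)  # characters stored as separate keys
shared_arcs = count_trie_arcs(words)      # arcs needed when prefixes are shared

print(total_chars, shared_arcs)
```

The gap between the two numbers depends entirely on how much the vocabulary overlaps, so it would need to be measured on real synonym data before drawing conclusions.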


            People

            • Assignee: Dawid Weiss
            • Reporter: Bernd Fehling
            • Votes: 0
            • Watchers: 1