Lucene - Core / LUCENE-5030

FuzzySuggester has to operate on FSTs of Unicode letters, not UTF-8 bytes, to work correctly for both 1-byte (e.g. English) and multi-byte (non-Latin) letters

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.3
    • Fix Version/s: 4.5, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

    Description

      There is a limitation in the current FuzzySuggester implementation: it computes edits in UTF-8 space instead of Unicode character (code point) space.
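
      To illustrate the limitation, here is a minimal, self-contained sketch that is not taken from the patch. It measures the same typo twice, once over Unicode code points and once over UTF-8 bytes. The class and method names (EditSpaceDemo, codePointDistance, utf8Distance) are hypothetical and exist only for this illustration; FuzzySuggester itself works with automata, not with a pairwise distance function.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Lucene.
final class EditSpaceDemo {

    // Classic two-row dynamic-programming Levenshtein distance over int sequences.
    static int levenshtein(int[] a, int[] b) {
        int[] prev = new int[b.length + 1];
        int[] cur = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length; i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length; j++) {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length];
    }

    // Edit distance where one unit is one Unicode code point.
    static int codePointDistance(String a, String b) {
        return levenshtein(a.codePoints().toArray(), b.codePoints().toArray());
    }

    // Edit distance where one unit is one UTF-8 byte.
    static int utf8Distance(String a, String b) {
        return levenshtein(utf8Units(a), utf8Units(b));
    }

    // Encodes to UTF-8 and widens each byte to an unsigned int.
    static int[] utf8Units(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        int[] out = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            out[i] = bytes[i] & 0xFF;
        }
        return out;
    }

    public static void main(String[] args) {
        // English: "cat" with the middle letter dropped.
        System.out.println(codePointDistance("cat", "ct") + " vs " + utf8Distance("cat", "ct"));
        // Russian "kot" (cat) with the middle letter dropped, written with
        // Unicode escapes so the source file stays ASCII-only.
        String kot = "\u043A\u043E\u0442";
        String kt = "\u043A\u0442";
        System.out.println(codePointDistance(kot, kt) + " vs " + utf8Distance(kot, kt));
    }
}
```

      For the English pair both distances are 1. For the Russian pair (кот vs кт) the code point distance is 1 but the UTF-8 distance is 2, because each Cyrillic letter occupies two bytes, so a single dropped letter costs two byte-level edits.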

      This should be fixable: we'd need to fix TokenStreamToAutomaton to work in Unicode character space, then fix FuzzySuggester to do the same steps that FuzzyQuery does: do the LevN expansion in Unicode character space, then convert that automaton to UTF-8, then intersect with the suggest FST.
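
      The order of operations proposed above (expand in code point space, convert to UTF-8, then intersect) can be sketched with plain collections. This is only an illustration under loud assumptions: a brute-force enumeration of one-edit variants over a small caller-supplied alphabet stands in for the Levenshtein automaton, a list of keys with byte-prefix matching stands in for the suggest FST, and the names CodePointFuzzySketch, expandOneEdit, splice, and suggest are hypothetical. It is not how the patch or Lucene implements the fix.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch class, not part of Lucene.
final class CodePointFuzzySketch {

    // Step 1: expand the query in code point space. Produces the query itself
    // plus every string reachable by one insertion, substitution, or deletion,
    // with inserted/substituted code points drawn from the given alphabet.
    static Set<String> expandOneEdit(String query, String alphabet) {
        int[] q = query.codePoints().toArray();
        int[] alpha = alphabet.codePoints().toArray();
        Set<String> out = new LinkedHashSet<>();
        out.add(query);
        for (int i = 0; i <= q.length; i++) {
            for (int c : alpha) {
                out.add(splice(q, i, i, c));          // insert c before position i
                if (i < q.length) {
                    out.add(splice(q, i, i + 1, c));  // substitute position i with c
                }
            }
            if (i < q.length) {
                out.add(splice(q, i, i + 1, -1));     // delete position i
            }
        }
        return out;
    }

    // Builds q[0..from) + optional code point (skipped when cp < 0) + q[to..).
    static String splice(int[] q, int from, int to, int cp) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < from; k++) {
            sb.appendCodePoint(q[k]);
        }
        if (cp >= 0) {
            sb.appendCodePoint(cp);
        }
        for (int k = to; k < q.length; k++) {
            sb.appendCodePoint(q[k]);
        }
        return sb.toString();
    }

    // Steps 2 and 3: encode each variant to UTF-8 only after the expansion,
    // then keep the stored keys whose UTF-8 bytes start with a variant's bytes.
    static List<String> suggest(String query, String alphabet, List<String> keys) {
        List<byte[]> variants = new ArrayList<>();
        for (String v : expandOneEdit(query, alphabet)) {
            variants.add(v.getBytes(StandardCharsets.UTF_8));
        }
        List<String> hits = new ArrayList<>();
        for (String key : keys) {
            byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
            for (byte[] v : variants) {
                if (startsWith(keyBytes, v)) {
                    hits.add(key);
                    break;
                }
            }
        }
        return hits;
    }

    static boolean startsWith(byte[] data, byte[] prefix) {
        if (prefix.length > data.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

      With the keys котёнок, кит, and собака, the alphabet "оикт", and the mistyped query кт, suggest returns котёнок and кит: one edit counts as one letter regardless of how many UTF-8 bytes that letter needs. Note that a one-letter query would also yield the empty variant, which prefixes every key; a real implementation would guard against that.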

      See the discussion here: http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none

      Attachments

        1. benchmark-INFO_SEP.txt
          4 kB
          Artem Lukanin
        2. benchmark-old.txt
          4 kB
          Artem Lukanin
        3. benchmark-wo_convertion.txt
          4 kB
          Artem Lukanin
        4. LUCENE-5030.patch
          28 kB
          Artem Lukanin
        5. LUCENE-5030.patch
          29 kB
          Artem Lukanin
        6. LUCENE-5030.patch
          30 kB
          Artem Lukanin
        7. LUCENE-5030.patch
          29 kB
          Michael McCandless
        8. LUCENE-5030.patch
          178 kB
          Artem Lukanin
        9. LUCENE-5030.patch
          175 kB
          Artem Lukanin
        10. nonlatin_fuzzySuggester_combo.patch
          187 kB
          Artem Lukanin
        11. nonlatin_fuzzySuggester_combo1.patch
          37 kB
          Artem Lukanin
        12. nonlatin_fuzzySuggester_combo2.patch
          184 kB
          Artem Lukanin
        13. nonlatin_fuzzySuggester.patch
          168 kB
          Artem Lukanin
        14. nonlatin_fuzzySuggester.patch
          167 kB
          Artem Lukanin
        15. nonlatin_fuzzySuggester.patch
          68 kB
          Artem Lukanin
        16. nonlatin_fuzzySuggester1.patch
          169 kB
          Artem Lukanin
        17. nonlatin_fuzzySuggester2.patch
          135 kB
          Artem Lukanin
        18. nonlatin_fuzzySuggester3.patch
          117 kB
          Artem Lukanin
        19. nonlatin_fuzzySuggester4.patch
          118 kB
          Artem Lukanin
        20. run-suggest-benchmark.patch
          2 kB
          Michael McCandless

          People

            mikemccand Michael McCandless
            alukanin Artem Lukanin
            Votes: 0
            Watchers: 4
