LUCENE-5030 (Lucene - Core)

FuzzySuggester has to operate on FSTs of Unicode letters, not UTF-8 bytes, to work correctly for both 1-byte (like English) and multi-byte (non-Latin) letters

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.3
    • Fix Version/s: 4.5, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      There is a limitation in the current FuzzySuggester implementation: it computes edits in UTF-8 space instead of Unicode character (code point) space.
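      To make the limitation concrete, here is a minimal, self-contained sketch (not Lucene code; the class and method names are hypothetical) that computes a plain Levenshtein distance twice: once over Unicode code points and once over UTF-8 bytes. Replacing one Latin letter costs one edit either way, but replacing one Cyrillic letter can cost two edits in byte space when both bytes of its encoding change.

```java
import java.nio.charset.StandardCharsets;

public class EditSpaceDemo {

    // Classic dynamic-programming Levenshtein distance over two int sequences.
    static int levenshtein(int[] a, int[] b) {
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length; i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length; j++) {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length];
    }

    // The string as a sequence of Unicode code points.
    static int[] codePoints(String s) {
        return s.codePoints().toArray();
    }

    // The string as a sequence of unsigned UTF-8 byte values.
    static int[] utf8Bytes(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        int[] out = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            out[i] = bytes[i] & 0xFF;
        }
        return out;
    }

    static int codePointDistance(String a, String b) {
        return levenshtein(codePoints(a), codePoints(b));
    }

    static int byteDistance(String a, String b) {
        return levenshtein(utf8Bytes(a), utf8Bytes(b));
    }

    public static void main(String[] args) {
        // Latin example: one letter replaced, one byte per letter.
        String en1 = "cat", en2 = "cut";
        // Cyrillic example (U+0434 U+043E U+043C vs U+0434 U+044B U+043C):
        // one letter replaced, two bytes per letter.
        String ru1 = "\u0434\u043e\u043c", ru2 = "\u0434\u044b\u043c";

        System.out.println("Latin, code points: " + codePointDistance(en1, en2));
        System.out.println("Latin, UTF-8 bytes: " + byteDistance(en1, en2));
        System.out.println("Cyrillic, code points: " + codePointDistance(ru1, ru2));
        System.out.println("Cyrillic, UTF-8 bytes: " + byteDistance(ru1, ru2));
    }
}
```

      With a fixed edit budget counted in bytes, the Cyrillic substitution above would use up twice as much of the budget as the Latin one, which is consistent with the English/Russian difference reported in the linked discussion.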

      This should be fixable. We'd need to fix TokenStreamToAutomaton to work in Unicode character space, then fix FuzzySuggester to follow the same steps as FuzzyQuery: do the Levenshtein (LevN) expansion in Unicode character space, convert that automaton to UTF-8, then intersect it with the suggest FST.
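      The order of operations proposed above can be illustrated with a brute-force stand-in. This is only a sketch under loud assumptions: it does not use Lucene's automata or FST classes, it enumerates edit-distance-1 variants explicitly instead of building a Levenshtein automaton, and a list of UTF-8 encoded keys plays the role of the suggest FST. All names (`CodePointFuzzyLookup`, `expandOneEdit`, `suggest`, `cyrillicLowercase`) are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class CodePointFuzzyLookup {

    // Hypothetical helper: the 32 code points U+0430..U+044F (Cyrillic small letters).
    static int[] cyrillicLowercase() {
        int[] alphabet = new int[32];
        for (int i = 0; i < alphabet.length; i++) {
            alphabet[i] = 0x430 + i;
        }
        return alphabet;
    }

    // Step 1: expand the query in code point space. Returns the query plus every
    // string reachable by one deletion, substitution, or insertion of a code point.
    static Set<String> expandOneEdit(String query, int[] alphabet) {
        int[] cps = query.codePoints().toArray();
        Set<String> out = new TreeSet<>();
        out.add(query);
        for (int i = 0; i <= cps.length; i++) {
            if (i < cps.length) {
                out.add(build(cps, i, i + 1, -1));    // delete the code point at i
            }
            for (int c : alphabet) {
                if (i < cps.length) {
                    out.add(build(cps, i, i + 1, c)); // substitute the code point at i
                }
                out.add(build(cps, i, i, c));         // insert before position i
            }
        }
        return out;
    }

    // Copies cps[0..from), appends the replacement if it is >= 0, then copies cps[to..).
    static String build(int[] cps, int from, int to, int replacement) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < from; k++) {
            sb.appendCodePoint(cps[k]);
        }
        if (replacement >= 0) {
            sb.appendCodePoint(replacement);
        }
        for (int k = to; k < cps.length; k++) {
            sb.appendCodePoint(cps[k]);
        }
        return sb.toString();
    }

    static boolean startsWith(byte[] data, byte[] prefix) {
        if (prefix.length > data.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Steps 2 and 3: only after the expansion, convert each variant to UTF-8 and
    // match it as a byte prefix against the UTF-8 encoded keys.
    static List<String> suggest(String query, int[] alphabet, List<String> keys) {
        List<byte[]> variants = new ArrayList<>();
        for (String variant : expandOneEdit(query, alphabet)) {
            variants.add(variant.getBytes(StandardCharsets.UTF_8));
        }
        List<String> hits = new ArrayList<>();
        for (String key : keys) {
            byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
            for (byte[] variant : variants) {
                if (startsWith(keyBytes, variant)) {
                    hits.add(key);
                    break;
                }
            }
        }
        return hits;
    }
}
```

      Because the edit budget is spent before any byte encoding happens, a one-letter typo costs one edit regardless of how many UTF-8 bytes the letter occupies. A real implementation would replace the explicit enumeration with an automaton and the key list with the FST.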

      See the discussion here: http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none

        Attachments

        1. benchmark-INFO_SEP.txt
          4 kB
          Artem Lukanin
        2. benchmark-old.txt
          4 kB
          Artem Lukanin
        3. benchmark-wo_convertion.txt
          4 kB
          Artem Lukanin
        4. LUCENE-5030.patch
          28 kB
          Artem Lukanin
        5. LUCENE-5030.patch
          29 kB
          Artem Lukanin
        6. LUCENE-5030.patch
          30 kB
          Artem Lukanin
        7. LUCENE-5030.patch
          29 kB
          Michael McCandless
        8. LUCENE-5030.patch
          178 kB
          Artem Lukanin
        9. LUCENE-5030.patch
          175 kB
          Artem Lukanin
        10. nonlatin_fuzzySuggester_combo.patch
          187 kB
          Artem Lukanin
        11. nonlatin_fuzzySuggester_combo1.patch
          37 kB
          Artem Lukanin
        12. nonlatin_fuzzySuggester_combo2.patch
          184 kB
          Artem Lukanin
        13. nonlatin_fuzzySuggester.patch
          168 kB
          Artem Lukanin
        14. nonlatin_fuzzySuggester.patch
          167 kB
          Artem Lukanin
        15. nonlatin_fuzzySuggester.patch
          68 kB
          Artem Lukanin
        16. nonlatin_fuzzySuggester1.patch
          169 kB
          Artem Lukanin
        17. nonlatin_fuzzySuggester2.patch
          135 kB
          Artem Lukanin
        18. nonlatin_fuzzySuggester3.patch
          117 kB
          Artem Lukanin
        19. nonlatin_fuzzySuggester4.patch
          118 kB
          Artem Lukanin
        20. run-suggest-benchmark.patch
          2 kB
          Michael McCandless

            People

            • Assignee: Michael McCandless (mikemccand)
            • Reporter: Artem Lukanin (alukanin)
            • Votes: 0
            • Watchers: 5
