Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 3.6, 4.0-ALPHA
    • Fix Version/s: 4.1, Trunk
    • Component/s: modules/spellchecker
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Since we added shortest-path wFSA search in LUCENE-3714, and generified the comparator in LUCENE-3801,
      I think we should look at implementing suggesters that have more capabilities than just basic prefix matching.

      In particular I think the most flexible approach is to integrate with Analyzer at both build and query time,
      such that we build a wFST with:
      input: analyzed text such as ghost0christmas0past <-- byte 0 here is an optional token separator
      output: surface form such as "the ghost of christmas past"
      weight: the weight of the suggestion

      we make an FST with PairOutputs<weight,output>, but only do the shortest path operation on the weight side (like
      the test in LUCENE-3801), at the same time accumulating the output (surface form), which will be the actual suggestion.

      This allows a lot of flexibility:

      • Using even standardanalyzer means you can offer suggestions that ignore stopwords, e.g. if you type in "ghost of chr...",
        it will suggest "the ghost of christmas past"
      • we can add support for synonyms/wdf/etc at both index and query time (there are tradeoffs here, and this is not implemented!)
      • this is a basis for more complicated suggesters such as Japanese suggesters, where the analyzed form is in fact the reading,
        so we would add a TokenFilter that copies ReadingAttribute into term text to support that...
      • other general things like offering suggestions that are more "fuzzy" like using a plural stemmer or ignoring accents or whatever.

      According to my benchmarks, suggestions are still very fast with the prototype (e.g. ~ 100,000 QPS), and the FST size does not
      explode (its short of twice that of a regular wFST, but this is still far smaller than TST or JaSpell, etc).

      1. LUCENE-3842-TokenStream_to_Automaton.patch
        17 kB
        Michael McCandless
      2. LUCENE-3842.patch
        26 kB
        Robert Muir
      3. LUCENE-3842.patch
        44 kB
        Robert Muir
      4. LUCENE-3842.patch
        34 kB
        Robert Muir
      5. LUCENE-3842.patch
        34 kB
        Robert Muir
      6. LUCENE-3842.patch
        53 kB
        Michael McCandless
      7. LUCENE-3842.patch
        58 kB
        Michael McCandless
      8. LUCENE-3842.patch
        64 kB
        Michael McCandless
      9. LUCENE-3842.patch
        65 kB
        Robert Muir
      10. LUCENE-3842.patch
        74 kB
        Sudarshan Gaikaiwari
      11. LUCENE-3842.patch
        72 kB
        Michael McCandless
      12. LUCENE-3842.patch
        20 kB
        Michael McCandless
      13. LUCENE-3842.patch
        17 kB
        Michael McCandless
      14. LUCENE-3842.patch
        25 kB
        Michael McCandless
      15. LUCENE-3842.patch
        22 kB
        Michael McCandless
      16. LUCENE-3842.patch
        46 kB
        Michael McCandless
      17. LUCENE-3842.patch
        116 kB
        Michael McCandless
      18. LUCENE-3842.patch
        117 kB
        Michael McCandless
      19. LUCENE-3842.patch
        118 kB
        Michael McCandless

        Issue Links

          Activity

            People

            • Assignee:
              Michael McCandless
              Reporter:
              Robert Muir
            • Votes:
              2 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development