Solr / SOLR-9429

Spellcheck Token Filter


    Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      This issue covers the design and implementation of a new token filter called SpellcheckTokenFilter.

      The new token filter takes a token stream as input and returns corrected tokens, based on a dictionary.
      The aim of the token filter is to fix misspelled words so that the correct token is indexed.

      For example, given dictionary d1:
      gaming
      gamer

      and given text t1 for the field f1:
      gamign is a strong industry

      the token filter will output:
      gaming is a strong industry

      A first possible design is to mimic the approach used in the spellchecker:
      build an FST for the dictionary, then build the Levenshtein FST for each token and intersect the two.
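
The intended behaviour can be sketched without any FST machinery. The following is a minimal illustration only, not the proposed implementation: it swaps the FST intersection for a brute-force scan that computes the plain Levenshtein distance between each token and every dictionary entry. The class name `SpellcheckSketch`, the methods `correct` and `filter`, and the `maxEdits` parameter are all hypothetical names introduced for this example. Plain Levenshtein distance counts the swapped letters in "gamign" as two edits, so the example assumes `maxEdits = 2`.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: brute-force stand-in for the FST intersection described above.
class SpellcheckSketch {

    // Classic dynamic-programming Levenshtein distance (insert, delete, substitute).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the closest dictionary entry within maxEdits, or the token unchanged.
    static String correct(String token, Collection<String> dictionary, int maxEdits) {
        if (dictionary.contains(token)) {
            return token;
        }
        String best = token;
        int bestDistance = maxEdits + 1;
        for (String entry : dictionary) {
            int distance = levenshtein(token, entry);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = entry;
            }
        }
        return best;
    }

    // Applies correct() to every whitespace-separated token of the text.
    static String filter(String text, Collection<String> dictionary, int maxEdits) {
        return Arrays.stream(text.trim().split("\\s+"))
                .map(token -> correct(token, dictionary, maxEdits))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        List<String> d1 = Arrays.asList("gaming", "gamer");
        // prints "gaming is a strong industry"
        System.out.println(filter("gamign is a strong industry", d1, 2));
    }
}
```

The scan is linear in the dictionary size for every token, which is exactly the cost the FST-based design is meant to avoid. The sketch only shows the expected input and output of the filter.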

      Possible applications include OCR-generated text and other use cases where misspelled words are common and we want to clean them up at indexing time.
      The filter could also be used in a more complex analyser, for example with a stemmer added after it.

      This is a draft idea that came from a blog comment by Shyamsunder.
      Feedback and additional ideas are welcome!

            People

            • Assignee: Unassigned
            • Reporter: Alessandro Benedetti (alessandro.benedetti)
            • Votes: 0
            • Watchers: 2
