Lucene - Core
LUCENE-1190

a lexicon object for merging spellchecker and synonyms from stemming

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 2.3
    • Fix Version/s: None
    • Component/s: core/search, modules/other
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Some Lucene features need a list of referring word. Spellchecking is the basic example, but synonyms are another use. Other tools can be used more smoothly with a list of words, without disturbing the main index: stemming and other word simplifications (anagram, phonetic ...).
      For that, I suggest a Lexicon object, which contains words (Term + frequency) and which can be built from a Lucene Directory or from plain text files.
      Classical TokenFilters can be used with Lexicon (LowerCaseFilter and ISOLatin1AccentFilter should be the most useful).
      Lexicon uses a Lucene Directory, each Word is a Document, each meta is a Field (word, ngram, phonetic, fields, anagram, size ...).
      Above a minimum size, number of differents words used in an index can be considered as stable. So, a standard Lexicon (built from wikipedia by example) can be used.
      A similarTokenFilter is provided.
      A spellchecker will come soon.
      A fuzzySearch implementation and a neutral synonym TokenFilter can be done.
      Unused words can be removed on demand (lazy delete?).

      Any criticism or suggestions?

      1. aphone+lexicon.patch
        303 kB
        Mathieu Lecarme
      2. aphone+lexicon.patch
        336 kB
        Mathieu Lecarme

        Activity

        Mathieu Lecarme created issue -
        Mathieu Lecarme made changes -
        Attachment aphone+lexicon.patch [ 12376437 ]
        Mathieu Lecarme added a comment -

        New features:
        a helper to extend a query with similar forms of each term:
        +type:dog +name:rintint*
        will become:
        +type:(dog (dogs doggy)^0.7) +name:rintint*

        A "Did you mean?" pattern packaged over IndexSearcher. If the number of search results is under a threshold, a sorted suggestion list for each term is provided, along with a rewritten query:
        truc:brawn
        will become:
        truc:brown
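The expansion helper described above could be sketched roughly like this (plain Python; the `similar` lookup table is a hypothetical stand-in for the lexicon's similarity query, and the query-string layout follows the example in this comment, not the actual patch):

```python
def expand_term(field, term, similar, boost=0.7):
    """Expand one query term with its similar forms, boosted lower.

    `similar` is a hypothetical lookup: word -> list of related words
    (e.g. produced by a lexicon of stemming-based synonyms).
    """
    variants = similar.get(term, [])
    if not variants:
        return f"+{field}:{term}"
    return f"+{field}:({term} ({' '.join(variants)})^{boost})"

def expand_query(terms, similar):
    """terms: list of (field, term) pairs; wildcard terms pass through."""
    parts = []
    for field, term in terms:
        if term.endswith("*"):          # leave prefix queries untouched
            parts.append(f"+{field}:{term}")
        else:
            parts.append(expand_term(field, term, similar))
    return " ".join(parts)
```

With `similar = {"dog": ["dogs", "doggy"]}`, the query `+type:dog +name:rintint*` is rewritten with the boosted variants while the prefix clause is left alone.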

        Mathieu Lecarme made changes -
        Attachment aphone+lexicon.patch [ 12376860 ]
        Otis Gospodnetic added a comment -

        This sounds like something that might be interesting, but honestly I don't follow the initial description and the 300KB+ patch is a big one.

        For example, I don't know what you mean by "Some Lucene features need a list of referring word". Do you mean "a list of associated words"?

        Lexicon uses a Lucene Directory, each Word is a Document, each meta is a Field (word, ngram, phonetic, fields, anagram, size ...).

        Each meta is a Field.... what do you mean by that? Could you please give an example?

        Above a minimum size, number of differents words used in an index can be considered as stable. So, a standard Lexicon (built from wikipedia by example) can be used.

        Hm, not sure I know what you mean. Are you saying that once you create a sufficiently large lexicon/dictionary/index, the number of new terms starts decreasing? (Heaps' law? http://en.wikipedia.org/wiki/Heaps'_law )

        Hide
        Mathieu Lecarme added a comment -

        With a FuzzyQuery, for example, you iterate over the Terms in the index, looking for the nearest one. PrefixQuery or regular expressions work in a similar way.
        If you decide that fuzzy querying will never return a word whose size differs by more than 1 (size+1 or size-1), you can restrict the list of candidates, and an ngram index can help you even more.

        Some token filters destroy the word. Stemmers, for example. If you want to search widely, a stemmer can help you, but you can't use a PrefixQuery with a stemmed word. So, you can stem words in a lexicon and use them as synonyms. You index "dog" and look for "doggy", "dogs" and "dog".
        Lexicon can use a static list of words, from a hunspell index or Wikipedia parsing, or words extracted from your index.
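The length and ngram restrictions described above can be sketched like this (plain Python; the lexicon is reduced to a plain word list for illustration, whereas the patch stores each word as a Lucene Document):

```python
def bigrams(word):
    """Set of character bigrams, i.e. the ngram.gram values of a lexicon entry."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def candidates(query, lexicon, max_len_diff=1, min_shared=1):
    """Restrict fuzzy-match candidates before any expensive edit-distance pass."""
    q_grams = bigrams(query)
    out = []
    for word in lexicon:
        if abs(len(word) - len(query)) > max_len_diff:
            continue                      # the size+1 / size-1 restriction
        if len(q_grams & bigrams(word)) < min_shared:
            continue                      # must share at least one ngram
        out.append(word)
    return out
```

Only the surviving candidates would then be scored with a real distance measure, which is where the cheap pre-filter pays off.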

        For the word "Lucene":

        word:lucene
        pop:42
        anagram.anagram:celnu
        aphone.start:LS
        aphone.gram:LS
        aphone.gram:SN
        aphone.end:SN
        aphone.size:3
        aphone.phonem:LSN
        ngram.start:lu
        ngram.gram:lu
        ngram.gram:uc
        ngram.gram:ce
        ngram.gram:en
        ngram.gram:ne
        ngram.end:ne
        ngram.size:6
        stemmer.stem:lucen
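A rough sketch of how the ngram and anagram fields above could be derived (plain Python; the aphone.* and stemmer.* fields depend on the patch's phonetic and stemming code, so they are omitted here):

```python
def ngram_fields(word, n=2):
    """Build the ngram.* metadata fields shown above for one word."""
    grams = [word[i:i + n] for i in range(len(word) - n + 1)]
    return {
        "word": word,
        "ngram.start": grams[0],
        "ngram.gram": grams,          # multi-valued field, one value per gram
        "ngram.end": grams[-1],
        "ngram.size": len(word),
    }

def anagram_key(word):
    """The anagram.anagram field: distinct letters, sorted ('celnu' for 'lucene')."""
    return "".join(sorted(set(word)))
```

Each dictionary would map onto one Lucene Document, with `ngram.gram` stored as a multi-valued Field.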

        Yes.

        M.

        Mathieu Lecarme added a comment -

        A simpler preview of Lexicon features: http://blog.garambrogne.net/index.php?post/2008/03/07/A-lexicon-approach-for-Lucene-index
        Otis Gospodnetic added a comment -

        Just came across this old issue, and still can't easily follow it. But I wonder if this issue has become irrelevant with all the new work on analyzers that Robert Muir & Co. are doing?

        Hide
        Robert Muir added a comment -

        Hi Otis, I took a look, and followed the blog link and explored
        the linked svn there (it was easier than reading the patch).

        I guess the interesting approach I see here is what looks to be
        some generation of phonetic filters (similar to the ones in Solr)
        from aspell resources.

        Honestly though, I am not knowledgeable enough about aspell to know
        to what degree this would work for some of these languages,
        or how it would compare to things like Metaphone.

        So, we could potentially use this idea if people wanted some
        more phonetic 'hash' functions available for specific languages,
        but I have a few concerns:

        • I do not know the license of the aspell resources these were gen'ed from
        • As mentioned above, I don't know the quality.
        • I think it would be preferable for the filter to work from the aspell files rather
          than gen'ing code if possible

        As far as what hunspell offers in comparison, I am not sure that
        it has this; instead it offers things like typical replacements that
        can be attempted for spellchecking and such. Chris Male might
        know more, as he has really been the one digging in.

        Hide
        Otis Gospodnetic added a comment -

        I'll close this shortly, unless people object and want to use something from here.

        Otis Gospodnetic made changes -
        Assignee Otis Gospodnetic [ otis ]
        Mark Thomas made changes -
        Workflow jira [ 12424423 ] Default workflow, editable Closed status [ 12563468 ]
        Mark Thomas made changes -
        Workflow Default workflow, editable Closed status [ 12563468 ] jira [ 12585023 ]
        Erick Erickson added a comment -

        SPRING_CLEANING_2013 We can reopen if necessary.

        Erick Erickson made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Won't Fix [ 2 ]

          People

          • Assignee:
            Otis Gospodnetic
            Reporter:
            Mathieu Lecarme
           • Votes:
             0
             Watchers:
             3
