Lucene - Core
  LUCENE-5214

Add new FreeTextSuggester, to handle "long tail" suggestions

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.6, 5.0
    • Component/s: modules/spellchecker
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The current suggesters are all based on a finite space of possible
      suggestions, i.e. the ones they were built on, so they can only
      suggest a full suggestion from that space.

      This means if the current query goes outside of that space then no
      suggestions will be found.

      The goal of FreeTextSuggester is to address this by making
      predictions from an ngram language model, i.e. using the last few
      tokens of the user's query to predict the likely next token.

      I got the idea from this blog post about Google's suggest:
      http://googleblog.blogspot.com/2011/04/more-predictions-in-autocomplete.html

      This is very much still a work in progress, but it seems to be
      working. I've tested it on the AOL query logs, using an interactive
      tool from luceneutil to show the suggestions, and it seems to work well.
      It's fun to use that tool to explore the word associations...

      I don't think this suggester would be used standalone; rather, I think
      it'd be a fallback for times when the primary suggester fails to find
      anything. You can see this behavior on google.com, if you type "the
      fast and the ", you see entire queries being suggested, but then if
      the next word you type is "burning" then suddenly you see the
      suggestions are only based on the last word, not the entire query.
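
      The fallback behavior described above can be sketched as follows.
      This is a toy model, not Lucene code – the two maps are hypothetical
      stand-ins for a primary full-query suggester and a last-token ngram
      model:

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the fallback pattern: consult a primary
 *  full-query suggester first, and only when it returns nothing, fall
 *  back to predictions keyed on the last token of the query. */
class FallbackSuggest {
    static List<String> suggest(String query,
                                Map<String, List<String>> primary,
                                Map<String, List<String>> lastTokenModel) {
        List<String> hits = primary.get(query);
        if (hits != null && !hits.isEmpty()) return hits;  // full-query match
        String[] tokens = query.trim().split("\\s+");
        String last = tokens[tokens.length - 1];           // back off to last token
        return lastTokenModel.getOrDefault(last, List.of());
    }
}
```

      With "the fast and the" the primary suggester answers; with "the
      fast and the burning" it misses, and only the last-token model has
      something to say – mirroring the google.com behavior above.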

      It uses ShingleFilter under the hood to generate the token ngrams,
      and then stores those ngrams in an FST; once LUCENE-5180 is in, it
      will be able to properly handle a user query that ends with
      stop-words (e.g. "wizard of ").
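
      For illustration, the token-ngram ("shingle") generation that
      ShingleFilter performs can be modeled in plain Java; the class and
      method names here are made up for this sketch, not Lucene's API:

```java
import java.util.ArrayList;
import java.util.List;

/** Toy sketch of token-ngram ("shingle") generation, along the lines of
 *  what ShingleFilter emits inside the analysis chain. */
class Shingles {
    /** Returns all ngrams of length 1..maxGram over the token sequence. */
    static List<String> shingles(List<String> tokens, int maxGram) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder sb = new StringBuilder();
            for (int len = 1; len <= maxGram && start + len <= tokens.size(); len++) {
                if (len > 1) sb.append(' ');
                sb.append(tokens.get(start + len - 1));
                out.add(sb.toString());  // emit the ngram ending at start+len-1
            }
        }
        return out;
    }
}
```

      For the tokens ["wizard", "of", "oz"] with maxGram=3 this yields the
      six grams "wizard", "wizard of", "wizard of oz", "of", "of oz", "oz".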

        Attachments

      1. LUCENE-5214.patch
        51 kB
        Michael McCandless
      2. LUCENE-5214.patch
        29 kB
        Michael McCandless

        Activity

        Michael McCandless added a comment -

        Current patch, very much work in progress...

        Robert Muir added a comment -

        This looks awesome: I think LUCENE-5180 will resolve a lot of the TODOs?

        I'm glad these corner cases of trailing stopwords etc were fixed properly in the analysis chain.

        And I like the name...

        Dawid Weiss added a comment -

        I looked through the patch but I didn't get it – it's too late;
        I'll give it another shot later.

        Anyway, the idea is very interesting – I wonder how much
        left-context (regardless of this implementation) one needs for the
        right prediction (it reminds me of Markov chains and generative
        poetry).

        Michael McCandless added a comment -

        The build method basically just runs all incoming text through the
        indexAnalyzer, appending ShingleFilter on the end to generate the
        ngrams. To "aggregate" the ngrams it simply writes them to the
        offline sorter; this is nice and simple but somewhat inefficient in
        how much transient disk and CPU it needs to sort all the ngrams, but
        it works (thanks Rob)! It may be better to have an in-memory hash
        that holds the frequent ngrams, and periodically flushes the "long
        tail" to free up RAM. But this gets more complex... the current code
        is very simple.

        After sorting the ngrams, it walks them, counting up how many times
        each gram occurred and then adding that to the FST. Currently, I do
        nothing with the surface form, i.e. the suggester only suggests the
        analyzed forms, which may be too ... weird? Though in playing around,
        I think the analysis you generally want to do should be very "light",
        so maybe this is OK.
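
        The "sort, then walk and count" aggregation can be illustrated
        with a small stand-alone sketch; here an in-memory sort stands in
        for Lucene's offline sorter, and the resulting counts are what
        would then be added to the FST:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the aggregation step: sort all emitted ngrams, then walk
 *  them once, counting adjacent duplicates – duplicates are guaranteed to
 *  be adjacent after sorting, so one pass suffices. */
class NgramCounts {
    static Map<String, Long> countSorted(List<String> ngrams) {
        Collections.sort(ngrams);             // stand-in for the offline sorter
        Map<String, Long> counts = new LinkedHashMap<>();
        String prev = null;
        long count = 0;
        for (String g : ngrams) {
            if (g.equals(prev)) {
                count++;                      // same gram as before: bump the count
            } else {
                if (prev != null) counts.put(prev, count);
                prev = g;                     // new gram: flush the previous one
                count = 1;
            }
        }
        if (prev != null) counts.put(prev, count);
        return counts;
    }
}
```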

        It can also save the surface form in the FST (I was doing that before;
        it's commented out now), but ... how to disambiguate? Currently it
        saves the shortest one. This also makes the FST even larger.

        At lookup time I again just run the query through your analyzer +
        ShingleFilter, and then first try to look up 3grams; failing that,
        2grams, and so on. I need to improve this to do some sort of
        smoothing like "real" ngram language models do; it shouldn't be
        this "hard" backoff.
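
        The "hard" backoff described here can be sketched as follows; a
        plain map from context to the best next token stands in for the
        FST, and the smoothing alternative is noted in a comment:

```java
import java.util.Map;

/** Sketch of "hard" backoff: try the longest available context first, and
 *  only if the model has never seen it, retry with a shorter context. A
 *  smoothed model (e.g. stupid backoff) would instead blend orders,
 *  discounting lower-order scores by a constant rather than switching
 *  abruptly. */
class BackoffLookup {
    /** model maps a context like "fast and" to its best next token. */
    static String predict(String[] tokens, Map<String, String> model) {
        int n = tokens.length;
        for (int order = Math.min(3, n); order >= 1; order--) {
            StringBuilder ctx = new StringBuilder();
            for (int i = n - order; i < n; i++) {   // last `order` tokens
                if (ctx.length() > 0) ctx.append(' ');
                ctx.append(tokens[i]);
            }
            String next = model.get(ctx.toString());
            if (next != null) return next;          // longest context wins
        }
        return null;                                // no prediction at any order
    }
}
```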

        Anyway, it's great fun playing with the suggester live (using the simplistic
        command-line tool in luceneutil, freedb/suggest.py) to "explore" the
        ngram language model. This is how I discovered LUCENE-5180.

        Dawid Weiss added a comment -

        Pretty cool, thanks Mike.

        Michael McCandless added a comment -

        New patch, resolving all nocommits. I think it's ready!

        Robert Muir added a comment -

        +1

        Areek Zillur added a comment -

        Hey Michael, I had a question for you; this may not be the most
        relevant place to ask, but I will anyway.

        I was curious to know why you did not implement the load and store
        methods for your AnalyzingInfixSuggester, rather than building the
        index in the ctor? Was it because they take an Input/OutputStream?
        What are your thoughts on generalizing the interface so that the
        index can be loaded up and stored as it is done by all the other
        suggesters?

        Michael McCandless added a comment -

        I was curious to know why you did not implement the load and store
        methods for your AnalyzingInfixSuggester, rather than building the
        index in the ctor?

        Well ... once you .build() the AnalyzingInfixSuggester, it's already "stored" since it's backed by an on-disk index. So this suggester is somewhat different from others (it's not RAM resident ... hmm unless you provide a RAMDir in getDirectory).

        In the ctor, if there's already a previously built suggester, I just open the searcher there. I suppose we could move that code into load() instead?

        Was it because they take an Input/OutputStream?

        That is sort of weird; I think we have an issue open to change that to Directory or maybe IndexInput/Output or something ...

        What are your thoughts on generalizing the interface so that the index can be loaded up and stored as it is done by all the other suggesters?

        +1 to somehow improve the suggester APIs (I think there's yet another issue opened for that).

        Do you mean loaded into a RAMDir?

        this may not be the most relevant place to ask but will do anyways.

        That's fine. Just send an email to dev@ next time ...

        your AnalyzingInfixSuggester

        It's not "mine". Anyone can and should go fix it!

        ASF subversion and git services added a comment -

        Commit 1528517 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1528517 ]

        LUCENE-5214: add FreeTextSuggester

        ASF subversion and git services added a comment -

        Commit 1528521 from Michael McCandless in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1528521 ]

        LUCENE-5214: add FreeTextSuggester

        ASF subversion and git services added a comment -

        Commit 1528579 from Michael McCandless in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1528579 ]

        LUCENE-5214: remove java-7 only @SafeVarargs


          People

          • Assignee: Michael McCandless
          • Reporter: Michael McCandless
          • Votes: 0
          • Watchers: 8
