Solr / SOLR-2015

add a config hook for autoGeneratePhraseQueries

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.1, 4.0-ALPHA
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None

      Description

      Now that LUCENE-2458 has been committed, a config hook for autoGeneratePhraseQueries would be convenient in some situations.

      1. SOLR-2015.patch
        3 kB
        Koji Sekiguchi
      2. SOLR-2015.patch
        5 kB
        Yonik Seeley
      3. SOLR-2015.patch
        10 kB
        Yonik Seeley

        Activity

        Yonik Seeley added a comment -

        This should really be on a per-field basis at a minimum.
        Even better, it should be in the token stream itself (i.e. some produced groups of tokens should be treated as a phrase, and some shouldn't... only the filter producing them knows for sure).

        Koji Sekiguchi added a comment -

        How can I implement "on a per-field basis"? The flag seems to affect globally.

        Robert Muir added a comment -

        How can I implement "on a per-field basis"?

        For per-field control, you must do it in your subclass instead of the flag.
        The easiest way is this:

        @Override
        protected Query getFieldQuery(String field, String queryText, boolean quoted) throws ParseException {
          // If phrases should be generated for this field, hardcode 'true' as quoted,
          // so that every whitespace-separated part of the query is treated as quoted.
          // shouldAutoGeneratePhrasesFor(field) is your own per-field check.
          if (shouldAutoGeneratePhrasesFor(field)) {
            return super.getFieldQuery(field, queryText, true);
          }
          return super.getFieldQuery(field, queryText, quoted);
        }
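
        The per-field check used above could be backed by a simple lookup with a default. The following is only a sketch: the class name PhraseSettings, the field names, and the map-based storage are assumptions for illustration and are not part of Lucene, Solr, or any of the attached patches.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-field settings holder; not part of Lucene or Solr.
final class PhraseSettings {
    private final Map<String, Boolean> perField = new HashMap<>();
    private final boolean defaultValue;

    PhraseSettings(boolean defaultValue) {
        this.defaultValue = defaultValue;
    }

    // Record an explicit choice for one field; returns this for chaining.
    PhraseSettings set(String field, boolean autoGenerate) {
        perField.put(field, autoGenerate);
        return this;
    }

    // Fields without an explicit choice fall back to the default.
    boolean shouldAutoGeneratePhrasesFor(String field) {
        return perField.getOrDefault(field, defaultValue);
    }

    public static void main(String[] args) {
        // Field names here are made up for the example.
        PhraseSettings settings = new PhraseSettings(false)
            .set("text_en", true)
            .set("text_cjk", false);
        System.out.println(settings.shouldAutoGeneratePhrasesFor("text_en"));
        System.out.println(settings.shouldAutoGeneratePhrasesFor("text_cjk"));
        System.out.println(settings.shouldAutoGeneratePhrasesFor("title"));
    }
}
```

        A query parser subclass could hold one such object and consult it inside getFieldQuery, which keeps the per-field decision out of the global flag.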
        
        Koji Sekiguchi added a comment -

        I see, thanks.

        Yonik Seeley added a comment -

        I'm upping this to the highest priority and taking it since the default behavior for our solr example server now really stinks.

        Robert Muir added a comment -

        I'm upping this to the highest priority and taking it since the default behavior for our solr example server now really stinks.

        I don't think the default behavior stinks at all. As stated before, it now works with languages such as Thai, where it formerly didn't really work at all (all queries were phrase queries).
        If you don't think the behavior for English is perfect, that's fine, but an open source product should work reasonably well for all languages.
        So I don't think we should default to having this behavior on, since it is tied to whitespace tokenization.

        Yonik Seeley added a comment -

        OK, here's a prototype patch.
        I'll add some tests next.

        Robert Muir added a comment -

        Yonik, I just don't think the default for autoGeneratePhraseQueries should be "true"; it should be false instead.
        This is no problem for older existing schemas, as the Version constant is already respected.
        And I think it should be documented (e.g. in the example type text) that this option might not be suitable for non-whitespace-separated languages.

        Other than these concerns, I think putting it in the fieldType like this is a good approach.

        Yonik Seeley added a comment -

        autoGeneratePhrase=true has been the behavior forever (before July 19th)... this just makes the behavior configurable per-field. Changing the default to false would only make sense if it were a better choice for the majority of our users... and I don't think it is.
        Although back compat is not the primary concern here, it is nice that someone can switch to the newest version and cut-n-paste some of their previous field definitions that worked well for them.

        Our example schema is english oriented.
        All of the example docs are in english, the "text" field has an english stemmer, the tutorial is in english, and people must know english in order to collaborate with our development. English is the international language and we shouldn't make relevancy worse for it and other whitespace delimited languages by default.

        I do also want to make things work better for other international languages - but not at the cost of european languages. Given our existing user base, I think that's an acceptable position. Now that we have both the ability to turn off autoGeneratePhrase, and the ability to configure it per-field, what international field types should we add to the example schema to improve the situation?

        Robert Muir added a comment -

        though I disagree with a significant number of the statements you made,
        I don't think we would ever come to agreement anyway.

        but, my concerns about this default basically disappear if we could
        have example configs for other languages: first-class in the example
        schema.xml and not tucked away and difficult to find. could even be
        commented out.

        because my problem with the default is all about making it more
        difficult to get reasonable behavior, forcing people to go through
        unnecessary hoops when all this shit can easily work.

        Yonik Seeley added a comment -

        Here's an updated patch that adds a simple test, along with adding a note about autoGeneratePhraseQueries="true" not working well for non whitespace delimited languages.

        Michael McCandless added a comment -

        Can we make different example config/schema XML files for whitespace vs non-whitespace languages?

        Ie such that on install you must make an explicit choice and copy the right files over, before starting Solr?

        Robert Muir added a comment -

        Can we make different example config/schema XML files for whitespace vs non-whitespace languages?

        Ie such that on install you must make an explicit choice and copy the right files over, before starting Solr?

        +1, the config shouldn't be in english, english isn't the international language, it's not special.

        It might be important to Lucid or someone else, but I don't give a shit about it.

        This is an open source project, one language doesn't get to be held in higher esteem than another.

        Yonik Seeley added a comment -

        Ie such that on install you must make an explicit choice and copy the right files over, before starting Solr?

        Solr doesn't have an installer though... you unzip and "cd example; java -jar start.jar".
        And there are also some people interested in multiple languages in the same index. Aside: some of these people would like multiple languages in the same field, which is part of the reason why I always felt that the information about how two tokens are related should be produced by the tokenizer/filter creating such tokens.

        Can we make different example config/schema XML files for whitespace vs non-whitespace languages?

        I'm not sure what that would accomplish by itself though... it's not like solr is much of an out-of-the-box solution for anything.
        We have a default example so that people can easily run through the tutorial, and execute examples on wiki pages.
        If there is a single field type that is good for many non-whitespace languages, it seems like we should just add it to the example schema.
        And if there is enough demand to demonstrate Solr's international capabilities, we could add a few different-language docs to example/exampledocs and perhaps even to the tutorial.

        More OOTB support for many languages is related to SOLR-1860 too.

        Robert Muir added a comment -

        Aside: some of these people would like multiple languages in the same field, which is part of the reason why I always felt that the information about how two tokens are related should be produced by the tokenizer/filter creating such tokens.

        I don't think we should design our APIs around such hacks, especially unproven ones. I don't think the auto phrase generation actually helps English at all, and no one has shown results anywhere that it helps. The reason I don't think it helps is that any improvement in precision is accompanied by a decrease in recall: e.g. in this example from the user list, not using the phrase query would find the document, but if you use the phrase query, it doesn't. http://www.lucidimagination.com/search/document/bacf34995067e3cb/worddelimiterfilter_and_phrase_queries

        Furthermore, I don't think we should try to make complicated support for multiple languages. Instead we should support simple, proven approaches that actually work, such as simple language-independent tokenization or n-gram analysis, rather than trying to support fine-grained detection and fancy stuff that overly complicates APIs and only provides worse results: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.111.6844

        Michael McCandless added a comment -

        Don't forget that this auto-phrase-gen is buggy: if the user's query
        is wi fi, then this will not turn into a phrase.

        Really, it's QueryParser that's buggy: it should not assume it can
        pre-split on whitespace.

        As Robert has pointed out, even if the feature weren't buggy, there's
        no evidence auto-phrase-gen actually improves relevance even for
        English.

        Yet it's most definitely disastrous for non-whitespace languages (CJK,
        Thai, etc.).

        This is why, in my opinion, if we must pick a single global default
        (for the 'text' field in Solr's example schema.xml), it should be
        disabled by default: it's buggy for English and catastrophic for
        non-whitespace languages.

        To fix this "correctly", we somehow need a better QueryParser/Analyzer
        interaction, such that all variants of wifi (WiFi, wifi, wi fi, wi-fi)
        are consistently mapped during indexing and searching. Just adding a
        new per-token attr doesn't fix it (the wi fi example, above).
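
        The wi fi case can be illustrated with a toy model of a parser that splits on whitespace before analysis. This is a sketch only: PreSplitDemo and its stand-in analyzer are assumptions made for illustration, not the real Lucene QueryParser or any Solr filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy model of the behavior discussed above; not the real Lucene QueryParser.
final class PreSplitDemo {
    // Stand-in analyzer (assumption): lowercases and splits on anything that is
    // not a letter or digit, roughly as a word-delimiter style filter might.
    static List<String> analyze(String chunk) {
        List<String> tokens = new ArrayList<>();
        for (String t : chunk.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // The parser splits on whitespace first, then analyzes each chunk on its own.
    // A chunk that yields several tokens becomes a phrase when autoPhrase is on.
    static List<String> parse(String query, boolean autoPhrase) {
        List<String> clauses = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            List<String> tokens = analyze(chunk);
            if (autoPhrase && tokens.size() > 1) {
                clauses.add("\"" + String.join(" ", tokens) + "\"");
            } else {
                clauses.addAll(tokens);
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        System.out.println(parse("wi-fi", true)); // one phrase clause
        System.out.println(parse("wi fi", true)); // two separate term clauses
        System.out.println(parse("WiFi", true));  // one term clause
    }
}
```

        Because the analyzer in this model never sees "wi" and "fi" together when the user types a space, no per-token attribute emitted by the analyzer can turn that query into a phrase.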

        I'm not sure what that would accomplish by itself though... it's not like solr is much of an out-of-the-box solution for anything.
        We have a default example so that people can easily run through the tutorial, and execute examples on wiki pages.

        I suspect many apps take the default solrconfig/schema and run with
        it / iteratively tweak it.

        Solr doesn't have an installer though... you unzip and "cd example; java -jar start.jar".

        Maybe we insert a "cp {english,cjk}schema.xml schema.xml" in between
        those two steps? This would avoid the global default, ie, force an
        explicit choice.

        Or maybe we make separate default fieldTypes in schema.xml
        (text_whitespace, text_non_whitespace – need better names)?

        Or, maybe we make this setting take three values: unset, on, off. It
        defaults to unset, but Solr refuses to run with this value, throwing
        an exception saying you must set it?
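
        The three-value idea might look something like the sketch below. The enum, its constant names, and the exception message are hypothetical and are not Solr's actual configuration handling.

```java
// Hypothetical tri-state setting; the names are illustrative, not Solr's.
final class TriStateDemo {
    enum AutoPhrase { UNSET, ON, OFF }

    // Resolve the setting at startup, refusing to continue if no choice was made.
    static boolean resolve(AutoPhrase value) {
        switch (value) {
            case ON:
                return true;
            case OFF:
                return false;
            default:
                throw new IllegalStateException(
                    "autoGeneratePhraseQueries must be set explicitly to on or off");
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve(AutoPhrase.ON));
        System.out.println(resolve(AutoPhrase.OFF));
        try {
            resolve(AutoPhrase.UNSET);
        } catch (IllegalStateException e) {
            System.out.println("refused to start: " + e.getMessage());
        }
    }
}
```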

        Something along these lines would let us avoid having to agree on a
        global default, ie, make the choice explicit.

        This is just like what we did with maxFieldLength a while back. Previously
        it silently truncated after 10K terms, which was a dangerous default. So, we
        forced the choice, by making it a required param in IW. (Later we then
        changed the default to no truncation, and made it not required).

        Robert Muir added a comment -

        Even for the euro-languages where people think this is helpful, it's sometimes a disaster.

        I noticed a French case here where it caused a serious problem (enough for them to write custom code to try to get around it): http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance

        Finally, I think this dictates behavior to the end user, and doesn't consider their information need at all.
        Since Google etc. have become popular, I think users are familiar with putting things in quotes themselves.
        So a user who wants this behavior (causing a phrase) can always trigger it by putting the query in quotes.

        This allows them to refine the query themselves like they would do in any other situation; it's way more user-friendly
        and consistent.

        Yonik Seeley added a comment -

        is wi fi, then this will not turn into a phrase.

        Right - but there's just a lack of information that can't be helped?
        So while one might want stuff like this as a phrase, I don't think it's a bug that it's not.

        What is a problem though is the lack of ability for the user to add additional context to fix the issue (i.e. a SynonymFilter to manually map "wi fi" wouldn't work since it would get "wi" and then "fi" in separate runs).

        What is also a problem is that if the original doc contained "wifi" then a query of "wi-fi" won't match (since it queries for "wi fi"). We work around this today (for people that really need it) by indexing a second field that catenates instead of splits the parts of a split token. It's certainly not ideal, but people tend to be happy with the cases we can match.
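
        The second-field workaround can be sketched with two toy analysis chains. Everything here (CatenateDemo, the regex-based analyzers, the in-memory phrase matching) is an assumption for illustration and not how Solr implements it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Toy model of indexing one text into a "split" field and a "catenated" field.
final class CatenateDemo {
    // Split analysis (assumption): "wi-fi" -> [wi, fi], "wifi" -> [wifi].
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Catenating analysis (assumption): the parts of each whitespace-separated
    // word are joined back together, so "wi-fi" -> [wifi].
    static List<String> catenate(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            String joined = word.replaceAll("[^a-z0-9]+", "");
            if (!joined.isEmpty()) {
                tokens.add(joined);
            }
        }
        return tokens;
    }

    // True if the query tokens occur as a contiguous run in the document tokens.
    static boolean containsPhrase(List<String> doc, List<String> phrase) {
        return !phrase.isEmpty() && Collections.indexOfSubList(doc, phrase) >= 0;
    }

    // Searching only the split field.
    static boolean matchesSplitOnly(String docText, String query) {
        return containsPhrase(split(docText), split(query));
    }

    // Searching the split field OR the catenated field.
    static boolean matchesEither(String docText, String query) {
        return matchesSplitOnly(docText, query)
            || containsPhrase(catenate(docText), catenate(query));
    }

    public static void main(String[] args) {
        String doc = "the wifi router";
        System.out.println(matchesSplitOnly(doc, "wi-fi")); // phrase "wi fi" vs [the, wifi, router]
        System.out.println(matchesEither(doc, "wi-fi"));    // catenated "wifi" is also tried
    }
}
```

        In this model the catenated field recovers the "wifi" document for a "wi-fi" query, at the cost of indexing the text twice.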

        So while our current system is far from perfect (and we should work on improving it), the problem is not that we have an incorrect solution, but an incomplete solution.
        Let's assume we had a QP that didn't split on whitespace (or whatever our optimal solution is).
        IMO, I would still want tokens joined by a dash to form a phrase query, just like tokens surrounded by quotes.
        It's important information and shouldn't be discarded.

        there's no evidence auto-phrase-gen actually improves relevance even for English.

        IMO, it's a case of "the customer is always right". Many people have asked how to do this sort of matching over the years and I think there is plenty of evidence that it increases relevancy.

        Maybe we insert a "cp {english,cjk}schema.xml schema.xml" in between those two steps? This would avoid the global default, ie, force an explicit choice.

        And the tutorial that's in english would tell them to copy the english one... that only hurts english speakers and doesn't help anyone else..
        We can have different text field types in a single schema - it's just a matter of adding another one that's good for non-whitespace delimited languages?

        Robert Muir added a comment -

        Many people have asked how to do this sort of matching over the years and I think there is plenty of evidence that it increases relevancy.

        You still haven't provided any evidence.

        it's just a matter of adding another one that's good for non-whitespace delimited languages?

        There isn't a single tokenizer that is good for all these languages. ICUTokenizer is ok on average for these, but it's not integrated.
        I think we should add examples for all languages instead. The problem affects some whitespace-delimited languages, too.

        Yonik Seeley added a comment - - edited

        What would the fieldType for a generic international field look like?
        If we can decide on that, we could add it at least.

        edit: paths crossed - I see you answered that above.

        Robert Muir added a comment -

        What would the fieldType for a generic international field look like?

        All I am asking for is to add commented out text_XX examples for the languages we support?
        This shouldn't affect the time it takes to start up Solr and would resolve my concerns.

        Michael McCandless added a comment -

        The problem is not that we have an incorrect solution, but an incomplete solution.

        True, but... I think you're splitting hairs.

        From the user's standpoint, auto-phrase is flakey – in some cases it
        works, in others it doesn't.

        Let's assume we had a QP that didn't split on whitespace (or whatever our optimal solution is).
        IMO, I would still want tokens joined by a dash to form a phrase query, just like tokens surrounded by quotes.
        It's important information and shouldn't be discarded.

        I agree we shouldn't discard a user's dashes – they are important.
        Google also treats wizard-of-oz as a phrase query (Uwe seems
        particularly fond of this!).

        Hmm though I just tried wizard-of-oz, wizard of oz, and "wizard of
        oz", and got 3 different sets of results, from Google... hmmm.

        We can have different text field types in a single schema - it's just a matter of adding another one that's good for non-whitespace delimited languages?

        OK this seems like a good solution for now, until we fix QP/Analyzer
        to do this "privately".

        Grant Ingersoll added a comment -

        Bulk close for 3.1.0 release


          People

          • Assignee:
            Yonik Seeley
            Reporter:
            Koji Sekiguchi