Lucene - Core
LUCENE-2458

queryparser makes all CJK queries phrase queries regardless of analyzer

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: core/queryparser
    • Labels: None
    • Lucene Fields:
      New, Patch Available

      Description

      The queryparser automatically makes ALL CJK, Thai, Lao, Myanmar, Tibetan, ... queries into phrase queries, even though you didn't ask for one, and there isn't a way to turn this off.

      This completely breaks Lucene for these languages, as it treats all queries like 'grep'.

      Example: if you query for f:abcd with StandardAnalyzer, where a, b, c, d are Chinese characters, you get a phrase query of "a b c d". If you use CJKAnalyzer, it's no better: you get a phrase query of "ab bc cd", and with SmartChineseAnalyzer you get a phrase query like "ab cd". But the user didn't ask for a phrase query, and they cannot turn it off.

      The reason is that the code that forms phrase queries is not internationally appropriate and assumes whitespace tokenization: if more than one token comes out of a whitespace-delimited chunk of text, it automatically becomes a phrase query no matter what.

      The proposed patch fixes the core queryparser (with all backwards compat kept) to only form phrase queries when the double quote operator is used.

      Implementing subclasses can always extend the QP and auto-generate whatever kind of queries they want that might completely break search for languages they don't care about, but core general-purpose QPs should be language independent.
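      To make the heuristic concrete, here is a minimal, self-contained sketch of the decision rule. This is a simplified stand-in for illustration only, not the actual Lucene QueryParser code; the string rendering of queries is invented.

```java
import java.util.List;

// Simplified model of the pre-fix heuristic: if the analyzer emits more than
// one token for a single whitespace-delimited chunk, the old parser built a
// PhraseQuery even though the user never typed quotes. The fixed parser only
// does so when the double-quote operator was actually used.
public class PhraseHeuristicSketch {
    // oldBehavior=true models the pre-3.1 parser; false models the fix.
    public static String fieldQuery(String field, List<String> tokens,
                                    boolean quoted, boolean oldBehavior) {
        if (tokens.size() == 1) {
            return "TermQuery(" + field + ":" + tokens.get(0) + ")";
        }
        if (quoted || oldBehavior) {
            return "PhraseQuery(" + field + ":\"" + String.join(" ", tokens) + "\")";
        }
        // Fixed behavior: separate term queries combined in a boolean query.
        StringBuilder sb = new StringBuilder("BooleanQuery(");
        for (int i = 0; i < tokens.size(); i++) {
            if (i > 0) sb.append(" OR ");
            sb.append(field).append(":").append(tokens.get(i));
        }
        return sb.append(")").toString();
    }

    public static void main(String[] args) {
        // StandardAnalyzer splits the CJK chunk "abcd" into one token per character.
        List<String> cjkTokens = List.of("a", "b", "c", "d");
        System.out.println(fieldQuery("f", cjkTokens, false, true));
        // old: PhraseQuery(f:"a b c d") -- an unwanted phrase query
        System.out.println(fieldQuery("f", cjkTokens, false, false));
        // fixed: BooleanQuery(f:a OR f:b OR f:c OR f:d) -- plain term queries
    }
}
```

      The point of the fix is visible in the second branch: the `quoted` flag, not the token count, decides whether a phrase query is formed.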

      1. LUCENE-2458.patch
        62 kB
        Robert Muir
      2. LUCENE-2458.patch
        57 kB
        Robert Muir
      3. LUCENE-2458.patch
        33 kB
        Robert Muir
      4. LUCENE-2458.patch
        24 kB
        Robert Muir

          Activity

          Grant Ingersoll added a comment -

          Bulk close for 3.1

          Robert Muir added a comment -

          Robert, it was your commit that changed the default behavior of Solr, and I disagree with that change.
          Technically, I could VETO - but I don't believe I have ever done a code-change veto, and I don't want to start now.

          Yonik, I would rather you just VETO than heavy-commit the wrong changes.
          For example, if you said "Robert, it's annoying that users with LUCENE_31 set in their solrconfig
          don't have enough flexibility yet without dropping the version back to LUCENE_30. I feel that
          the parameter setting in SOLR-2015 should be incorporated into this issue"

          I mean, that's completely constructive!

          Instead, I'll try and be constructive by going to work on SOLR-2015 so we can at least configure it per-field.

          Man, I am willing to help with that also (though I am not particularly a Solr queryparser expert; I think we
          should expose these options to users that want them, instead of requiring them to depend on version-specific
          defaults). Just let me know how I can help; I want constructive progress.

          Uwe Schindler added a comment -

          It's a new major version! Even 2 major steps, 1.5 -> 3.1, and three steps to 4.0.

          Yonik Seeley added a comment -

          Robert, it was your commit that changed the default behavior of Solr, and I disagree with that change.
          Technically, I could VETO - but I don't believe I have ever done a code-change veto, and I don't want to start now.
          Instead, I'll try and be constructive by going to work on SOLR-2015 so we can at least configure it per-field.

          Yonik Seeley added a comment -

          Furthermore, you should check if it's set to index with "omitTF" and not autogenerate in that case either.

          Solr doesn't currently allow omitting TF for text fields.
          It's good to keep in mind if we ever enable that though.

          Robert Muir added a comment -

          Please stop committing all these wrong changes.

          Now i have to go revert 2 more commits.

          Robert Muir added a comment -

          I think the best way forward is to add a CJK field to solr that defaults to the opposite behavior (i.e. treats split tokens as completely separate).

          I think this is completely wrong (besides, far more languages than CJK are affected).

          You should consider a "european" field instead.

          Furthermore, you should check if it's set to index with "omitTF" and not autogenerate in that case either.
          In trunk I think PhraseQuery will actually throw an exception in this case instead of silently failing,
          so the autogenerated queries can be very dangerous even for English.

          Yonik Seeley added a comment -

          Before, one ctor used the version specified, the other hardcoded LUCENE_24.

          Ah.... and that constructor is the one that's used everywhere in Solr (leading me to believe that leaving Solr's default alone was deliberate).

          As I said before, this shouldn't and cannot be "per-token", and such English-centric hacks do not belong in the analysis API.

          The ability of a filter to say "this token is actually indexed as two adjacent tokens" is fundamental and not related to any specific language.
          It can be used for language specific hacks perhaps... but it is not a hack itself.

          I never mentioned issues of back compat, but of changes to Solr's default behavior, which I continue to think is the best.
          I think the best way forward is to add a CJK field to solr that defaults to the opposite behavior (i.e. treats split tokens as completely separate).

          Uwe Schindler added a comment -

          I don't understand the backwards issue. We did not break backwards compatibility, as the default for luceneMatchVersion is still 2.4 (so anybody using their old solrconfig will get exactly the same behaviour). New users are new and can use the new default. If they don't like it, they can change the default version or use Koji's parameter.

          The idea about having this in the TS is not bad, but in my opinion too special. But you are right, the tokenizer should report such concatenated words. On the other hand, you can simply use another tokenizer that does not split your product numbers.

          Robert Muir added a comment -

          The patch doesn't change Solr's default; instead it causes SolrQueryParser to respect the version parameter in the solrconfig in both ctors.
          Before, one ctor used the version specified; the other hardcoded LUCENE_24.

          As I said before, this shouldn't and cannot be "per-token", and such English-centric hacks do not belong in the analysis API.

          Separately, I think what Koji is doing on SOLR-2015 is the way to go, not hardcoding LUCENE_24 as a version despite what is in the config.

          I don't think this English-centric hack should be the default for Solr. The patch does completely respect old schemas and is completely backwards compatible:
          if you have no version in your schema it will be LUCENE_24 and you get the old behavior;
          if you made your own queryparser and subclassed the old API, you get the old behavior; and it respects the version set in the solrconfig rather than overriding it to 2.4 in just one ctor.

          Yonik Seeley added a comment -

          The previous patches posted in this issue did not change Solr's default (else I would have objected earlier).
          From past discussions, it seems like a majority of people feel like the default should be the old behavior for Solr (this default matters much less for Lucene developers).
          We need to find a way to enable the new behavior per-field (although I feel that's a bit of a hack too... it really should be per-token).

          Robert Muir added a comment -

          Please revert.

          Yonik Seeley added a comment -

          I've reverted just the change of default behavior to Solr's QP.
          There are too many negative side-effects to change this given the way Solr is currently used (and documented to behave).
          We need to work on (at a minimum) a per-field config for Solr, but it seems like per-token is still the right way long term.

          Robert Muir added a comment -

          Lucene's problem is that it does not take position into account for scoring, so documents where the tokens appear next to each other do not score higher (in contrast to Google, which supports those combined tokens).

          Actually no, the problem is that this autogeneration never really worked anyway; it was broken from the beginning.
          e.g. http://www.lucidimagination.com/search/document/bacf34995067e3cb/worddelimiterfilter_and_phrase_queries

          Uwe Schindler added a comment -

          QP now has a public final void setAutoGeneratePhraseQueries(boolean value). The default value is the one coming from the Version parameter, but you can easily change it, so people like me (often working with these kinds of product numbers but never CJK text) can easily keep this behaviour. Lucene's problem is that it does not take position into account for scoring, so documents where the tokens appear next to each other do not score higher (in contrast to Google, which supports those combined tokens).

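          The toggle Uwe describes can be modeled as follows. This is a minimal stand-in for illustration only (the class and the string rendering are invented, not the real org.apache.lucene.queryParser.QueryParser), showing how the setter interacts with the quoted flag and the Version-derived default:

```java
import java.util.List;

// Stand-in for the parser's toggle: the default comes from the Version the
// parser was constructed with (false for >= 3.1), and the setter lets callers
// opt back in, e.g. users who search product numbers like "pdp-11".
public class ToggleSketch {
    private boolean autoGeneratePhraseQueries;

    public ToggleSketch(boolean defaultFromVersion) {
        this.autoGeneratePhraseQueries = defaultFromVersion;
    }

    public final void setAutoGeneratePhraseQueries(boolean value) {
        this.autoGeneratePhraseQueries = value;
    }

    public String fieldQuery(String field, List<String> tokens, boolean quoted) {
        if (tokens.size() == 1) return "TermQuery(" + field + ":" + tokens.get(0) + ")";
        if (quoted || autoGeneratePhraseQueries)
            return "PhraseQuery(" + field + ":\"" + String.join(" ", tokens) + "\")";
        // Phrase auto-generation off: separate term queries instead.
        return "BooleanQuery(" + field + ":" + String.join(" OR " + field + ":", tokens) + ")";
    }

    public static void main(String[] args) {
        ToggleSketch qp = new ToggleSketch(false);   // 3.1-style default: off
        System.out.println(qp.fieldQuery("text", List.of("pdp", "11"), false));
        qp.setAutoGeneratePhraseQueries(true);       // opt back in at runtime
        System.out.println(qp.fieldQuery("text", List.of("pdp", "11"), false));
    }
}
```

          In the real parser the constructor's Version argument supplies the default, and setAutoGeneratePhraseQueries lets non-subclassers flip it without overriding getFieldQuery.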
          Koji Sekiguchi added a comment -

          I agree with Koji's idea of adding a config hook for autoGeneratePhraseQueries for those that want it, but I don't think it should be on by default either.

          Thanks. I'll open a ticket for it.

          Robert Muir added a comment -

          Perhaps we should switch the SolrQueryParser back to using version==LUCENE_24 (or LUCENE_29 would work too)?

          I don't think we should do this. The whole point of this issue was that this auto-generation is a bad default, e.g. every Thai query is a phrase query.
          I agree with Koji's idea of adding a config hook for autoGeneratePhraseQueries for those that want it, but I don't think it should be on by default either.

          Robert Muir added a comment -

          The change is backwards compatible... it fully respects the version in solrconfig.xml (as it should)

          Yonik Seeley added a comment -

          As Koji noticed, it looks like what was committed accidentally changed the default behavior of Solr (i.e. the last attached patch didn't, but what was committed did).
          A query of pdp-11 now results in text:pdp OR text:11 instead of text:"pdp 11"

          Perhaps we should switch the SolrQueryParser back to using version==LUCENE_24 (or LUCENE_29 would work too)?

          Robert Muir added a comment -

          Committed revision 965585 / 965592 (3x)

          Robert Muir added a comment -

          ok, will commit this in a few days.

          Robert Muir added a comment -

          Attached is an updated patch with the latest proposal (for the 3x branch).

          A boolean option, getter, and setter are added, and it defaults to false only for >= 3.1.
          Also added a test for the toggle and some javadocs.

          All tests pass.

          Mark Miller added a comment -

          +1

          Michael McCandless added a comment -

          +1

          Uwe Schindler added a comment -

          +1 to latest proposal!

          Robert Muir added a comment -

          OK, I revised the idea here:

          In 3.1 we supply a patch that looks like this one, except there is a simple boolean toggle too.
          If you want per-field behavior or more explicit customization, you can subclass;
          the simple toggle is for non-subclassers.

          In 4.0 we do the same thing, for now.

          We open a separate issue where we replace the QP with something better (one that does not split on whitespace
          at all and allows multi-word synonyms, n-gram tokenization, Vietnamese, etc. to work).

          As part of that issue, take the existing one (with per-field toggle) and call it ClassicQueryParser or whatever.

          Yonik Seeley added a comment -

          In this case, too, you could override the default with subclassing.

          True - but I think some were saying the default should be configurable w/o subclassing.

          but we can add an explicit boolean option for those that don't subclass?

          Right. I think everyone is essentially saying the same thing at this point (at the high level)?
          Make it configurable (per-parser), and allow the user to handle per-field variations via subclassing.

          Robert Muir added a comment -

          True... I'm fine with subclassing - I guess the only diff is if the default is configurable or set only via version.

          In this case, too, you could override the default with subclassing.

          @Override
          public Query getFieldQuery(String field, String text, boolean quoted) {
            return super.getFieldQuery(field, text, true /* treat it as if it were quoted */);
          }
          

          but we can add an explicit boolean option for those that don't subclass?

          Yonik Seeley added a comment -

          True, but I thought there was something about dealing with this via subclassing you didn't like?
          With the current patch (with no option at all) you could do this per-field behavior with subclassing already:

          True... I'm fine with subclassing - I guess the only diff is if the default is configurable or set only via version.

          Robert Muir added a comment -

          The easiest way might be to just make it configurable per-parser but changeable at any point in time (i.e. just a getter and a setter).
          Per-field can be handled via subclassing (same way per-field differences are handled for everything else with this QP).

          True, but I thought there was something about dealing with this via subclassing you didn't like?

          With the current patch (with no option at all) you could do this per-field behavior with subclassing already:

          @Override
          public Query getFieldQuery(String field, String text, boolean quoted) {
            if (field.equals("foobar"))
              return super.getFieldQuery(field, text, true /* treat it as if it were quoted */);
            else
              return super.getFieldQuery(field, text, quoted);
          }

          Uwe Schindler added a comment -

          +1 @ rmuir & yonik

          I tend toward making the per-field thing available only by subclassing. If we add a per-field property here, we would also need things like a per-field switch on date format and numeric type.

          A new syntax for the parser in 4.0 is essential in my opinion!

          Yonik Seeley added a comment -

          in 3.1 we supply a patch that looks like this one, except, there is a toggle too. this toggle can be per-field.

          The easiest way might be to just make it configurable per-parser but changeable at any point in time (i.e. just a getter and a setter).
          Per-field can be handled via subclassing (same way per-field differences are handled for everything else with this QP).

          Robert Muir added a comment -

          I would suggest the following, a combination of Mike's, Mark's, and Yonik's ideas:

          • in 3.1 we supply a patch that looks like this one, except, there is a toggle too. this toggle can be per-field.
          • in 4.0 we do the same thing, for now.

          We open a separate issue where we replace the QP with something better (one that does not split on whitespace
          at all and allows multi-word synonyms, n-gram tokenization, Vietnamese, etc. to work).

          As part of that issue, take the existing one (with per-field toggle) and call it ClassicQueryParser or whatever.

          I think we should consider the second, separate issue carefully and take more time. I personally would prefer
          if somehow we could "tone down" the flexible queryparser (as far as number of classes, attributes, etc.) and
          use that as a base. There are things about it I like, and things I don't like, but overall it seems to be a better
          starting point for such a thing.

          Mark Miller added a comment -

          How about making the setting ("if analyzer returns more than 1 token for a
          single chunk of whitespace-separated text, make a PhraseQuery")
          configurable (instead of hardwired according to Version)? And defaulting it
          to off for Version >= 31 (so CJK, etc., work out of the box)?

          I think it's pretty clear this would make most people happy.

          Personally, I'm somewhat on board with Robert that this may really hamstring us when it comes to further fixes that are needed/wanted in the future.

          To note though - I think in general, most who have commented on this issue are into making CJK work out of the box. But I really think we need to nail down more consensus on this first.

          At a minimum, I think making the behavior configurable, while defaulting to CJK 'betterness' by default has pretty much everyone on board.

          But I'd really like to discuss whether doing that will only lead to losing that option as we do things like stop the QP from splitting on whitespace in the future...

          Something I was thinking (and it might be more of a maintenance headache than it's worth): we could demote this queryparser from being the core query parser, rename it something like ClassicQueryParser (or whatever), and make a new QueryParser that is better for more languages across the board (originally basing it on the classic parser, e.g. this patch, to start). People that like the older, more English-biased QueryParser can still use it, and by default new users will likely pick up the default QueryParser that works better with more languages out of the box?

          Just an idea.

          In any event - I think this patch is a step forward too - but it looks to me like there are still open concerns and objections.

          Michael McCandless added a comment -

          How about making the setting ("if analyzer returns more than 1 token for a
          single chunk of whitespace-separated text, make a PhraseQuery")
          configurable (instead of hardwired according to Version)? And defaulting it
          to off for Version >= 31 (so CJK, etc., work out of the box)?

          Michael McCandless added a comment -

          I think the latest patch is an OK step forward. Yeah it's not
          perfect, but it's better than what we have today. Progress not
          perfection.

          QP (and more generally all of Lucene's core) should aim to be language
          neutral, and it's not now. This patch improves that (defaulting the
          auto-PhraseQuery to off, by version).

          Yes, compound/syn handling in English should be available, but the way
          QP does this is really quite broken (the analyzer can't do multi-words
          syns).

          There are deep questions/ideas about how to properly do multi-words
          syns/compounds (see LUCENE-1622 – we really need to change the index,
          to store "span" of a token). I'm no longer sure that having analyzer
          mark tokens that should be made into PhraseQuery is the right
          approach... eg with the contrib QP, a plugin could directly tweak the
          query tree, instead of being forced to use a token stream to
          communicate this.

          Yonik Seeley added a comment -

          Let's remember that the bug is "queryparser makes all CJK queries phrase queries regardless of analyzer".
          The ability of an analysis chain to create phrase queries in conjunction with the query parser is a feature.
          The obvious way to reconcile these opposing statements is to make it configurable.
          Per-parser is a bare minimum... per-field would be better... and per-token would be best.

          I won't repeat my previous comments in this thread - however those arguments are still valid.
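          [Editor's note] Yonik's per-parser / per-field / per-token spectrum can be sketched in plain Java. This is an illustrative stand-in, not Lucene API; the class and method names (PhraseToggle, setAutoPhrase, isAutoPhrase) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-field toggle for auto-phrase behavior, in the spirit of
// Yonik's "per-field would be better" suggestion. Not actual Lucene code.
class PhraseToggle {
    private final Map<String, Boolean> perField = new HashMap<>();
    private final boolean parserDefault;

    PhraseToggle(boolean parserDefault) {
        this.parserDefault = parserDefault;
    }

    // Override the behavior for one field; other fields keep the parser default.
    void setAutoPhrase(String field, boolean enabled) {
        perField.put(field, enabled);
    }

    boolean isAutoPhrase(String field) {
        return perField.getOrDefault(field, parserDefault);
    }
}
```

          A per-token toggle would go one step further, letting the analysis chain itself mark which token runs should become phrases; that cannot be modeled with a simple map like this.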

          Mark Miller added a comment -

          I think we should continue working out what's best here.

          I don't think it's wise to try to bully through contentious issues. There should be consensus before something happens here - barring that, some kind of vote makes sense IMO.

          Robert Muir added a comment -

          I've got to -1 this commit.

          As mentioned on apache's website:

          To prevent vetos from being used capriciously, they must be accompanied by a technical justification showing why the change is bad (opens a security exposure, negatively affects performance, etc.). A veto without a justification is invalid and has no weight.
          

          No one has been able to provide any technical justifications, only subjective opinions.

          When standard test collections were used, it was shown that this behavior significantly hurts CJK and delivers only 10% of the effectiveness of standard IR techniques (not generating phrases but using boolean word/bigram queries). See Ivan's results above. This isn't surprising, since CJK IR has been pretty well studied; there is nothing new here.

          At the same time, when English test collections were used, there was no difference; on the contrary, it only tended to slightly improve relevance for English, too.

          Why do we even bother trying to start an openrelevance project if people do not want to go with the scientific method but prefer subjective opinion?

          Koji Sekiguchi added a comment -

          +1 to revert. Though I am a latecomer (as always) and I just read the updated Description, the example behavior of QueryParser for CJK (abcd -> "ab bc cd") looks correct to me, and I'm using QP with CJK as is.

          Uwe Schindler added a comment -

          Revert! Revert! Revert!

          By the way, matchVersion should be final. I would also like to have a separate setter for the auto-phrase functionality. That should be easily possible!
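          [Editor's note] The compromise Uwe describes - a final matchVersion supplying the default, plus an explicit setter - can be modeled in a few lines. This is a sketch only; ParserSettings is a hypothetical stand-in, and the setter name merely mirrors the kind of set/get pair under discussion, not a confirmed API.

```java
// Sketch of a version-gated default plus explicit override.
// Assumption: version 31 stands for Lucene 3.1; pre-3.1 keeps the old
// auto-phrase behavior, 3.1+ defaults it off. Not actual Lucene code.
class ParserSettings {
    private final int matchVersion; // kept final, as Uwe asks
    private boolean autoGeneratePhraseQueries;

    ParserSettings(int matchVersion) {
        this.matchVersion = matchVersion;
        // Version only picks the DEFAULT; the setter below can change it.
        this.autoGeneratePhraseQueries = matchVersion < 31;
    }

    void setAutoGeneratePhraseQueries(boolean enabled) {
        autoGeneratePhraseQueries = enabled;
    }

    boolean getAutoGeneratePhraseQueries() {
        return autoGeneratePhraseQueries;
    }
}
```

          With this split, upgraders who pass their old version keep the old behavior, new users get the language-neutral default, and anyone can opt in or out explicitly.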

          Mark Miller added a comment -

          I know there was more discussion on this in IRC, but I don't see consensus in the issue. I also don't see the issues brought up having been addressed or worked out.

          I've got to -1 this commit. I think I may even be convinced that making this an option will make future improvements we may want too difficult - but nothing has been hammered out in this JIRA issue. It looks like those who have brought up various points have just been ignored.

          -1.

          Mark Miller added a comment -

          For all the debate around this change, that was a pretty fast commit IMO ...

          Robert Muir added a comment -

          Committed revision 948326 (trunk) / 948325 (3x)

          Robert Muir added a comment -

          This patch fixes the bug in all queryparsers. I plan to commit soon.

          If desired, someone can make their own Euro-centric queryparser in the contrib section and I have no objection, as long as it's clearly documented that it's unsuitable for many languages (just like the JDK does).

          Robert Muir added a comment -

          Editing this issue to make it easier to understand.

          Robert Muir added a comment -

          Updated patch that cuts over the remaining two QPs: the flexible queryparser and the precedence queryparser.

          Uwe Schindler added a comment -

          Hi Robert,

          I also agree with Mark (as you know). We can have both:

          • Version for a good default (3.1 will get the new non-phrase-query behavior)
          • A separate getter/setter for this option (set/getCreatePhraseQueryOnConcatenatedTerms or whatever)

          This would give you the best from both worlds.

          Shai Erera added a comment -

          There will be tons of different opinions on that among linguists around the world, but this parser is not for linguists in the first place.

          Sure. I've pointed that out just to show there are different opinions around that particular problem.

          It seems to me making this behavior available with Version is the right way to go

          I disagree (and agree w/ Mark). Version can only control default behavior. This particular issue should be a setter IMO. Regardless of what the default behavior is, I may want to set it differently. It doesn't make sense to guess which Version I should use in order to get that behavior.

          That's why I don't mind leaving the current behavior as default, and introducing a setter for whoever wants to change it. The current behavior is not applicable to just English - I bet there's a whole list of languages which would interpret that query the same way (i.e. require a phrase to be generated).

          And I don't know the distribution of Lucene users around the world, but I'm not sure that CJK users are more common than, say, English ones, or other European languages. So who knows what a good default is?

          I suggest we leave the default as it is now, and introduce a setter. People have been working w/ the parser and that default for a long time. Why suddenly change it?

          Simon Willnauer added a comment -

          Using Version or not is orthogonal to what the default is IMO though. That's why its important whether its considered a bug or an option - Version is not a good option selector at all.

          I disagree: if you upgrade to a new version of Lucene and you already deal with Version, you wanna pass in the version you are upgrading from and get the same behavior. On the other hand, you wanna have a good default when you are new to Lucene, so Version >= 3.1 should give you the new behavior. I guess having a hybrid approach where the given version sets the option sounds like a good compromise - or was that what you were alluding to? I am not saying that this should not be optional; I just wanna make sure we are consistent with what Version is supposed to be used for.

          Mark Miller added a comment -

          It seems to me making this behavior available with Version is the right way to go. I don't care if people call it a bug or a good default for US text - what counts is to give people a good default no matter what they index or where they come from. (sounds like this is close to discriminating people - just kidding)

          Using Version or not is orthogonal to what the default is IMO though. That's why its important whether its considered a bug or an option - Version is not a good option selector at all.

          This is part of the goodness of stable/unstable - default options can change in unstable.

          Simon Willnauer added a comment -

          FWIW, I agree w/ Mark. I don't think it's a bug, but more of a user option

          I would rather agree with Shai and Mark here, but it is also not the behavior I would expect if I came to Lucene for the first time and used the query parser. I still don't understand why this should not be done by using Version. Exactly stuff like that was the reason to introduce it. Using Version to change runtime behavior has been done before (see CharTokenizer, CharacterUtils, LowerCaseFilter). IMO this is a totally valid usage and makes people aware that something has changed. I also figured that Version seriously got adopted in the community; people start using and appreciating it, which somewhat changed my mind on Version.

          And for what's it's also worth, we've once worked w/ a Japanese linguist,

          There will be tons of different opinions on that among linguists around the world, but this parser is not for linguists in the first place. It should give the majority of users a good DEFAULT and has never aimed to be a perfect solution. I usually recommend displeasing everybody equally rather than being biased towards a certain user group when providing an API which is not super special purpose.

          It seems to me making this behavior available with Version is the right way to go. I don't care if people call it a bug or a good default for US text - what counts is to give people a good default no matter what they index or where they come from. (sounds like this is close to discriminating people - just kidding)

          Another thing I wanna mention is that QueryParser should really be in contrib / modules rather than in core. I don't know how many parsers we have in the meantime, but we should really consolidate them in a new module for 4.0. Get the stuff out there, make a copy of the current parser, name it "SmartENBiasedAutomaticPhraseGeneratingQueryParser", fix that in the other one, and provide people a good default.

          just my $0.05

          Shai Erera added a comment -

          FWIW, I agree w/ Mark. I don't think it's a bug, but more of a user option. Whether it should be specified by a setter, or an extension of QP - I have no strong feelings for either of them, so either would be fine by me.

          And for what's it's also worth, we've once worked w/ a Japanese linguist, who suggested that we always convert queries like [abcd] to [abcd "abcd"] or just ["abcd"] because if someone had already bothered to write them like that, then phrase matching should contribute to the rank of the documents. IMO, if someone had gone even further by writing [field:abcd], then even if the query should be [field:a field:b field:c field:d], executing the query [field:"abcd"] is still important and better.

          So .. I'm not trying to argue what should be the default behavior, because that is subject to personal flavor and app requirements – only to emphasize that there are many use cases out there, and we should cater for such scenarios.

          The extension way is already supported, right? So perhaps we just need to document the current behavior, and not change anything? Or, introduce a setter, that will do the simple thing - either keep it as a phrase or break it down to terms. More sophisticated scenarios can be dealt through extension.
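          [Editor's note] The linguist's suggestion Shai relays - rewriting [abcd] to [abcd "abcd"] so phrase matches boost rank without being required - can be illustrated as a simple query-string rewrite. The helper below (expandWithPhrase) is purely hypothetical, not Lucene API.

```java
// Illustrative only: OR the raw text with its exact-phrase form, so that
// documents matching the phrase rank higher while term matches still qualify.
class QueryExpansion {
    static String expandWithPhrase(String field, String text) {
        // e.g. [f:abcd] becomes [f:abcd OR f:"abcd"]
        return field + ":" + text + " OR " + field + ":\"" + text + "\"";
    }
}
```

          A real implementation would do this at the Query-tree level (e.g. in an overridden getFieldQuery) rather than by string concatenation, but the rewrite itself is the same idea.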

          Mark Miller added a comment -

          I still don't think this falls under bug territory myself - which leads me to thinking that Version is not the correct way to handle it.

          The icecream example showing that this is not a 'perfect' solution even for English does not show it's a bug, in my opinion, either.

          I still vote to make this an option. Or make another QueryParser that works with more languages, and I guess with less 'biased' english language operators. The whole idea of the new QP was to make that type of thing easy if I remember right.

          So users who want to emulate the English-optimized "forced PhraseQuery even when user didn't say so explicitly" can create QP

          This should be an option, not an emulation.

          Robert Muir added a comment -

          I added tests for CJK, and cut over all ? extends QueryParser classes.

          TODO still:

          • shouldn't depend on whitespace/term count for coord (so CJK synonyms work nicely)
          • fix flexible and precedence queryparsers.
          Robert Muir added a comment -

          But I don't think we can improve that much (the logic it must implement is fundamentally hairy).

          I do have some suggestions for the future (which we could do when we can get rid of the deprecated behavior).
          But I couldn't implement anything below, because the change I am working on is already hairy enough itself.

          The problem with getFieldQuery is that it does too much. One of the things it does is a bunch of query rewrites.
          I think a lot of this is silly, and I think if you search on "foo" it's OK for the QP to call newPhraseQuery and return a 1-term PhraseQuery.
          This is, after all, what the query was, and it should be closer to a simple parser.
          PhraseQuery already has its own code to later rewrite() to a single TermQuery in this case.

          The other problem with this method is that it is trying to abuse a CachingTokenFilter to work with a TokenStream in 'extra dimensions' that it doesn't have.
          In my opinion, instead of doing this, it should just put the terms from the TokenStream in a more suitable data structure that allows the code to be less hairy,
          instead of using a CachingTokenFilter and reset()'ing it.
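          [Editor's note] The restructuring Robert suggests - drain the analyzer's tokens into a plain list once, then build the query from the list - can be sketched as below. The Iterator here is a stand-in for a real TokenStream, and TokenBuffer is a hypothetical name; this is not Lucene code.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Buffer all tokens up front so query construction can look ahead, count
// terms, or branch on positions freely, instead of rewinding a
// CachingTokenFilter mid-stream.
class TokenBuffer {
    static List<String> drain(Iterator<String> tokenStream) {
        List<String> tokens = new ArrayList<>();
        while (tokenStream.hasNext()) {
            tokens.add(tokenStream.next());
        }
        return tokens;
    }
}
```

          With the tokens in a List, the phrase-vs-boolean decision becomes a simple inspection of the list rather than stream gymnastics.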

          Michael McCandless added a comment -

          Patch looks good Robert!

          You've added a new "boolean quoted" param, to getFieldQuery.

          It's true if the search has double quotes, to force a PhraseQuery. But if there are no double quotes, it's true when Version < 31, else false.

          And, when it's not quoted, you also fixed coord to be enabled for the BQ created, and for the operator to be respected in the BQ that's created (for the "Chinese" cases). The getFieldQuery method is now very scary. But I don't think we can improve that much (the logic it must implement is fundamentally hairy).

          So users who want to emulate the English-optimized "forced PhraseQuery even when user didn't say so explicitly" can create QP with VERSION_30.
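          [Editor's note] The decision Mike describes can be reduced to a small truth table: a PhraseQuery only when the user quoted the text, or when the legacy pre-3.1 auto-phrase behavior is in effect; multiple unquoted tokens otherwise become a BooleanQuery. The sketch below is a simplified stand-in for that logic, not the actual getFieldQuery code.

```java
// Minimal model of the patched getFieldQuery decision. Returned strings name
// the query class that would be built; real code constructs Query objects.
class QueryChoice {
    static String choose(int numTokens, boolean quoted, boolean legacyAutoPhrase) {
        if (numTokens <= 1) {
            return "TermQuery";
        }
        // Phrase only if the user asked for one ("...") or Version < 3.1
        // emulation is on; otherwise a boolean combination of the terms.
        return (quoted || legacyAutoPhrase) ? "PhraseQuery" : "BooleanQuery";
    }
}
```

          So a CJK query of four characters analyzed into bigrams yields a BooleanQuery of those bigrams by default, and a PhraseQuery only under quotes or VERSION_30 emulation.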

          Michael McCandless added a comment -

          But i ask that we open a separate issue for this,
          as providing backwards compat for this separate problem seems like
          it might need "sophisticated backwards layer".

          I agree, let's handle this ("QP should not pre-split on whitespace but instead leave that to the analyzer since it's language specific") as a separate issue.

          Robert Muir added a comment -

          But there's no way to do this today, because QP pre-splits on
          whitespace, for ice cream the analyzer would separately receive ice
          and cream, so it never has a chance to detect this form of the
          compound.

          I completely agree with this... the hack hurts english too!

          But i ask that we open a separate issue for this,
          as providing backwards compat for this separate problem seems like
          it might need "sophisticated backwards layer".

          Solving this first problem can be done without any grammar modifications.

          Robert Muir added a comment - - edited

          Attached is a patch that addresses most of the issue:

          NOTE: I do not tackle the 'split on whitespace' issue as this is a larger change and would require changes to the actual grammar itself. We can open a separate JIRA issue for this. This issue is about not generating phrase queries based on how many terms come out of the hardcoded whitespace-tokenizer in QueryParser.

          • The bug is preserved based on both Version and subclassing, so previous subclasses overriding getFieldQuery work just fine as before.
          • All lucene/solr tests pass; the backwards/ branch still uses LUCENE_CURRENT, so I had to add a TEST_VERSION_CURRENT there so that it runs the tests as LUCENE_30 (this is a problem in general).
          • This fixes a host of issues for CJK: not only do normal queries work, but phrase/sloppy phrase queries work correctly too if you use the double quotes. (If you use the PositionFilter hack, you disable these!)
          • Normal CJK queries also get coord(), which the PositionFilter hack also disables, because it treats the entire query as synonyms. CJK users should get coord() too, like english users; thanks to Ivan for testing this (additional 12% relevance boost on their test collection).
          • Normal CJK queries are formed by the queryparser default operator, same as english queries.
          • English synonym queries (1 term followed by multiple positions) still get coord() disabled as they should.
          • I only migrated one subclass to the new API, the "Extendable Query Parser", and only because its TestExtendable actually extends TestQueryParser from core (so it had to be done).

          TODO:

          • The coord-disabled-BQ code for english synonyms needs to be generalized in case CJK users use synonyms: single terms followed by posinc=0 terms should form coord-disabled BQ's just like they do for english.
          • Add some nice explicit tests for CJK queries, phrase queries, sloppy phrase queries, CJK queries with synonyms, etc.
          • Cutover remaining queryparsers

          edit: by the way this patch is for 3x, not trunk yet, because 3x has back-compat tests enabled
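
          As a toy illustration of the behavior described above: unquoted multi-token input gets combined with the parser's default operator instead of being forced into a phrase. The class and method names here are invented for illustration, and the real parser builds Query objects rather than query strings.

```java
import java.util.List;

// Toy illustration: unquoted multi-token input is combined with the
// parser's default operator; quoted input becomes a phrase. Invented
// names; the real parser builds Query objects, not strings.
public class CjkQuerySketch {
    public static String toQueryString(String field, List<String> tokens,
                                       boolean quoted, String defaultOp) {
        if (quoted) {
            return field + ":\"" + String.join(" ", tokens) + "\"";
        }
        StringBuilder sb = new StringBuilder();
        for (String t : tokens) {
            if (sb.length() > 0) sb.append(' ').append(defaultOp).append(' ');
            sb.append(field).append(':').append(t);
        }
        return sb.toString();
    }
}
```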

          Michael McCandless added a comment -

          OK mulling some more on this one...

          Even for english, the QP hack (pre-splitting on whitespace, then
          turning any text that analyzes to multiple tokens into a
          PhraseQuery), doesn't work right.

          EG, say I want ice-cream, ice cream and icecream to mean the same
          thing. Really I should do this (handling compounds) during indexing
          – I'll get better relevance and performance. But say for some reason
          I'm doing it at search time...

          I would want an analyzer that detects all three forms and in turn
          expands to all three forms in the query.

          But there's no way to do this today, because QP pre-splits on
          whitespace, for ice cream the analyzer would separately receive ice
          and cream, so it never has a chance to detect this form of the
          compound.

          So... first, I think we should fix QP to not pre-split on whitespace.
          QP really should be as language neutral as possible. It should only
          split on syntax chars, and send the whole string in between syntax
          chars to the analyzer.

          And, second, the QP should not create PhraseQuery when it sees
          multiple tokens come back. This obliterates the OOTB experience for
          non-whitespace languages. And, it doesn't work right for
          english... so I think we should deprecate the option and default it to
          "off".

          Really the contrib queryparser is a better fit for doing rewrites like
          this: it's able to operate on the abstract query tree, and can easily
          do things like rewriting the query to add phrase queries...

          Robert Muir added a comment -

          An attribute that says "these tokens go together" or "these tokens should be considered one unit" seems like nice generic functionality, and is unrelated to any specific language or search feature.

          No, if they are one unit for search, they are one token.

          Instead the tokenizer should be fixed so that they are one token, instead of making all languages suffer for the lack of a crappy english tokenizer.

          Robert Muir added a comment -

          We're conflating high level user syntax and the underlying implementation.

          Then you have no problem if we form phrase queries for all adjacent english words, like we do for chinese.

          Perhaps then you will be aware of how wrong this is: this hack is designed to make open compounds match hyphenated compounds in english, or whatever it is.

          You are conflating english syntax and word formation into the query parser itself.

          Yonik Seeley added a comment -

          This is why I like the token attr based solution

          +1

          Although I think it's more general than "de-compounding".
          An attribute that says "these tokens go together" or "these tokens should be considered one unit" seems like nice generic functionality, and is unrelated to any specific language or search feature.

          Yonik Seeley added a comment -

          Instead the queryparser should only form phrasequeries when you use double quotes, just like the documentation says.

          We're conflating high level user syntax and the underlying implementation.

          'text:Ready' says "search for the word 'ready' in the field 'text'"... the fact that this becomes an underlying term query of 'text:readi' (after lowercasing, stemming, etc.) is not incorrect; it's simply the closest match to what the user is asking for given the details of analysis. Likewise, a user query of 'text:ak-47' may end up as a phrase query of "ak 47" because that's the closest representation in the index (the user doesn't necessarily know that the analysis of the field splits on dashes).

          Likewise, a user query of text:"foo bar" is a high level way of saying "search for the word foo immediately followed by the word bar". It is not saying "make a Lucene phrase query object with 2 terms". Synonyms, common grams, or other analysis methods may in fact turn this into a single term query.

          Robert Muir added a comment -

          When StandardAnalyzer splits a part-number-like token, it should do so as well.

          I don't think StandardAnalyzer should do any such thing. Maybe in some screwed-up search engine biased towards english, where analyzers have to work around it, EnglishAnalyzer would do this, but not StandardAnalyzer.

          And now you see why this is no solution at all: we will only end up arguing about the toggle for this awful hack in more places!

          Instead, the tokenizer used for English should tokenize English better, rather than hacking the entire search engine around it.

          Uwe Schindler added a comment - - edited

          Sorry for intervening,

          I am of the same opinion as Hoss:
          A lot of people are accustomed to being able to create phrases in search engines by joining words with dashes (which StandardAnalyzer is perfectly doing with the current query parser impl). As quotes are slower to write, I e.g. always use this approach to search for phrases in Google, this-is-a-phrase, which always works and brings identical results to "this is a phrase" (only ranking is sometimes slightly different in Google).

          So we should at least have some possibility to switch on the behavior that creates phrase queries out of multiple tokens with posIncr>0 – but I am +1 on fixing the problem for non-whitespace languages like CJK. It's also broken, in my opinion, that QueryParser parses whitespace in its javacc grammar; this should be done by the analyzer (and not partly by the analyzer and partly by the QP grammar).

          In addition: I just bring in again non-compounds like product ids...

          Robert Muir added a comment -

          This is why I like the token attr based solution

          I am, and will always be, -1 to this solution. Why can't we try to think about lucene from a proper internationalization architecture perspective?

          You shouldn't design APIs around "e-mail" phenomena in english; that's absurd.

          BTW, this appears to not be an English-only need; this page
          (http://www.seobythesea.com/?p=1206) lists these example languages as
          also using "English-like" compound words: "Some example languages that
          use compound words include: Afrikaans, Danish, Dutch-Flemish, English,
          Faroese, Frisian, High German, Gutnish, Icelandic, Low German,
          Norwegian, Swedish, and Yiddish."

          Please don't try to insinuate that phrases are the way you should handle compound terms in these languages unless you have some actual evidence that they should be used instead of "normal decompounding".

          These languages have different syntax and word formation, and its simply not appropriate.

          Michael McCandless added a comment -

          I'd like a solution that lets us have our cake and eat it too...

          Ie, we clearly have to fix the disastrous out-of-the-box experience
          that non-whitespace languages (CJK) now have with Lucene. This is
          clear.

          But, when an analyzer that splits English-like compound words (eg
          e-mail -> e mail) is used, I think this should also continue to create
          a PhraseQuery, out-of-the-box.

          Today when a user searches for "e-mail", s/he will correctly see only
          "email/e-mail" hit & highlighted in the search results. If we break
          this behaviour, ie no longer produce a PQ out-of-the-box, suddenly
          hits with just "mail" will be returned, which is bad.

          So a single setter on QueryParser w/ a global default is not a good
          enough solution – it means either CJK or English-like compound words
          will be bad.

          This is why I like the token attr based solution – those analyzers
          that are doing "English-like" de-compounding can mark the tokens as
          such. Then QueryParser can notice this attr and (if configured to do so, via
          setter), create a PhraseQuery out of that sequence of tokens.

          This then pushes the decision of which series of Tokens are produced
          via "English-like" de-compounding. EG I think WordDelimiterFilter
          should be default mark its tokens as such (the majority of users use
          it this way). When StandardAnalyzer splits a part-number-like token,
          it should do so as well.

          This isn't a perfect solution: it's not easy, in general, for an
          analyzer to "know" its splits are "English-like" de-compounding, but
          this would still give us a solid step forward (progress not
          perfection). And, since the decision point is now in the analyzer,
          per-token, it gives users complete flexibility to customize as needed.

          BTW, this appears to not be an English-only need; this page
          (http://www.seobythesea.com/?p=1206) lists these example languages as
          also using "English-like" compound words: "Some example languages that
          use compound words include: Afrikaans, Danish, Dutch-Flemish, English,
          Faroese, Frisian, High German, Gutnish, Icelandic, Low German,
          Norwegian, Swedish, and Yiddish."
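
          The per-token attribute idea above can be sketched as follows. This is purely hypothetical: an analyzer that splits "e-mail" into [e, mail] could mark both pieces as belonging to one compound, and the parser could group marked runs into a phrase. Tok and groupCompounds are invented names; no such attribute exists in Lucene.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed per-token attribute. An analyzer
// marks tokens produced by "English-like" de-compounding; the parser
// groups consecutive marked tokens into one unit (a candidate phrase).
public class CompoundGroupSketch {
    public static final class Tok {
        final String text;
        final boolean partOfCompound;
        public Tok(String text, boolean partOfCompound) {
            this.text = text;
            this.partOfCompound = partOfCompound;
        }
    }

    public static List<List<String>> groupCompounds(List<Tok> tokens) {
        List<List<String>> groups = new ArrayList<>();
        List<String> run = new ArrayList<>();
        for (Tok t : tokens) {
            if (t.partOfCompound) {
                run.add(t.text);                       // extend the marked run
            } else {
                if (!run.isEmpty()) { groups.add(run); run = new ArrayList<>(); }
                groups.add(List.of(t.text));           // standalone token
            }
        }
        if (!run.isEmpty()) groups.add(run);
        return groups;
    }
}
```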

          DM Smith added a comment -

          As I see it there are two issues:
          1) Backward compatibility.
          2) Correctness according to the syntax definition of a query.

          Let me preface the following by saying I have not studied the query parser in Lucene. Over 20 years ago I got an MS in compiler writing. I've been away from it for quite a while.

          So, IMHO as a former compiler writer:

          Maybe I'm just not "getting it" but it should be trivial to define the grammar (w/ precedence for any ambiguity, if necessary) and implement it. The tokenizer for the parser should have the responsibility to break the input into sequences of meta and non-meta. This tokenizer should not be anything more than what the parser requires.

          The non-meta reasonably is subject to further tokenization/analysis. This further analysis should be entirely under the user's control. It should not be part of the parser.

          Regarding the issue, I think it would be best if a quotation were the sole criterion for determining what is a phrase, not some heuristic analysis of the token stream.

          Robert Muir added a comment -

          but all other things being equal let's keep the behavior working the way it has to avoid surprises.

          This attitude makes me sick. The surprise is to the CJK users that get no results due to undocumented, english-specific hacks that people refuse to let go of.

          Hoss Man added a comment -

          Instead the queryparser should only form phrasequeries when you use double quotes, just like the documentation says.

          I'll grant you that the documentation is wrong – I view that as a bug in the documentation, not the code.

          Saying that a PhraseQuery object should only ever be constructed if the user uses quotes is like saying that a BooleanQuery should only ever be constructed if the user specifies boolean operators – there is no rule that the syntax must match the query structure; the same Query classes can serve multiple purposes. The parser syntax should be what makes sense for the user, and the query structure constructed should be what makes sense for the index, based on the syntax used by the user.

          If I have built an index consisting entirely of ngrams, the end user shouldn't have to know that – they shouldn't have to put every individual word in quotes to force a PhraseQuery to be constructed out of the ngram tokenstream produced by an individual word.

          Why isn't english treated this way too? I don't consider this bias towards english "at all costs", including preventing languages such as Chinese from working at all, very fair; I think it's a really ugly stance for Lucene to take.

          I personally don't view it as an "english bias" ... to me it is a "backwards compatibility bias"

          I'm totally happy to make things configurable, but if two diametrically opposed behaviors are both equally useful, and a choice needs to be made between leaving the default configuration the way the current hardcoded behavior is, or making the default the exact opposite of the current hardcoded behavior, then I would prefer to leave the default alone – especially since this behavior has been around for so long, and many Analyzers and TokenFilters have been written with this behavior specifically in mind (several examples of this are in the Lucene code base – and if we have them in our own code, you can be sure they exist "in the wild" in client code that would break if this behavior changed by default)

          Once again: if this is a problem that can be solved "per instance" with token attributes, then by all means let's make all of the TokenFilters that come "out of the box" implement this appropriately (english and non-english alike), so that people who change the default settings on the queryparser get the "correct" behavior regardless of language. But all other things being equal let's keep the behavior working the way it has to avoid surprises.

          What are some real use-cases where this is "good"?

          • WordDelimiterFilter ("wi-fi" is a pathologically bad example for this issue because, as robert pointed out, "wi" and "fi" don't tend to exist independently in english, but people tend to get annoyed when "race-horse" matches all docs containing "race" or "horse")
          • single word to multiword synonym expansion/transformation (particularly acronym expansion: GE => General Electric)
          • Ngram indexing for fuzzy matching (if someone searches for the word billionaire they're going to be surprised to get documents containing "lion")
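
          The n-gram point above can be made concrete with a toy character-trigram tokenizer (TrigramSketch is invented for illustration): "billionaire" shares the trigrams "lio" and "ion" with "lion", so without a positional, phrase-like constraint over the grams, a search for billionaire could match a document containing only "lion".

```java
import java.util.ArrayList;
import java.util.List;

// Toy character-trigram tokenizer illustrating the n-gram use case:
// "billionaire" and "lion" share trigrams, so matching grams as a loose
// bag of terms (rather than a positional sequence) produces false hits.
public class TrigramSketch {
    public static List<String> trigrams(String word) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= word.length(); i++) {
            grams.add(word.substring(i, i + 3));
        }
        return grams;
    }
}
```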
          Show
          Hoss Man added a comment - Instead the queryparser should only form phrasequeries when you use double quotes, just like the documentation says. i'll grant you that the documentation is wrong – i view that as a bug in the documentation, not the code. Saying that a PhraseQuery object should only ever be constructed if the user uses quotes is like saying that a BooleanQuery should only ever be constructed if the user specifies boolean operators – there is no rule that the syntax must match the query structure, the same Query classes can serve multiple purposes. The parser syntax should be what makes sense for hte user, and the query structure constructed should be what makes sense for hte index, based on the syntax used by the user. If i have built an index consisting entirely of ngrams, the end user shouldn't have to know that – they shouldn't have to put every individual word in quotes to force a PhraseQuery to be constructed out of the ngram tokenstream produced by an individual word. Why isn't english treated this way too? I don't consider this bias towards english "at all costs" including preventing languages such as Chinese from working at all very fair, I think its a really ugly stance for Lucene to take. I personally don't view it as an "english bias" ... 
to be it is a "backwards compatibility bias" I'm totally happy to make things onfigurable, but if two diametrically opposed behaviors are both equally useful, and If there is a choice needs to be made between leaving the default configuration the way the current hardcoded behavior is, or make the default the exact opposite of what the current hardcoded behavior is, it is then i would prefer to leave the default alone – especially since this beahior has been around for so long, and many Analyzers and TOkenFilters, have been written with this behavior specificly in mind (several examples of this are in the Lucene code base – and if we have them in our own code, you can be sure they exist "in the wild" of client code that would break if this behavior changes by default) Once again: if this is a problem that can be solved "per instance" with token attributes, then by all means let's make all of the TokenFIlters that come "out of the box" implement this appropriately (english and non-english alike) so that people who change the default settings on the queryparser get the "correct" behavior regardless of langauge. but all other things being equal lets keep the behavior working the way it has to avoid suprises. What are some real use-cases where this is "good"? WordDelimiterFilter ("wi-fi" is a pathalogicaly bad example for this issue because as robert pointed out "wi" and "fi" don't tend to exist independently in english, but people tend to get anoyed when "race-horse" matches all docs containing "race" or "horse") single word to multiword synonym expansion/transformation (particularly acronym expansion: GE => General Electric) Ngram indexing for fuzzy matching (if someone searches for the word billionaire they're going to be surprised to get documents containing "lion")
          Robert Muir added a comment -

          > Change the initial split on whitespace to be customizable. Override the
          > splitting behavior for non-whitespace-delimited languages and feed
          > getFieldQuery() smaller chunks.

          Whitespace doesn't separate words in the majority of the world's languages, including english.

          The responsibility should be instead on english to do its language-specific processing, not on everyone else to dodge it.

          Marvin Humphrey added a comment -

          > I'm honestly having a tough time seeing where to proceed on this issue.

          Change the initial split on whitespace to be customizable. Override the
          splitting behavior for non-whitespace-delimited languages and feed
          getFieldQuery() smaller chunks.

          That solves your problem without removing behavior most people believe to be
          helpful. Insisting on that orthogonal change is what is holding things up.

          Ivan Provalov added a comment -

          Robert has asked me to post our test results on the Chinese Collection. We used the following data collection from TREC:

          http://trec.nist.gov/data/qrels_noneng/index.html
          qrels.trec6.29-54.chinese.gz
          qrels.1-28.chinese.gz

          http://trec.nist.gov/data/topics_noneng
          TREC-6 Chinese topics (.gz)
          TREC-5 Chinese topics (.gz)

          Mandarin Data Collection
          http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2000T52

          Analyzer Name     Plain analyzers   Added PositionFilter (only at query time)
          ChineseAnalyzer   0.028             0.264
          CJKAnalyzer       0.027             0.284
          SmartChinese      0.027             0.265
          IKAnalyzer        0.028             0.259

          (Note: IKAnalyzer has its own IKQueryParser which yields 0.084 for the average precision)
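
For readers unfamiliar with the trick: PositionFilter forces the position increment of every token after the first to zero, so the query parser sees the tokens as stacked at a single position and builds a BooleanQuery instead of a PhraseQuery. The sketch below is a minimal model of that decision logic, not the real Lucene classes (the names `PositionFilterModel`, `Token`, and the string-valued `getFieldQuery` are invented for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of how QueryParser decides between a PhraseQuery and a
// BooleanQuery from a token stream. Names are illustrative, not Lucene API.
public class PositionFilterModel {

    record Token(String term, int posIncrement) {}

    // PositionFilter: the first token keeps its increment, the rest get 0,
    // so the whole stream occupies a single position.
    static List<Token> positionFilter(List<Token> in) {
        List<Token> out = new ArrayList<>();
        for (int i = 0; i < in.size(); i++) {
            out.add(new Token(in.get(i).term(), i == 0 ? in.get(i).posIncrement() : 0));
        }
        return out;
    }

    // getFieldQuery: tokens spanning more than one position are auto-phrased;
    // tokens stacked at one position become a boolean (OR) query.
    static String getFieldQuery(List<Token> tokens) {
        if (tokens.size() == 1) return "TermQuery(" + tokens.get(0).term() + ")";
        int positionsSpanned = 0;
        List<String> terms = new ArrayList<>();
        for (Token t : tokens) {
            positionsSpanned += t.posIncrement();
            terms.add(t.term());
        }
        return positionsSpanned > 1
            ? "PhraseQuery(\"" + String.join(" ", terms) + "\")"
            : "BooleanQuery(" + String.join(" ", terms) + ")";
    }
}
```

With CJK bigrams "ab bc cd" at consecutive positions, this model auto-phrases; run through the filter first, the same tokens come out as a boolean query, which matches the behavior the measurements above compare.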

          Thanks,

          Ivan Provalov

          Robert Muir added a comment - edited

          edit: s/control/contrib. I apologize for the typo.

          As described in another recent
          thread, this allows a search for 'example.com' to match a document which
          contains the URL 'http://www.example.com/index.html'. It would suck if all of
          a sudden a search for 'example.com' started matching every document that
          contained 'com'.

          You could solve this with better analysis, for example recognizing the full URL and decomposing it into its parts (forming n-grams of them).
          This would be more performant than the current "english hacking" anyway.
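
As an illustration of the kind of decomposition meant here (a toy model invented for this example, not a real Lucene tokenizer; the class name and splitting rules are assumptions):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of decomposing a URL into searchable parts, so a query for
// "example.com" can match "http://www.example.com/index.html" through the
// emitted sub-tokens instead of through auto-phrase behavior.
public class UrlDecomposer {

    static List<String> decompose(String url) {
        List<String> tokens = new ArrayList<>();
        String rest = url.replaceFirst("^[a-z]+://", "");  // drop the scheme
        for (String part : rest.split("/")) {
            if (part.isEmpty()) continue;
            tokens.add(part);  // whole path component, e.g. "www.example.com"
            if (part.contains(".")) {
                String[] labels = part.split("\\.");
                for (String label : labels) tokens.add(label);  // single labels
                // adjacent label pairs (the "n-grams of the parts"), e.g. "example.com"
                for (int i = 0; i + 1 < labels.length; i++) {
                    tokens.add(labels[i] + "." + labels[i + 1]);
                }
            }
        }
        return tokens;
    }
}
```

Because the index then contains "example.com" as its own term, the query side no longer needs to reassemble it out of "example" and "com" with an implicit phrase query.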

          I'm honestly having a tough time seeing where to proceed on this issue.

          Lucene's queryparsing is completely broken for several languages due to this bug. Such language-specific hacking (heuristically forming phrase queries based on things that people subjectively feel help for english) really doesn't belong in core lucene; it belongs elsewhere, perhaps in some special optional pass in the contrib query parser.

          The queryparser really should be language-independent and work well on average, this would fix it for several languages.

          However, given the huge english bias I see here, I have a tough time seeing what concrete direction (e.g. code) I can work on to try to fix it. I feel such work would only be rejected, since so many people seem opposed to simplifying the query parser and removing this language-specific hack.

          If someone brings up an issue with the query parser (for instance, I brought up several language-specific problems at ApacheCon), people are quick to say that it doesn't belong in the queryparser and should be dealt with as a special case. Why isn't english treated this way too? I don't consider this bias towards english "at all costs" (including preventing languages such as Chinese from working at all) very fair; I think it's a really ugly stance for Lucene to take.

          Marvin Humphrey added a comment -

          > Because they show its 10x better to use this operator for Chinese

          Another way to achieve this 10x improvement is to change how QP performs its
          first stage of tokenization, as you and I discussed at ApacheCon Oakland.

          Right now QP splits on whitespace. If that behavior were customizable, e.g.
          via a "splitter" Analyzer, then individual Han characters would get submitted
          to getFieldQuery() – and thus getFieldQuery() would no longer turn long
          strings of Han characters into a PhraseQuery. It seems wrong to continue to
          push entire query strings from non-whitespace-delimited languages down into
          getFieldQuery().
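
A rough sketch of such a pre-splitting pass, using the JDK's Character.UnicodeScript to isolate Han characters before they would reach getFieldQuery(). This is illustrative only (QuerySplitter is not a real Lucene class, and a real splitter would presumably be pluggable rather than hard-coding the Han check):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a "splitter" pass that breaks runs of Han characters into
// single-codepoint chunks while leaving whitespace-delimited text alone.
public class QuerySplitter {

    static List<String> split(String query) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < query.length(); ) {
            int cp = query.codePointAt(i);
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                if (current.length() > 0) { chunks.add(current.toString()); current.setLength(0); }
                chunks.add(new String(Character.toChars(cp)));  // one chunk per Han character
            } else if (Character.isWhitespace(cp)) {
                if (current.length() > 0) { chunks.add(current.toString()); current.setLength(0); }
            } else {
                current.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) chunks.add(current.toString());
        return chunks;
    }
}
```

Each chunk would then be handed to getFieldQuery() individually, so a run of Han characters can no longer collapse into one implicit PhraseQuery.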

          Marvin Humphrey added a comment -

          I have mixed feelings about this for English. It's a weakness of our engine
          that we do not take position of terms within a query string into account. At
          times I've tried to modify the scoring hierarchy to improve the situation, but
          I gave up because it was too difficult. This behavior of QueryParser is a
          sneaky way of getting around that limitation by treating stuff which should
          almost certainly be phrase queries as such. It's the one place
          where we actually exploit position data within the query string.

          Mike's "wi-fi" example, though, wouldn't suffer that badly. The terms "wi"
          and "fi" are unlikely to occur much outside the context of 'wi-fi/wi fi/wifi'.
          And treating "wi-fi" as a phrase still won't conflate results with "wifi" as
          it would ideally.

          The example I would use doesn't typically apply to Lucene. Lucene's
          StandardAnalyzer tokenizes URLs as wholes, but KinoSearch's analogous analyzer
          breaks them up into individual components. As described in another recent
          thread, this allows a search for 'example.com' to match a document which
          contains the URL 'http://www.example.com/index.html'. It would suck if all of
          a sudden a search for 'example.com' started matching every document that
          contained 'com'.

          You could, and theoretically should, address this problem with sophisticated
          analysis. But it does make it harder to write a good Analyzer. By making
          what Yonik calls the 'e space mail' problem worse, you make it more
          important to solve.

          Robert Muir added a comment -

          > What are some real use-cases where this is "good"? WordDelimiterFilter seems like a good example (eg, Wi-Fi -> Wi Fi).
          > It sounds like it's a very bad default for non-whitespace languages.

          It's a horrible bug! And to boot, I don't think it helps english much as a default either.
          Here's a comparison on an english test collection (Telegraph collection with StandardAnalyzer + porter):

          measure                      T         TD        TDN
          % of queries affected        6%        14%       32%
          positionfilter improvement   +1.704%   +0.213%   +0.805%

          So, turning it off certainly doesn't hurt (I won't try to argue that this small "improvement" by turning it off means anything).
          For chinese, it's a 10x improvement on TREC5/TREC6: obviously the bug is horrible there because it's generating phrase queries all the time.

          > This seems like a good idea (since we seem to have real-world cases where it's very useful and others where it's very bad)? Could/should it be per-analyzer? (ie, WDF would always do this but, say, ICUAnalyzer would never). Or, per-token created?

          I am strongly opposed to this. My tibetan example with WDF above is an easy counterexample.
          I haven't seen any measured real-world example where this helps; subjectively saying "I like this bug" isn't convincing me.

          We don't need to push "what should be a phrase query" onto analysis; it doesn't know, from unicode properties etc, what the user wanted.
          We don't need to put hairy logic into things like StandardTokenizer to determine whether "the user wanted a phrase query" in certain contexts.

          Instead we should just do what the documentation says, and only issue phrase queries when the user asks for one!

          Michael McCandless added a comment -

          This is sneaky behavior on QueryParser's part! I didn't realize it did this.

          What are some real use-cases where this is "good"? WordDelimiterFilter seems like a good example (eg, Wi-Fi -> Wi Fi).

          It sounds like it's a very bad default for non-whitespace languages.

          It seems like we should make it controllable, switch it under Version, and change the default going forward to not do this?

          Token Attributes could be used to instruct QueryParser as to the intent behind a stream of multiple tokens?

          This seems like a good idea (since we seem to have real-world cases where it's very useful and others where it's very bad)? Could/should it be per-analyzer? (ie, WDF would always do this but, say, ICUAnalyzer would never). Or, per-token created?

          Robert Muir added a comment -

          by the way hoss man you said it best yourself:

          lots of existing TokenFilters produce chains of tokens for situations where the user creating the query string clearly intended to be searching for a single "word" and has no idea that as an implementation detail multiple tokens were produced under the covers (ie: WordDelimiterFilter, Ngrams, etc...)

          "The user clearly intended" is wrong. WordDelimiterFilter will break tibetan text in a similar manner (tibetan uses no spaces between words), yet no user "clearly intended" to form phrase queries.

          Users clearly intend to form phrase queries only when they use the phrase query operator. That's how the query parser is documented to work, and it's a bug that it doesn't work that way.

          Robert Muir added a comment -

          > That seems like equally bad default behavior

          Do you have measurements to support this? Because they show it's 10x better to use this operator for Chinese.

          > I haven't thought this through very well, but perhaps this is an area where (the new) Token Attributes

          I disagree. Instead the queryparser should only form phrase queries when you use double quotes, just like the documentation says.

          Hoss Man added a comment -

          > a Boolean Query formed with the default operator.

          That seems like equally bad default behavior – lots of existing TokenFilters produce chains of tokens for situations where the user creating the query string clearly intended to be searching for a single "word" and has no idea that as an implementation detail multiple tokens were produced under the covers (ie: WordDelimiterFilter, Ngrams, etc...)

          I haven't thought this through very well, but perhaps this is an area where (the new) Token Attributes could be used to instruct QueryParser as to the intent behind a stream of multiple tokens? A new Attribute could be used on each token to convey when that token should be combined with the previous token, and in what way: as a phrase, as a conjunction or as a disjunction. (this could still be orthogonal to the position, which would indicate slop/span type information like it does currently)

          Stock analysis components that produce multiple tokens could be modified to add this attribute fairly easily (it should be a relatively static value for any component that currently "splits" tokens) and QueryParser could have an option controlling what to do if it encounters a token w/o this attribute (perhaps even two options: one for quoted input chunks and one for unquoted input chunks).

          that way the default could still work in a back compatible way, but people using languages that don't use whitespace separation and are using older (or custom) analyzers that don't know about this attribute could set a simple query parser property to force this behavior.

          would that make sense? (asks the man who only vaguely understands Token Attributes at this point)
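
The proposed attribute could look something like the following toy model, where each token says how it joins the previous one and the parser falls back to a configurable default when the attribute is absent. All names here (CombineMode, Token, buildQuery) are hypothetical; no such attribute exists in Lucene:

```java
import java.util.List;

// Toy model of the idea above: each token carries an attribute saying how it
// combines with the previous token (phrase, conjunction, or disjunction),
// and the parser falls back to a default when the attribute is missing.
public class CombineAttributeModel {

    enum CombineMode { PHRASE, AND, OR }

    record Token(String term, CombineMode combineWithPrevious) {}

    static String buildQuery(List<Token> tokens, CombineMode fallback) {
        StringBuilder query = new StringBuilder(tokens.get(0).term());
        for (int i = 1; i < tokens.size(); i++) {
            CombineMode mode = tokens.get(i).combineWithPrevious() != null
                ? tokens.get(i).combineWithPrevious()
                : fallback;  // e.g. the parser's default operator
            switch (mode) {
                case PHRASE -> query.append(' ').append(tokens.get(i).term());
                case AND -> query.append(" AND ").append(tokens.get(i).term());
                case OR -> query.append(" OR ").append(tokens.get(i).term());
            }
        }
        return query.toString();
    }
}
```

Under this model WordDelimiterFilter could tag "fi" with PHRASE so "wi-fi" still becomes a phrase, while a CJK analyzer would leave the attribute unset and get the parser's default operator instead.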

          Robert Muir added a comment -

          > ...what should the resulting Query object be?

          a Boolean Query formed with the default operator.

          Hoss Man added a comment -

          Robert: do you have a specific suggestion for what QueryParser should do if a single "chunk" of input causes the Analyzer to produce multiple tokens that are not at the same position (ie: the current case where QueryParser produces a PhraseQuery even if there are no quotes)?

          Ie: if the query parser is asked to parse...

          fieldName:A-Field-Value

          ...and the Analyzer produces three tokens...

          • A (at position 0)
          • Field (at position 1)
          • Value (at position 2)

          ...what should the resulting Query object be?
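
The two behaviors under discussion for this example can be modeled as plain strings (illustrative only; no real Query classes are involved, and the field-prefix formatting is schematic):

```java
import java.util.List;

// Contrast of the two defaults discussed in this issue, rendered as strings:
// today's auto-phrase behavior vs. the proposed boolean default.
public class AutoPhraseExample {

    // Current: more than one token from a single unquoted chunk silently
    // becomes a phrase query, as if the user had typed quotes.
    static String currentBehavior(String field, List<String> tokens) {
        return field + ":\"" + String.join(" ", tokens) + "\"";
    }

    // Proposed: combine the tokens with the parser's default operator instead.
    static String proposedBehavior(String field, List<String> tokens, String defaultOp) {
        return field + ":" + String.join(" " + defaultOp + " ", tokens);
    }
}
```

So for fieldName:A-Field-Value analyzed into A, Field, Value, the current parser yields the phrase form while the proposal yields the boolean form, and the phrase form remains available explicitly via double quotes.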


            People

            • Assignee: Robert Muir
            • Reporter: Robert Muir
            • Votes: 0
            • Watchers: 2

            Dates

            • Created:
            • Updated:
            • Resolved: