Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1
    • Component/s: None
    • Labels:
      None

      Description

      An improved user-facing query parser based on dismax

      1. SOLR-1553.pf-refactor.patch
        8 kB
        Hoss Man
      2. SOLR-1553.patch
        53 kB
        Yonik Seeley
      3. edismax.userFields.patch
        3 kB
        Hoss Man
      4. edismax.unescapedcolon.bug.test.patch
        2 kB
        Hoss Man
      5. edismax.unescapedcolon.bug.test.patch
        4 kB
        Ryan McKinley

        Issue Links

          Activity

          Hide
          Grant Ingersoll added a comment -

          Bulk close for 3.1.0 release

          Show
          Grant Ingersoll added a comment - Bulk close for 3.1.0 release
          Hide
          Ryan McKinley added a comment -

          the 'experimental' label is a flag to say that the behavior will likely change in the future – since back compatibility is taken so seriously, this allows a way to add features before they are 100% cooked.

          Show
          Ryan McKinley added a comment - the 'experimental' label is a flag to say that the behavior will likely change in the future – since back compatibility is taken so seriously, this allows a way to add features before they are 100% cooked.
          Hide
          Hoss Man added a comment -

          I'm confused about why this cool query parser I've been using is "experimental"

          because some of it's current "default" behavior is less then ideal, particularly for people migrating from dismax (ie: see comments about making field queries configurable) and in a few cases even "broken" compared to how it worked when the patch was initially commited (see recent comments about foo:bar when foo is not a field)

          in general, marking it experimental is a way to allow us to leave it in the 3.1 release but still have the flexibility to modify the "default" behavior moving forward.

          Show
          Hoss Man added a comment - I'm confused about why this cool query parser I've been using is "experimental" because some of it's current "default" behavior is less then ideal, particularly for people migrating from dismax (ie: see comments about making field queries configurable) and in a few cases even "broken" compared to how it worked when the patch was initially commited (see recent comments about foo:bar when foo is not a field) in general, marking it experimental is a way to allow us to leave it in the 3.1 release but still have the flexibility to modify the "default" behavior moving forward.
          Hide
          David Smiley added a comment -

          I'm confused about why this cool query parser I've been using is "experimental". Sure, there are opportunities for improvement, but it's already better than the original dismax which this makes obsolete. No?

          Show
          David Smiley added a comment - I'm confused about why this cool query parser I've been using is "experimental". Sure, there are opportunities for improvement, but it's already better than the original dismax which this makes obsolete. No?
          Hide
          Yonik Seeley added a comment -

          Resolving. Improvements can be tracked in a new issue.

          Show
          Yonik Seeley added a comment - Resolving. Improvements can be tracked in a new issue.
          Hide
          Hoss Man added a comment -

          I'll keep the issue open in 3.1 for a few more days as discussed, then i'm moving it out.

          it would be less confusing to just resolve it as fixed, and open new issues to track the outstanding problems/bugs/questions.

          Show
          Hoss Man added a comment - I'll keep the issue open in 3.1 for a few more days as discussed, then i'm moving it out. it would be less confusing to just resolve it as fixed, and open new issues to track the outstanding problems/bugs/questions.
          Hide
          Robert Muir added a comment -

          I marked this experimental in trunk.

          I'll keep the issue open in 3.1 for a few more days as discussed, then i'm moving it out.

          Show
          Robert Muir added a comment - I marked this experimental in trunk. I'll keep the issue open in 3.1 for a few more days as discussed, then i'm moving it out.
          Hide
          Ryan McKinley added a comment -

          +1 to mark as experimental in 3.1


          The innards of how this works it totally greek, but tried finding somethign to fix hoss' unescaped patch. It seems that the root of the problem is that QueryParserBase.parse( String ) will return a BooleanQuery with no clauses for the invalid field query.

          Query res = TopLevelQuery(field);
          return res!=null ? res : newBooleanQuery(false);
          

          Then the edismax just checks if the parsedQuery is null to see if it is valid.

          I tried just returning null from the QueryParserBase, but that (not surprisingly) breaks other tests like TestMultiFieldQueryParser. I imagine somethign changed here for why it used to work, and now "mysteriously" does not.

          Adding a check for empty BooleanQuery fixes this in edismax though:

           if( parsedUserQuery instanceof BooleanQuery ) {
             if( ((BooleanQuery)parsedUserQuery).getClauses().length < 1 ) {
               parsedUserQuery = null;
             }
           }
          

          All tests pass... but can someone who knows what the ramifications of this change means take a look?

          Show
          Ryan McKinley added a comment - +1 to mark as experimental in 3.1 The innards of how this works it totally greek, but tried finding somethign to fix hoss' unescaped patch. It seems that the root of the problem is that QueryParserBase.parse( String ) will return a BooleanQuery with no clauses for the invalid field query. Query res = TopLevelQuery(field); return res!= null ? res : newBooleanQuery( false ); Then the edismax just checks if the parsedQuery is null to see if it is valid. I tried just returning null from the QueryParserBase, but that (not surprisingly) breaks other tests like TestMultiFieldQueryParser. I imagine somethign changed here for why it used to work, and now "mysteriously" does not. Adding a check for empty BooleanQuery fixes this in edismax though: if ( parsedUserQuery instanceof BooleanQuery ) { if ( ((BooleanQuery)parsedUserQuery).getClauses().length < 1 ) { parsedUserQuery = null ; } } All tests pass... but can someone who knows what the ramifications of this change means take a look?
          Hide
          Mark Miller added a comment -

          Hi, can we just mark this stuff experimental for 3.1?

          +1

          Show
          Mark Miller added a comment - Hi, can we just mark this stuff experimental for 3.1? +1
          Hide
          Robert Muir added a comment -

          Hi, can we just mark this stuff experimental for 3.1?

          If nobody takes action on this issue within 7 days,
          I plan to add the text to README.txt and move the issue out of 3.1.

          Show
          Robert Muir added a comment - Hi, can we just mark this stuff experimental for 3.1? If nobody takes action on this issue within 7 days, I plan to add the text to README.txt and move the issue out of 3.1.
          Hide
          Mark Miller added a comment -

          Heh - one of the issues blocked by this is already resolved.

          Show
          Mark Miller added a comment - Heh - one of the issues blocked by this is already resolved.
          Hide
          Yonik Seeley added a comment -

          Users do not expect the first query to remove all instances of "foo" from their results. "" meaning "forbidden" should only take effect when the "" is followed by a non-whitespace character.

          Graham, this has just been fixed in trunk: LUCENE-2566

          Show
          Yonik Seeley added a comment - Users do not expect the first query to remove all instances of "foo" from their results. " " meaning "forbidden" should only take effect when the " " is followed by a non-whitespace character. Graham, this has just been fixed in trunk: LUCENE-2566
          Hide
          Hoss Man added a comment -

          Gunnar: this is precisely what i was attempting with the edismax.userFields.patch file but it doesn't work at the moment, probably because of the problem i identified in edismax.unescapedcolon.bug.test.patch (see previous comments, particularly discussion with Jonathan.

          If you've got some time/energy to dig into those patches and help get them working that would be great

          Show
          Hoss Man added a comment - Gunnar: this is precisely what i was attempting with the edismax.userFields.patch file but it doesn't work at the moment, probably because of the problem i identified in edismax.unescapedcolon.bug.test.patch (see previous comments, particularly discussion with Jonathan. If you've got some time/energy to dig into those patches and help get them working that would be great
          Hide
          Gunnar Wagenknecht added a comment - - edited

          Re: field-based queries (field:value)

          I was wonder if it would be possible to add support for explicitly limiting the allowed fields? For example, I'd like to be able to not allow queries in internal fields but only in a defined set of "public" fields (eg. GMail recognizes "to", "subject", etc. but really only those).

          Another nice feature would be alias definitions for fields. This would allow to support localized field names in queries which translate to the real field name. (eg. "an" -> "to", "Betreff" -> "subject", "betreff" -> "subject").

          Thoughts?

          Show
          Gunnar Wagenknecht added a comment - - edited Re: field-based queries (field:value) I was wonder if it would be possible to add support for explicitly limiting the allowed fields? For example, I'd like to be able to not allow queries in internal fields but only in a defined set of "public" fields (eg. GMail recognizes "to", "subject", etc. but really only those). Another nice feature would be alias definitions for fields. This would allow to support localized field names in queries which translate to the real field name. (eg. "an" -> "to", "Betreff" -> "subject", "betreff" -> "subject"). Thoughts?
          Hide
          Lance Norskog added a comment -

          Does edismax handle wildcards, and should it?

          To handle the upper/lower case problem (which extends to character set folding), text and string fields could specify a CharFilter for indexing. Which promptly interacts with faceting because facets come from indexed values, not stored values.

          Show
          Lance Norskog added a comment - Does edismax handle wildcards, and should it? To handle the upper/lower case problem (which extends to character set folding), text and string fields could specify a CharFilter for indexing. Which promptly interacts with faceting because facets come from indexed values, not stored values.
          Hide
          Ron Mayer added a comment -

          I very much like edismax - The "pf2" parameter in particular is doing wonders for getting my most relevant documents to the very top of the list in one of my apps.

          Show
          Ron Mayer added a comment - I very much like edismax - The "pf2" parameter in particular is doing wonders for getting my most relevant documents to the very top of the list in one of my apps.
          Hide
          Yonik Seeley added a comment -

          Does anyone remember why we disable coord for the main query?

          Show
          Yonik Seeley added a comment - Does anyone remember why we disable coord for the main query?
          Hide
          Yonik Seeley added a comment -

          Interesting... edismax does currently treat (bar - foo) like (bar -foo).
          I didn't code that, so it must be the Lucene query parser.
          I wonder if it has done that forever, and if it was intended?

          Show
          Yonik Seeley added a comment - Interesting... edismax does currently treat (bar - foo) like (bar -foo). I didn't code that, so it must be the Lucene query parser. I wonder if it has done that forever, and if it was intended?
          Hide
          Graham P added a comment - - edited

          Please ensure that the the edismax does not have old DisMax bug where a query for

          bar - foo

          gets treated like

          bar -foo

          Users do not expect the first query to remove all instances of "foo" from their results. "-" meaning "forbidden" should only take effect when the "-" is followed by a non-whitespace character. Similarly for +. Same problem as SOLR-490, but I doubt the necessity of interpreting the original query as a phrase query "bar - foo" in order to fix the problem.

          Show
          Graham P added a comment - - edited Please ensure that the the edismax does not have old DisMax bug where a query for bar - foo gets treated like bar -foo Users do not expect the first query to remove all instances of "foo" from their results. "-" meaning "forbidden" should only take effect when the "-" is followed by a non-whitespace character. Similarly for +. Same problem as SOLR-490 , but I doubt the necessity of interpreting the original query as a phrase query "bar - foo" in order to fix the problem.
          Hide
          Hoss Man added a comment -

          Correcting Fix Version based on CHANGES.txt, see this thread for more details...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

          Show
          Hoss Man added a comment - Correcting Fix Version based on CHANGES.txt, see this thread for more details... http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E
          Hide
          Hoss Man added a comment -

          Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

          Selection criteria was "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. email notifications were suppressed.

          A unique token for finding these 240 issues in the future: hossversioncleanup20100527

          Show
          Hoss Man added a comment - Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email... http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E Selection criteria was "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. email notifications were suppressed. A unique token for finding these 240 issues in the future: hossversioncleanup20100527
          Hide
          Hoss Man added a comment -

          Another bug noticed on the user list, edismax doesn't seem to respect MM all the time – in particular when there are negated clauses...

          Compare...
          http://localhost:8983/solr/select?debugQuery=true&defType=dismax&qf=text&q=xxx+yyy+zzz+-1234&mm=2
          http://localhost:8983/solr/select?debugQuery=true&defType=edismax&qf=text&q=xxx+yyy+zzz+-1234&mm=2

          Show
          Hoss Man added a comment - Another bug noticed on the user list, edismax doesn't seem to respect MM all the time – in particular when there are negated clauses... Compare... http://localhost:8983/solr/select?debugQuery=true&defType=dismax&qf=text&q=xxx+yyy+zzz+-1234&mm=2 http://localhost:8983/solr/select?debugQuery=true&defType=edismax&qf=text&q=xxx+yyy+zzz+-1234&mm=2
          Hide
          Hoss Man added a comment -

          Jonathan: looking at the code, it seems completely plausible, ant that's the direction i was going down with that previous patch – but i got hung up on the fact that for reasons i couldn't identify, clauses refering to fieldnames that don't exist in the schema are getting dropped – need to track down where that is happening and stop it, so the new code can look at those "field names" and treat them as aliases (to resolve to other fields)

          I just need to find time to dig into it more – but if you want to take a swing at fixing edismax.unescapedcolon.bug.test.patch and then improving on edismax.userFields.patch, by all means be my guest.

          Show
          Hoss Man added a comment - Jonathan: looking at the code, it seems completely plausible, ant that's the direction i was going down with that previous patch – but i got hung up on the fact that for reasons i couldn't identify, clauses refering to fieldnames that don't exist in the schema are getting dropped – need to track down where that is happening and stop it, so the new code can look at those "field names" and treat them as aliases (to resolve to other fields) I just need to find time to dig into it more – but if you want to take a swing at fixing edismax.unescapedcolon.bug.test.patch and then improving on edismax.userFields.patch, by all means be my guest.
          Hide
          Jonathan Rochkind added a comment -

          Hoss, I would be EXTREMELY interested in the "uf" ability to specify only certain fields that may be user-specified (with other field-spec-looking things just escaped, I guess?), and also in another comment you made earlier:

          "one really nice thing about the field query support though: it looks like it would really be easy to add support for arbitrary field name aliasing with something like f.someFieldAlias.qf=realFieldA^3+realFieldB^4"

          That would be amazing, and make eDismax meet all my needs. Are you still thinking of that, and does your work with the code make it seem any more or less plausible?

          Show
          Jonathan Rochkind added a comment - Hoss, I would be EXTREMELY interested in the "uf" ability to specify only certain fields that may be user-specified (with other field-spec-looking things just escaped, I guess?), and also in another comment you made earlier: "one really nice thing about the field query support though: it looks like it would really be easy to add support for arbitrary field name aliasing with something like f.someFieldAlias.qf=realFieldA^3+realFieldB^4" That would be amazing, and make eDismax meet all my needs. Are you still thinking of that, and does your work with the code make it seem any more or less plausible?
          Hide
          Yonik Seeley added a comment -

          Hmmm, the intention was to try and detect when a ':' meant a fielded query, or if it was a literal.
          So IIRC, it does look like a bug if it's not working.

          A "uf" param sounds good.

          Show
          Yonik Seeley added a comment - Hmmm, the intention was to try and detect when a ':' meant a fielded query, or if it was a literal. So IIRC, it does look like a bug if it's not working. A "uf" param sounds good.
          Hide
          Hoss Man added a comment -

          What does "u" in "uf" stand for?

          "user fields" ... as in "field names a user may refer to" ... but it's not something i though through to hard, as i said: work in progress.

          Show
          Hoss Man added a comment - What does "u" in "uf" stand for? "user fields" ... as in "field names a user may refer to" ... but it's not something i though through to hard, as i said: work in progress.
          Hide
          Otis Gospodnetic added a comment -

          What does "u" in "uf" stand for?

          Show
          Otis Gospodnetic added a comment - What does "u" in "uf" stand for?
          Hide
          Hoss Man added a comment -

          FWIW: initial steps towards adding a "uf" param to let users specify what field names can be specified explicitly in the query string, with optional default boosts to apply to those clauses ... not finished.

          Show
          Hoss Man added a comment - FWIW: initial steps towards adding a "uf" param to let users specify what field names can be specified explicitly in the query string, with optional default boosts to apply to those clauses ... not finished.
          Hide
          Hoss Man added a comment -

          On the train this past weekend i started trying to tackle the issue of making support for field based queries (ie: "fieldA:valueB") configurable so that it could be turned on/off for certain fields (or left off completely for back-compat with dismax)

          Based on yonik's description of edismax, and my initial reading of the code (particularly the use of clause.field and getFieldName in ExtendedDismaxQParser) i was under the impression that if a "clause" consisting of FOO:BAR was encountered, and FOO was not a known field, that the clause would be treated as a literal, and the colon would be escaped before passing it on to ExtendedSolrQueryParser ... essentially that FOO:BAR and FOO\:BAR would be equivalent if FOO is not the name of a real field according to the IndexSchema.

          For reasons I don't fully understand yet, this isn't the case – as the attached test shows, the queries are parsed differently, and (evidently) FOO:BAR is parsed as an empty query if FOO is not a real field.

          Before I try digging into this too much, I wanted to sanity check:

          • is this expected? ... was this done intentionally?
          • is this desired? ... is this logical default behavior to have if the field isn't defined? should we have tests to assert this before i start adding more config options to change the behavior?
          Show
          Hoss Man added a comment - On the train this past weekend i started trying to tackle the issue of making support for field based queries (ie: "fieldA:valueB") configurable so that it could be turned on/off for certain fields (or left off completely for back-compat with dismax) Based on yonik's description of edismax, and my initial reading of the code (particularly the use of clause.field and getFieldName in ExtendedDismaxQParser) i was under the impression that if a "clause" consisting of FOO:BAR was encountered, and FOO was not a known field, that the clause would be treated as a literal, and the colon would be escaped before passing it on to ExtendedSolrQueryParser ... essentially that FOO:BAR and FOO\:BAR would be equivalent if FOO is not the name of a real field according to the IndexSchema. For reasons I don't fully understand yet, this isn't the case – as the attached test shows, the queries are parsed differently, and (evidently) FOO:BAR is parsed as an empty query if FOO is not a real field. Before I try digging into this too much, I wanted to sanity check: is this expected? ... was this done intentionally? is this desired? ... is this logical default behavior to have if the field isn't defined? should we have tests to assert this before i start adding more config options to change the behavior?
          Hide
          Peter Wolanin added a comment -

          some commented out debug code left in the committed parser?

              protected void addClause(List clauses, int conj, int mods, Query q) {
          //System.out.println("addClause:clauses="+clauses+" conj="+conj+" mods="+mods+" q="+q);
                super.addClause(clauses, conj, mods, q);
              }
          
          Show
          Peter Wolanin added a comment - some commented out debug code left in the committed parser? protected void addClause(List clauses, int conj, int mods, Query q) { // System .out.println( "addClause:clauses=" +clauses+ " conj=" +conj+ " mods=" +mods+ " q=" +q); super .addClause(clauses, conj, mods, q); }
          Hide
          Hoss Man added a comment -

          Committed revision 901342.
          ...
          this was the same as my SOLR-1553.pf-refactor.patch with the one addition of restoring the use of DisjunctionMaxQuery for the pf* params (per yonik's comment that he couldn't remember why he changed it)

          if we figure out his reason (i'm sure he had one) we can re-evaluate.

          Show
          Hoss Man added a comment - Committed revision 901342. ... this was the same as my SOLR-1553 .pf-refactor.patch with the one addition of restoring the use of DisjunctionMaxQuery for the pf* params (per yonik's comment that he couldn't remember why he changed it) if we figure out his reason (i'm sure he had one) we can re-evaluate.
          Hide
          David Smiley added a comment -

          To answer my own question, the bigram phrasing approach is to address shortcomings in a complete phrase boost.

          Show
          David Smiley added a comment - To answer my own question, the bigram phrasing approach is to address shortcomings in a complete phrase boost.
          Hide
          David Smiley added a comment -

          Yonik (or someone else I guess), would you mind commenting on this claim?:
          "Improved proximity boosting via word bigrams... this prevents the problem of needing 100% of the words in the document to get any boost, as well as having all of the words in a single field."
          I looked through the source and saw the bigram addition using "SHOULD"... cool... but I am unaware of the two problems you speak of that this addresses.

          By the way... this thing is slick. You rock.

          Show
          David Smiley added a comment - Yonik (or someone else I guess), would you mind commenting on this claim?: "Improved proximity boosting via word bigrams... this prevents the problem of needing 100% of the words in the document to get any boost, as well as having all of the words in a single field." I looked through the source and saw the bigram addition using "SHOULD"... cool... but I am unaware of the two problems you speak of that this addresses. By the way... this thing is slick. You rock.
          Hide
          Hoss Man added a comment -

          incremental patch (against trunk) that i wrote on the plane a few days ago...

          refactors the pf logic and adds comments: pf is restored to it's original meaning in dismax, pf3 does what it did in edismax, and pf2 is added to do what edismax was previously doing with pf.

          clear as mud right? ... fortunately the code is much cleaner then the explanation.

          Show
          Hoss Man added a comment - incremental patch (against trunk) that i wrote on the plane a few days ago... refactors the pf logic and adds comments: pf is restored to it's original meaning in dismax, pf3 does what it did in edismax, and pf2 is added to do what edismax was previously doing with pf. clear as mud right? ... fortunately the code is much cleaner then the explanation.
          Hide
          Yonik Seeley added a comment -

          why is "TO" listed as an operator when building up the phrase boost fields? (line 296) ... if range queries are supported, then shouldn't the upper/lower bounds also be striped out of the clauses list?

          It seemed incrementally better than leaving it out.
          Range queries aren't actually supported in the fallback mode. The phrase boosting is also relatively rudimentary. We could try to recognize range queries and removing the whole thing.

          i see that the boost queries built from the pf and pf3 fields are put in BooleanQueries instead of DisjunctionMaxQueries ... but why?

          I honestly don't recall (this was written a while ago).

          ExtendedAnalyzer feels like a really big hack ... i'm not certain, but i don't think it works correctly if a CharFilter is declared.

          Probably not - CharFilter didn't exist when this was written.
          Should we use an alternative method for signaling that stopwords are optional? The nice thing about putting the stopword filter in the query analyzer (and having edismax detect it) is that something like the lucene query parser works as it did in the past... stopwords are removed. But I could see the other side of the argument too.

          Anyway, I think I agree with pretty much everything you say - just haven't had time to do anything about it.

          Show
          Yonik Seeley added a comment - why is "TO" listed as an operator when building up the phrase boost fields? (line 296) ... if range queries are supported, then shouldn't the upper/lower bounds also be striped out of the clauses list? It seemed incrementally better than leaving it out. Range queries aren't actually supported in the fallback mode. The phrase boosting is also relatively rudimentary. We could try to recognize range queries and removing the whole thing. i see that the boost queries built from the pf and pf3 fields are put in BooleanQueries instead of DisjunctionMaxQueries ... but why? I honestly don't recall (this was written a while ago). ExtendedAnalyzer feels like a really big hack ... i'm not certain, but i don't think it works correctly if a CharFilter is declared. Probably not - CharFilter didn't exist when this was written. Should we use an alternative method for signaling that stopwords are optional? The nice thing about putting the stopword filter in the query analyzer (and having edismax detect it) is that something like the lucene query parser works as it did in the past... stopwords are removed. But I could see the other side of the argument too. Anyway, I think I agree with pretty much everything you say - just haven't had time to do anything about it.
          Hide
          Hoss Man added a comment -

          Thoughts while reading the code...

          • the code is kind of hard to read ... there's a serious dirth of comments
          • reads very kludgy, clearly a hacked up version of DisMax ... probably want to refactor some helper functions (that can then be documented)
          • the clause.field and getFieldName functionality is dangerous for people migrating from edismax->dismax (users guessing field names can query on fields the solr admin doesn't want them to query on) ... we need an option to turn that off.
            • one really nice thing about the field query support though: it looks like it would really be easy to add support for arbitrary field name aliasing with something like f.someFieldAlias.qf=realFieldA^3+realFieldB^4
            • perhaps getFieldName should only work for fields explicitly enumerated in a param?
          • why is "TO" listed as an operator when building up the phrase boost fields? (line 296) ... if range queries are supported, then shouldn't the upper/lower bounds also be striped out of the clauses list?
            • accepting range queries also seems like something that people should be able to disable
          • apparently "pf" was changed to iteratively build boosting phrase queries for every 'pair' of words, and "pf3" is a new param to build boosting phrase queries for every 'triple' of words in the input. while this certainly seems useful, it's not back-compatable .. why not restore 'pf' to it's original purpose, and add "pf2" for hte pairs?
          • what is the motivation for ExtendedSolrQueryParser.makeDismax? ... i see that the boost queries built from the pf and pf3 fields are put in BooleanQueries instead of DisjunctionMaxQueries ... but why? (if the user searches for a phrase that's common in many fields of one document, that document is going to get a huge score boost regardless of the "tie" value, which kind of defeats the point of what the dismax parser is trying to do)
          • we should remove the extremely legacy "/* legacy logic */" for dealing with "bq" ... almost no one should care about that, we really don't need to carry it forward in a new parser.
          • there are a lot of empty catch blocks that seem like they should at least log a warning or debug message.
          • ExtendedAnalyzer feels like a really big hack ... i'm not certain, but i don't think it works correctly if a CharFilter is declared.
          • we need to document all these new params ("pf3", "lowercaseOperators", "boost",

          Thoughts while testing it out on some really hairy edge cases that break the old dismax parser...

          • this is really cool
          • this is really freaking cool.
          • still has a problem with search strings like "foo &&" and "foo ||" ... i suspect it would be an easy fix to recognize these just like AND/OR are recognized and escaped.
          • once we fix some of hte issues mentioned above, we should absolutely register this using the name "dismax" by default, and register the old one as "oldDismax" with a note in CHANGES.txt telling people to use defType=oldDismax if they really need it.
          Show
          Hoss Man added a comment - Thoughts while reading the code... the code is kind of hard to read ... there's a serious dirth of comments reads very kludgy, clearly a hacked up version of DisMax ... probably want to refactor some helper functions (that can then be documented) the clause.field and getFieldName functionality is dangerous for people migrating from edismax->dismax (users guessing field names can query on fields the solr admin doesn't want them to query on) ... we need an option to turn that off. one really nice thing about the field query support though: it looks like it would really be easy to add support for arbitrary field name aliasing with something like f.someFieldAlias.qf=realFieldA^3+realFieldB^4 perhaps getFieldName should only work for fields explicitly enumerated in a param? why is "TO" listed as an operator when building up the phrase boost fields? (line 296) ... if range queries are supported, then shouldn't the upper/lower bounds also be striped out of the clauses list? accepting range queries also seems like something that people should be able to disable apparently "pf" was changed to iteratively build boosting phrase queries for every 'pair' of words, and "pf3" is a new param to build boosting phrase queries for every 'triple' of words in the input. while this certainly seems useful, it's not back-compatable .. why not restore 'pf' to it's original purpose, and add "pf2" for hte pairs? what is the motivation for ExtendedSolrQueryParser.makeDismax? ... i see that the boost queries built from the pf and pf3 fields are put in BooleanQueries instead of DisjunctionMaxQueries ... but why? (if the user searches for a phrase that's common in many fields of one document, that document is going to get a huge score boost regardless of the "tie" value, which kind of defeats the point of what the dismax parser is trying to do) we should remove the extremely legacy "/* legacy logic */" for dealing with "bq" ... almost no one should care about that, we really don't need to carry it forward in a new parser. there are a lot of empty catch blocks that seem like they should at least log a warning or debug message. ExtendedAnalyzer feels like a really big hack ... i'm not certain, but i don't think it works correctly if a CharFilter is declared. we need to document all these new params ("pf3", "lowercaseOperators", "boost", Thoughts while testing it out on some really hairy edge cases that break the old dismax parser... this is really cool this is really freaking cool. still has a problem with search strings like "foo &&" and "foo ||" ... i suspect it would be an easy fix to recognize these just like AND/OR are recognized and escaped. once we fix some of hte issues mentioned above, we should absolutely register this using the name "dismax" by default, and register the old one as "oldDismax" with a note in CHANGES.txt telling people to use defType=oldDismax if they really need it.
          Hide
          Yonik Seeley added a comment -

          committed.
          Leaving open for now, as there are unresolved issues, such as if we should point "dismax" at this new version.

          Show
          Yonik Seeley added a comment - committed. Leaving open for now, as there are unresolved issues, such as if we should point "dismax" at this new version.
          Hide
          Hoss Man added a comment -

          Errr, well... I wrote it pretty quickly.

          You can't possibly have written it any more quickly then i did the original dismax parser ... i mean jusus: was so rushed i forgot to test "AND" in uppercase, and as a result didn't request the time i needed to do it all the right way.

          i think we've all suffered from my poor initial implementation long enough.

          Show
          Hoss Man added a comment - Errr, well... I wrote it pretty quickly. You can't possibly have written it any more quickly then i did the original dismax parser ... i mean jusus: was so rushed i forgot to test "AND" in uppercase, and as a result didn't request the time i needed to do it all the right way. i think we've all suffered from my poor initial implementation long enough.
          Hide
          Yonik Seeley added a comment -

          Assuming the new parser is well written (and i would expect nothing less from all you lucid folks)

          Errr, well... I wrote it pretty quickly. Someone was working on a better query parser and this one was more of a "we need something now" type of thing.

          But yes, it should be easy to turn off the extra features.

          Show
          Yonik Seeley added a comment - Assuming the new parser is well written (and i would expect nothing less from all you lucid folks) Errr, well... I wrote it pretty quickly. Someone was working on a better query parser and this one was more of a "we need something now" type of thing. But yes, it should be easy to turn off the extra features.
          Hide
          Hoss Man added a comment -

          I haven't looked at the patch, and i hope to find some time to do so prior to year with a "3" in it, but in the meantime...

          I guess one of the main questions is if this should remain separate from dismax, or if it should eventually replace dismax?

          I think the right way to handle this is to let both classes coexist in the code base, but to start encouraging people to use the new/improved version by changing the default used when defType=dismax ... a note in CHANGES.txt that people who really want the old class should use defType=oldDismax, or explicitly declare the old parsing using the name "dismax" seems totally suitable (much as we did with LegacyDateField, and lucenePlusSort.

          Having the capability to do field selections from the query string can often be undesirable, if not even a security issue in some cases.

          Assuming the new parser is well written (and i would expect nothing less from all you lucid folks) i imagine it wouldn't take too much work to make all of those various options controllable by params. if nothing else, an "escapeTheseChars" type option would be nice brute force way to disable some stuff.

          Show
          Hoss Man added a comment - I haven't looked at the patch, and i hope to find some time to do so prior to year with a "3" in it, but in the meantime... I guess one of the main questions is if this should remain separate from dismax, or if it should eventually replace dismax? I think the right way to handle this is to let both classes coexist in the code base, but to start encouraging people to use the new/improved version by changing the default used when defType=dismax ... a note in CHANGES.txt that people who really want the old class should use defType=oldDismax, or explicitly declare the old parsing using the name "dismax" seems totally suitable (much as we did with LegacyDateField, and lucenePlusSort. Having the capability to do field selections from the query string can often be undesirable, if not even a security issue in some cases. Assuming the new parser is well written (and i would expect nothing less from all you lucid folks) i imagine it wouldn't take too much work to make all of those various options controllable by params. if nothing else, an "escapeTheseChars" type option would be nice brute force way to disable some stuff.
          Hide
          Erik Hatcher added a comment -

          Personally I think this should live independently from dismax. Having the capability to do field selections from the query string can often be undesirable, if not even a security issue in some cases.

          +1 to keeping them separate, or at least controllable whether the field selection and other crazy query syntax is available or not.

          Show
          Erik Hatcher added a comment - Personally I think this should live independently from dismax. Having the capability to do field selections from the query string can often be undesirable, if not even a security issue in some cases. +1 to keeping them separate, or at least controllable whether the field selection and other crazy query syntax is available or not.
          Hide
          Yonik Seeley added a comment -

          Yonik, did you mean to attach a patch, but forgot?

          Heh... yup. Here it is.

          Show
          Yonik Seeley added a comment - Yonik, did you mean to attach a patch, but forgot? Heh... yup. Here it is.
          Hide
          Otis Gospodnetic added a comment -

          I think you need to click on "Issue Links" link, delete, and re-link.

          I have a feeling once this is in, people won't need the original dismax.

          Yonik, did you mean to attach a patch, but forgot?

          Show
          Otis Gospodnetic added a comment - I think you need to click on "Issue Links" link, delete, and re-link. I have a feeling once this is in, people won't need the original dismax. Yonik, did you mean to attach a patch, but forgot?
          Hide
          Yonik Seeley added a comment -

          I guess one of the main questions is if this should remain separate from dismax, or if it should eventually replace dismax? If the latter, they could either co-exist for some time and eventually replace, or remain uncommitted before it can replace...

          Incompatibilities off the top of my head:

          • "pf" creates word bigrams for proximity boosting... we could perhaps change this to "pf2" and add in support for the dismax-style pf
          • fielded queries - dismax doesn't have them, and they may not be desired by all clients
          • full lucene syntax - same as above - may not be desired by all clients

          I think the other incompatibilities would mostly fall in the category of bugs, or things that weren't well specified or well behaved anyway (e.g. I don't really see a downside to allowing nested pure negative queries to work)

          Show
          Yonik Seeley added a comment - I guess one of the main questions is if this should remain separate from dismax, or if it should eventually replace dismax? If the latter, they could either co-exist for some time and eventually replace, or remain uncommitted before it can replace... Incompatibilities off the top of my head: "pf" creates word bigrams for proximity boosting... we could perhaps change this to "pf2" and add in support for the dismax-style pf fielded queries - dismax doesn't have them, and they may not be desired by all clients full lucene syntax - same as above - may not be desired by all clients I think the other incompatibilities would mostly fall in the category of bugs, or things that weren't well specified or well behaved anyway (e.g. I don't really see a downside to allowing nested pure negative queries to work)
          Hide
          Yonik Seeley added a comment -

          Oops, I meant to link the related dismax issues as "related to" not "blocks"... not sure how to undo it though.

          Show
          Yonik Seeley added a comment - Oops, I meant to link the related dismax issues as "related to" not "blocks"... not sure how to undo it though.
          Hide
          Yonik Seeley added a comment -

          Lucid Imagination is contributing back this query parser currently being used at search.lucidimagination.com

          The extended dismax parser was based on the original Solr dismax parser.

          • Supports full lucene query syntax in the absence of syntax errors
          • supports "and"/"or" to mean "AND"/"OR" in lucene syntax mode
          • When there are syntax errors, improved smart partial escaping of special characters is done to prevent them... in this mode, fielded queries, +/-, and phrase queries are still supported.
          • Improved proximity boosting via word bigrams... this prevents the problem of needing 100% of the words in the document to get any boost, as well as having all of the words in a single field.
          • advanced stopword handling... stopwords are not required in the mandatory part of the query but are still used (if indexed) in the proximity boosting part. If a query consists of all stopwords (e.g. to be or not to be) then all will be required.
          • Supports the "boost" parameter.. like the dismax bf param, but multiplies the function query instead of adding it in
          • Supports pure negative nested queries... so a query like +foo (-foo) will match all documents

          Some examples of queries that currently give dismax fits that now work with extended dismax:
          OR
          AND
          NOT
          -
          "

          Example queries:
          http://localhost:8983/solr/select?defType=edismax&q=big+blue+house&pf=text&qf=text&debugQuery=true
          +(((text:big) (text:blue) (text:hous))~3) ((text:"big blue") (text:"blue hous"))

          http://localhost:8983/solr/select?defType=edismax&q=hello&pf=text&qf=text&boost=popularity&debugQuery=true
          boost(+(text:hello),int(popularity))

          And if the text field were configured with the stopfilter only on the query analyzer, then
          http://localhost:8983/solr/select?defType=edismax&q=this+is+the+end&pf=text&qf=text&debugQuery=true
          +(((text:end))~1) ((text:"this is") (text:"is the") (text:"the end"))

          http://localhost:8983/solr/select?defType=edismax&q=how+now+"brown+cow"+popularity:100&pf=text&qf=text&debugQuery=true
          +(((text:how) (text:now) (text:"brown cow") popularity:100)~4) (text:"how now")

          Show
          Yonik Seeley added a comment - Lucid Imagination is contributing back this query parser currently being used at search.lucidimagination.com The extended dismax parser was based on the original Solr dismax parser. Supports full lucene query syntax in the absence of syntax errors supports "and"/"or" to mean "AND"/"OR" in lucene syntax mode When there are syntax errors, improved smart partial escaping of special characters is done to prevent them... in this mode, fielded queries, +/-, and phrase queries are still supported. Improved proximity boosting via word bigrams... this prevents the problem of needing 100% of the words in the document to get any boost, as well as having all of the words in a single field. advanced stopword handling... stopwords are not required in the mandatory part of the query but are still used (if indexed) in the proximity boosting part. If a query consists of all stopwords (e.g. to be or not to be) then all will be required. Supports the "boost" parameter.. like the dismax bf param, but multiplies the function query instead of adding it in Supports pure negative nested queries... so a query like +foo (-foo) will match all documents Some examples of queries that currently give dismax fits that now work with extended dismax: OR AND NOT - " Example queries: http://localhost:8983/solr/select?defType=edismax&q=big+blue+house&pf=text&qf=text&debugQuery=true +(((text:big) (text:blue) (text:hous))~3) ((text:"big blue") (text:"blue hous")) http://localhost:8983/solr/select?defType=edismax&q=hello&pf=text&qf=text&boost=popularity&debugQuery=true boost(+(text:hello),int(popularity)) And if the text field were configured with the stopfilter only on the query analyzer, then http://localhost:8983/solr/select?defType=edismax&q=this+is+the+end&pf=text&qf=text&debugQuery=true +(((text:end))~1) ((text:"this is") (text:"is the") (text:"the end")) http://localhost:8983/solr/select?defType=edismax&q=how+now+ "brown+cow"+popularity:100&pf=text&qf=text&debugQuery=true +(((text:how) (text:now) (text:"brown cow") popularity:100)~4) (text:"how now")

            People

            • Assignee:
              Yonik Seeley
              Reporter:
              Yonik Seeley
            • Votes:
              9 Vote for this issue
              Watchers:
              24 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development