Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.6, 7.0
    • Component/s: search
    • Labels:
      None

      Description

      Solr has no support for Lucene's PayloadScoreQuery, yet it has support for indexing payloads (via DelimitedPayloadTokenFilter or NumericPayloadTokenFilter).

      This issue adds value source and query parser support for leveraging payloads created by those token filters.

      1. payload_value_source.png
        101 kB
        Erik Hatcher
      2. PayloadTermQueryPlugin.java
        2 kB
        Erik Hatcher
      3. SOLR-1485.patch
        43 kB
        Erik Hatcher
      4. SOLR-1485.patch
        38 kB
        Erik Hatcher
      5. SOLR-1485.patch
        36 kB
        Erik Hatcher
      6. SOLR-1485.patch
        36 kB
        Erik Hatcher
      7. SOLR-1485.patch
        33 kB
        Erik Hatcher

        Issue Links

          Activity

          Hide
          ehatcher Erik Hatcher added a comment -

          This class adds a QParserPlugin to support creating PayloadTermQuery's.

          This can be registered in solrconfig.xml like this:

          <queryParser name="payload" class="org.apache.solr.search.PayloadTermQueryPlugin"/>

          A custom Similarity is needed to score payloads (not provided with this issue).

          Once everything is lined up right (payload indexed, similarity with scorePayload implemented), a query like this can be used:
          http://localhost:8983/solr/select?q=

          {!payload%20f=payloads%20func=avg}

          foo&debugQuery=true

          As can be seen with this explanation:
          1.4450715 = (MATCH) fieldWeight(payloads:foo in 0), product of:
          4.709331 = (MATCH) btq, product of:
          0.70710677 = tf(phraseFreq=0.5)
          6.66 = scorePayload(...)
          0.30685282 = idf(payloads: foo=1)
          1.0 = fieldNorm(field=payloads, doc=0)

          Show
          ehatcher Erik Hatcher added a comment - This class adds a QParserPlugin to support creating PayloadTermQuery's. This can be registered in solrconfig.xml like this: <queryParser name="payload" class="org.apache.solr.search.PayloadTermQueryPlugin"/> A custom Similarity is needed to score payloads (not provided with this issue). Once everything is lined up right (payload indexed, similarity with scorePayload implemented), a query like this can be used: http://localhost:8983/solr/select?q= {!payload%20f=payloads%20func=avg} foo&debugQuery=true As can be seen with this explanation: 1.4450715 = (MATCH) fieldWeight(payloads:foo in 0), product of: 4.709331 = (MATCH) btq, product of: 0.70710677 = tf(phraseFreq=0.5) 6.66 = scorePayload(...) 0.30685282 = idf(payloads: foo=1) 1.0 = fieldNorm(field=payloads, doc=0)
          Hide
          billa Bill Au added a comment -

          Eric, have you started on this? I recently wrote a QParserPlugin that supports PayloadTermQuery. It is very bear-bone but could be a good starting point. I can attach my code here to get things started.

          Show
          billa Bill Au added a comment - Eric, have you started on this? I recently wrote a QParserPlugin that supports PayloadTermQuery. It is very bear-bone but could be a good starting point. I can attach my code here to get things started.
          Hide
          billa Bill Au added a comment -

          Never mind. I just saw you update. Your code looks good.

          Show
          billa Bill Au added a comment - Never mind. I just saw you update. Your code looks good.
          Hide
          billa Bill Au added a comment -

          Eric, do you think we should support default field and default operator in the QParser used?

          Show
          billa Bill Au added a comment - Eric, do you think we should support default field and default operator in the QParser used?
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          Moving out of 1.4 since this is a new feature that isn't ready to commit.
          As written, it looks more like "rawpayload" or something since no analysis is done on the input.

          Show
          yseeley@gmail.com Yonik Seeley added a comment - Moving out of 1.4 since this is a new feature that isn't ready to commit. As written, it looks more like "rawpayload" or something since no analysis is done on the input.
          Hide
          billa Bill Au added a comment -

          I am +0 on including/excluding this from 1.4. FYI, Solr 1.4 already has a DelimitedPayloadTokenFilterFactory which uses the DelimitedPayloadTokenFIlter in Lucene. If we include this, I think we should also include a Similarity class for payload, either as part of this JIRA or a separate one.

          There is also a similar JIRA on query support:

          https://issues.apache.org/jira/browse/SOLR-1337

          Show
          billa Bill Au added a comment - I am +0 on including/excluding this from 1.4. FYI, Solr 1.4 already has a DelimitedPayloadTokenFilterFactory which uses the DelimitedPayloadTokenFIlter in Lucene. If we include this, I think we should also include a Similarity class for payload, either as part of this JIRA or a separate one. There is also a similar JIRA on query support: https://issues.apache.org/jira/browse/SOLR-1337
          Hide
          david256 david added a comment -

          Hi,
          What if I want to do a boolean query?
          like: payoladField:steve OR NonPayloadField:George ?

          Won't the payload plugin be used for all the query parts?

          Show
          david256 david added a comment - Hi, What if I want to do a boolean query? like: payoladField:steve OR NonPayloadField:George ? Won't the payload plugin be used for all the query parts?
          Hide
          lancenorskog Lance Norskog added a comment -

          Julien Noche posted last August that he had to create a new query parser variant of dismax. I cannot find an example of a query string in his post.

          Using Payloads with DisMaxQParser in SOLR

          Use cases for a payload-based query:

          • a raw byte stream
          • a serialized Java String
          • a number
          • a boolean value in the payload
          • "is there a payload?"
          • boosting a document if the search term has a payload
            • the payload is a number (packed float) created by

          Most of these can be encoded into a payload. But there are no matching decoders.
          There is no code that pulls the payload and uses the data.

          Show
          lancenorskog Lance Norskog added a comment - Julien Noche posted last August that he had to create a new query parser variant of dismax. I cannot find an example of a query string in his post. Using Payloads with DisMaxQParser in SOLR Use cases for a payload-based query: a raw byte stream a serialized Java String a number a boolean value in the payload "is there a payload?" boosting a document if the search term has a payload the payload is a number (packed float) created by Most of these can be encoded into a payload. But there are no matching decoders. There is no code that pulls the payload and uses the data.
          Hide
          ehatcher Erik Hatcher added a comment -

          Is there interest in rejuvenating this to get some form of a SpanTermQuery support into Solr? I'll take a stab at updating this to do like the

          {!term}

          query parser to factor in the field type and any needed analysis. Anything else?

          Perhaps for the dismax+payloads situation Lance mentioned, which will be a different issue altogether, we make the SolrQueryParser implementation used by (e)dismax pluggable that it uses, so that there can be a span-aware one?

          Show
          ehatcher Erik Hatcher added a comment - Is there interest in rejuvenating this to get some form of a SpanTermQuery support into Solr? I'll take a stab at updating this to do like the {!term} query parser to factor in the field type and any needed analysis. Anything else? Perhaps for the dismax+payloads situation Lance mentioned, which will be a different issue altogether, we make the SolrQueryParser implementation used by (e)dismax pluggable that it uses, so that there can be a span-aware one?
          Hide
          rdeck Roland Deck added a comment -

          Hi
          I tried the PayloadTermQueryPlugin today.
          To get the scores as mentioned above I had to change the code a little.

          Here is the relevant code fragment:

          @Override
          public QParser createParser(String qstr, SolrParams localParams, SolrParams params, SolrQueryRequest req) {
          return new QParser(qstr, localParams, params, req) {
          public Query parse() throws ParseException

          { //rdeck: hint: lets try to set includeSpanCore to true. => Yes it works! (after having re-indexed all documents)! return new PayloadTermQuery( new Term(localParams.get(QueryParsing.F), localParams.get(QueryParsing.V)), createPayloadFunction(localParams.get("func")), true); //was originally false instead of true }

          };
          }

          with includeSpanCore = false, I get score = payload value
          with includeSpanCore = true, the payload takes part on the score calculation

          I have some questions left:

          1) Why is the PayloadTermQuery limited to just one field? Or will this change?
          2) How can I mix up queries containing parts which are payload dependent and others which aren't?

          Show
          rdeck Roland Deck added a comment - Hi I tried the PayloadTermQueryPlugin today. To get the scores as mentioned above I had to change the code a little. Here is the relevant code fragment: @Override public QParser createParser(String qstr, SolrParams localParams, SolrParams params, SolrQueryRequest req) { return new QParser(qstr, localParams, params, req) { public Query parse() throws ParseException { //rdeck: hint: lets try to set includeSpanCore to true. => Yes it works! (after having re-indexed all documents)! return new PayloadTermQuery( new Term(localParams.get(QueryParsing.F), localParams.get(QueryParsing.V)), createPayloadFunction(localParams.get("func")), true); //was originally false instead of true } }; } with includeSpanCore = false, I get score = payload value with includeSpanCore = true, the payload takes part on the score calculation I have some questions left: 1) Why is the PayloadTermQuery limited to just one field? Or will this change? 2) How can I mix up queries containing parts which are payload dependent and others which aren't?
          Hide
          otis Otis Gospodnetic added a comment -

          Erik Hatcher - not sure if you are watching SOLR-1337, so I'll write the same comment/Q here:

          My impression was that Span queries and Payloads are kind of pase in Luceneland.... no?
          If yes, should we Won't Fix this?

          Show
          otis Otis Gospodnetic added a comment - Erik Hatcher - not sure if you are watching SOLR-1337 , so I'll write the same comment/Q here: My impression was that Span queries and Payloads are kind of pase in Luceneland.... no? If yes, should we Won't Fix this?
          Hide
          gsingers Grant Ingersoll added a comment -

          I would say it would be good to support payloads, unless there is a better solution.

          Show
          gsingers Grant Ingersoll added a comment - I would say it would be good to support payloads, unless there is a better solution.
          Hide
          ehatcher Erik Hatcher added a comment -

          Anyone have thoughts on how best to implement the scorePayload() method in Solr? Should Solr have its own DefaultSimilarity subclass that implements it?

          It'd be great to at least get support for PayloadTermQuery in, such that it supports DelimitedPayloadTokenFilter created payloads. I suppose that means that scorePayload() will need to support at least float and integer decoding, based on introspecting the field type definition. What about "identity" encoding (throw an unsupported exception?)? Using a custom encoder would require a custom scorePayload(), and by default throw an exception on that too I presume.

          And what about other Similarity implementations and if/how to support those?

          I'm seeing that it's tough to put this into Solr in a general purpose way, but maybe we can at least get out of the box support for integer and float using the default similarity.

          Show
          ehatcher Erik Hatcher added a comment - Anyone have thoughts on how best to implement the scorePayload() method in Solr? Should Solr have its own DefaultSimilarity subclass that implements it? It'd be great to at least get support for PayloadTermQuery in, such that it supports DelimitedPayloadTokenFilter created payloads. I suppose that means that scorePayload() will need to support at least float and integer decoding, based on introspecting the field type definition. What about "identity" encoding (throw an unsupported exception?)? Using a custom encoder would require a custom scorePayload(), and by default throw an exception on that too I presume. And what about other Similarity implementations and if/how to support those? I'm seeing that it's tough to put this into Solr in a general purpose way, but maybe we can at least get out of the box support for integer and float using the default similarity.
          Hide
          erickerickson Erick Erickson added a comment -

          Happens that I put together an end-to-end example here: http://searchhub.org/2014/06/13/end-to-end-payload-example-in-solr/, and part of that is a discussion I had with Hossman about whether a query parser approach or a fieldType approach would be better. Turns out each supports different capabilities.

          Personally, I think a fieldType would be a good thing since it should "just work".

          FWIW

          Show
          erickerickson Erick Erickson added a comment - Happens that I put together an end-to-end example here: http://searchhub.org/2014/06/13/end-to-end-payload-example-in-solr/ , and part of that is a discussion I had with Hossman about whether a query parser approach or a fieldType approach would be better. Turns out each supports different capabilities. Personally, I think a fieldType would be a good thing since it should "just work". FWIW
          Hide
          ehatcher Erik Hatcher added a comment -

          Is there a reason not to use SchemaSimilarityFactory as the default Similarity moving forward? Relying on that would be nice, it seems.

          Show
          ehatcher Erik Hatcher added a comment - Is there a reason not to use SchemaSimilarityFactory as the default Similarity moving forward? Relying on that would be nice, it seems.
          Hide
          arafalov Alexandre Rafalovitch added a comment -

          Is there a reason not to use SchemaSimilarityFactory as the default Similarity moving forward? Relying on that would be nice, it seems.

          This is now the case I believe. Does that mean this issue can move forward? Or has payloads querying been implemented somewhere else by now?

          Show
          arafalov Alexandre Rafalovitch added a comment - Is there a reason not to use SchemaSimilarityFactory as the default Similarity moving forward? Relying on that would be nice, it seems. This is now the case I believe. Does that mean this issue can move forward? Or has payloads querying been implemented somewhere else by now?
          Hide
          markus17 Markus Jelsma added a comment -

          This is indeed the case but there is still no PayloadQParserPlugin, probably for the reason that there is also no Similarity implementation taking care of payload scoring.

          We have and use such a plugin and can provide it, and a BM25 similarity where the payload is directly used for scoring in the basic cases. But payloads can be used for much fancier applications, so would would Solr want to support? Without a similarity, the parser is useless.

          Show
          markus17 Markus Jelsma added a comment - This is indeed the case but there is still no PayloadQParserPlugin, probably for the reason that there is also no Similarity implementation taking care of payload scoring. We have and use such a plugin and can provide it, and a BM25 similarity where the payload is directly used for scoring in the basic cases. But payloads can be used for much fancier applications, so would would Solr want to support? Without a similarity, the parser is useless.
          Hide
          ehatcher Erik Hatcher added a comment -

          work in progress checkpoint. two new query parsers, mainly

          {!payload_score}

          , and a payload (float) value source.

          Show
          ehatcher Erik Hatcher added a comment - work in progress checkpoint. two new query parsers, mainly {!payload_score} , and a payload (float) value source.
          Hide
          ehatcher Erik Hatcher added a comment -

          It's coming together, finally! Here's an example field and field type:

            "add-field-type": {
              "name": "delimited_float_payloads",
              "class": "solr.TextField",
              "positionIncrementGap": "100",
              "indexAnalyzer": {
                "tokenizer": {
                  "class": "solr.WhitespaceTokenizerFactory"
                },
                "filters": [
                  {
                    "class": "solr.DelimitedPayloadTokenFilterFactory",
                    "encoder": "float"
                  }
                ]
              },
              "queryAnalyzer": {
                "tokenizer": {
                  "class": "solr.WhitespaceTokenizerFactory"
                }
              }
            },
            "add-field" : {
              "name":"delimited_float_payloads",
              "type":"delimited_float_payloads",
              "stored": "true",
              "multiValued": "true"
            }
          

          Given that field definition, here's some example data indexed:

          bin/post -c test -type text/csv -out yes -d $'id,delimited_float_payloads\n1,one|1.0 two|2.0 three|3.0\n2,weighted|50.0 weighted|100.0'
          

          Here's a query that returns the computed payload function for the term "two" (which has a float value of 2.0):

          http://localhost:9999/solr/test/select?q=*:*&fl=*,p:payload(delimited_float_payloads,two),score
          
          Show
          ehatcher Erik Hatcher added a comment - It's coming together, finally! Here's an example field and field type: "add-field-type" : { "name" : "delimited_float_payloads" , "class" : "solr.TextField" , "positionIncrementGap" : "100" , "indexAnalyzer" : { "tokenizer" : { "class" : "solr.WhitespaceTokenizerFactory" }, "filters" : [ { "class" : "solr.DelimitedPayloadTokenFilterFactory" , "encoder" : " float " } ] }, "queryAnalyzer" : { "tokenizer" : { "class" : "solr.WhitespaceTokenizerFactory" } } }, "add-field" : { "name" : "delimited_float_payloads" , "type" : "delimited_float_payloads" , "stored" : " true " , "multiValued" : " true " } Given that field definition, here's some example data indexed: bin/post -c test -type text/csv -out yes -d $'id,delimited_float_payloads\n1,one|1.0 two|2.0 three|3.0\n2,weighted|50.0 weighted|100.0' Here's a query that returns the computed payload function for the term "two" (which has a float value of 2.0): http: //localhost:9999/solr/test/select?q=*:*&fl=*,p:payload(delimited_float_payloads,two),score
          Hide
          ehatcher Erik Hatcher added a comment - - edited

          There's three pieces in that previous patch:

          • FloatPayloadValueSource: the payload(...) function
          • PayloadScoreQParserPlugin: the {!payload_score} qparser
          • PayloadCheckQParserPlugin: the {!payload_check} qparser

          The focus so far has really been the payload(...) function. The query parsers need more work, as at the moment they only support a term query.

          Show
          ehatcher Erik Hatcher added a comment - - edited There's three pieces in that previous patch: FloatPayloadValueSource : the payload(...) function PayloadScoreQParserPlugin : the { !payload_score } qparser PayloadCheckQParserPlugin : the { !payload_check } qparser The focus so far has really been the payload(...) function. The query parsers need more work, as at the moment they only support a term query.
          Hide
          dsmiley David Smiley added a comment -

          Just a couple tidbits of feedback on the patch:

          • computePayloadFactor would probably get called a ton of times, thus it's worth trying to ensure it's implementation is quick. I see you having it doing "integer" & "float" string comparisons. That can be done beforehand; perhaps declare a java.util.ToDoubleFunction initialized with one of 3 lambdas depending on the situation.
          • FloatPayloadValueSource confuses me... ValueSources produce a value per document yet payloads are per term occurrence. What use-case do you have in mind?
          Show
          dsmiley David Smiley added a comment - Just a couple tidbits of feedback on the patch: computePayloadFactor would probably get called a ton of times, thus it's worth trying to ensure it's implementation is quick. I see you having it doing "integer" & "float" string comparisons. That can be done beforehand; perhaps declare a java.util.ToDoubleFunction initialized with one of 3 lambdas depending on the situation. FloatPayloadValueSource confuses me... ValueSources produce a value per document yet payloads are per term occurrence . What use-case do you have in mind?
          Hide
          ehatcher Erik Hatcher added a comment -

          David Smiley - thanks for having a look.

          I'll rework to the lambda recommendation, thanks!

          FloatPayloadValueSource has this function signature: payload(field,value[,default, 'min|max|average']), where the default behavior currently is to average the payload floats for the matching terms. Your following question gets at the heart of one of the primary use cases for this, unique terms per-field, where the idea is to return only a single float value (for sorting, other functioning, pseudo-fielding, and range/bucket/query faceting). Multiple matching terms per doc serves different use cases though, such as down-boosting synonyms (via NumericPayloadTokenFilter), or averaging a bunch of entity matches, each carrying some "weight" factor.

          Thanks again, David.

          Show
          ehatcher Erik Hatcher added a comment - David Smiley - thanks for having a look. I'll rework to the lambda recommendation, thanks! FloatPayloadValueSource has this function signature: payload(field,value [,default, 'min|max|average'] ) , where the default behavior currently is to average the payload floats for the matching terms. Your following question gets at the heart of one of the primary use cases for this, unique terms per-field, where the idea is to return only a single float value (for sorting, other functioning, pseudo-fielding, and range/bucket/query faceting). Multiple matching terms per doc serves different use cases though, such as down-boosting synonyms (via NumericPayloadTokenFilter), or averaging a bunch of entity matches, each carrying some "weight" factor. Thanks again, David.
          Hide
          ehatcher Erik Hatcher added a comment -

          implements string checking and lambda creation in ctor instead of nested computePayloadFactor function

          Show
          ehatcher Erik Hatcher added a comment - implements string checking and lambda creation in ctor instead of nested computePayloadFactor function
          Hide
          ehatcher Erik Hatcher added a comment -

          David Smiley - I wasn't quite sure how to use ToDoubleFunction in this case, but I took a stab at a lambda with an interface. Payloads really need an encoder/decoder coupled utility anyway, so this is a start along that idea as well.

          Show
          ehatcher Erik Hatcher added a comment - David Smiley - I wasn't quite sure how to use ToDoubleFunction in this case, but I took a stab at a lambda with an interface. Payloads really need an encoder/decoder coupled utility anyway, so this is a start along that idea as well.
          Hide
          ehatcher Erik Hatcher added a comment -

          last patch had a recursive loop issue on the SimScorer (this instead of the delegate)

          Show
          ehatcher Erik Hatcher added a comment - last patch had a recursive loop issue on the SimScorer ( this instead of the delegate)
          Hide
          ehatcher Erik Hatcher added a comment - - edited

          While I'm at it, I found that Lucene's PayloadScoreQuery had a .equals/hashCode/cache bug in that it didn't take into account the includeSpanScore flag - fixed here.

          Show
          ehatcher Erik Hatcher added a comment - - edited While I'm at it, I found that Lucene's PayloadScoreQuery had a .equals/hashCode/cache bug in that it didn't take into account the includeSpanScore flag - fixed here.
          Hide
          ehatcher Erik Hatcher added a comment -

          a few fiddly changes, but most relevantly added three dynamic fields/types to the data_driven configuration: *_dpf, *_dpi, and *_dps, for delimited_payloads float, int, and string.

          Show
          ehatcher Erik Hatcher added a comment - a few fiddly changes, but most relevantly added three dynamic fields/types to the data_driven configuration: *_dpf, *_dpi, and *_dps, for delimited_payloads float, int, and string.
          Hide
          ehatcher Erik Hatcher added a comment - - edited

          With the latest patch, one can index payloaded terms like this:

              bin/solr create -c payloads
              bin/post -c payloads -type text/csv -out yes -d $'id,weighted_terms_dpf\n1,one|1.0 two|2.0 three|3.0\n2,weighted|50.0 weighted|100.0'
          
          # this is two documents:
          #     1: one|1.0 two|2.0 three|3.0
          #     2: weighted|50.0 weighted|100.0
          

          and then get the values back as scores like this:

              http://localhost:8983/solr/payloads/select?q={!payload_score f=weighted_terms_dpf v=$payload_term func=max}&fl=id,*_dpf,score&wt=csv&payload_term=two
              * &payload_term=two
              id,score
              1,2.0
          
              * &payload_term=three:
              id,score
              1,3.0
              
              * &payload_term=weighted (func=max):
              id,score
              2,100.0
          
              * &payload_term=weighted (func=min):
              id,score
              2,50.0
              
              * &payload_term=weighted (func=average):
              id,score
              2,75.0    
          
          Show
          ehatcher Erik Hatcher added a comment - - edited With the latest patch, one can index payloaded terms like this: bin/solr create -c payloads bin/post -c payloads -type text/csv -out yes -d $'id,weighted_terms_dpf\n1,one|1.0 two|2.0 three|3.0\n2,weighted|50.0 weighted|100.0' # this is two documents: # 1: one|1.0 two|2.0 three|3.0 # 2: weighted|50.0 weighted|100.0 and then get the values back as scores like this: http: //localhost:8983/solr/payloads/select?q={!payload_score f=weighted_terms_dpf v=$payload_term func=max}&fl=id,*_dpf,score&wt=csv&payload_term=two * &payload_term=two id,score 1,2.0 * &payload_term=three: id,score 1,3.0 * &payload_term=weighted (func=max): id,score 2,100.0 * &payload_term=weighted (func=min): id,score 2,50.0 * &payload_term=weighted (func=average): id,score 2,75.0
          Hide
          ehatcher Erik Hatcher added a comment - - edited

          It's interesting to note that there is actually one query built in that could handle payload scoring already: q={!xmlparser}<BoostingTermQuery fieldName="weighted_terms_dpf">weighted</BoostingTermQuery>, with hardcoded "average" function and includeSpanScore=true, resulting in the above data example an output of:

          id,score
          2,71.00154
          

          And here's a patch that adds includeSpanScore control flexibility to <BoostingTermQuery>:

          --- a/lucene/queryparser/src/java/org/apache/lucene/queryparser/xml/builders/BoostingTermBuilder.java
          +++ b/lucene/queryparser/src/java/org/apache/lucene/queryparser/xml/builders/BoostingTermBuilder.java
          @@ -34,11 +34,12 @@ public class BoostingTermBuilder extends SpanBuilderBase {
             @Override
             public SpanQuery getSpanQuery(Element e) throws ParserException {
               String fieldName = DOMUtils.getAttributeWithInheritanceOrFail(e, "fieldName");
          +    boolean includeSpanQuery = DOMUtils.getAttribute(e, "includeSpanScore", true);
               String value = DOMUtils.getNonBlankTextOrFail(e);
           
          -    // TODO: add parameter to control PayloadScoreQuery's `includeSpanScore` and `PayloadFunction`
          +    // TODO: add parameter to control PayloadScoreQuery's and `PayloadFunction`
               SpanQuery btq = new PayloadScoreQuery(new SpanTermQuery(new Term(fieldName, value)),
          -        new AveragePayloadFunction());
          +        new AveragePayloadFunction(), includeSpanQuery);
               btq = new SpanBoostQuery(btq, DOMUtils.getAttribute(e, "boost", 1.0f));
               return btq;
             }
          
          Show
          ehatcher Erik Hatcher added a comment - - edited It's interesting to note that there is actually one query built in that could handle payload scoring already: q={!xmlparser}<BoostingTermQuery fieldName="weighted_terms_dpf">weighted</BoostingTermQuery> , with hardcoded "average" function and includeSpanScore=true, resulting in the above data example an output of: id,score 2,71.00154 And here's a patch that adds includeSpanScore control flexibility to <BoostingTermQuery> : --- a/lucene/queryparser/src/java/org/apache/lucene/queryparser/xml/builders/BoostingTermBuilder.java +++ b/lucene/queryparser/src/java/org/apache/lucene/queryparser/xml/builders/BoostingTermBuilder.java @@ -34,11 +34,12 @@ public class BoostingTermBuilder extends SpanBuilderBase { @Override public SpanQuery getSpanQuery(Element e) throws ParserException { String fieldName = DOMUtils.getAttributeWithInheritanceOrFail(e, "fieldName" ); + boolean includeSpanQuery = DOMUtils.getAttribute(e, "includeSpanScore" , true ); String value = DOMUtils.getNonBlankTextOrFail(e); - // TODO: add parameter to control PayloadScoreQuery's `includeSpanScore` and `PayloadFunction` + // TODO: add parameter to control PayloadScoreQuery's and `PayloadFunction` SpanQuery btq = new PayloadScoreQuery( new SpanTermQuery( new Term(fieldName, value)), - new AveragePayloadFunction()); + new AveragePayloadFunction(), includeSpanQuery); btq = new SpanBoostQuery(btq, DOMUtils.getAttribute(e, "boost" , 1.0f)); return btq; }
          Hide
          ehatcher Erik Hatcher added a comment -

          I've started committing to branch jira/SOLR-1485: https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a=shortlog;h=refs/heads/jira/SOLR-1485 where I've just committed the payload() function (and tests and helper utilities)

          Show
          ehatcher Erik Hatcher added a comment - I've started committing to branch jira/ SOLR-1485 : https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a=shortlog;h=refs/heads/jira/SOLR-1485 where I've just committed the payload() function (and tests and helper utilities)
          Hide
          ehatcher Erik Hatcher added a comment -

          I've now committed/pushed everything to the branch needed for the payload() function, including the dynamic field types for delimited payloading of terms. I think we could call this JIRA satisfied with what's committed so far, however I've got these pieces in addition locally: PayloadCheckQParserPlugin, PayloadScoreQParserPlugin.java, PayloadScoringSimilarityWrapper.java, and an improvement to xmlparser's <BoostingTermQuery> - may include these along with this issue, not sure yet.

          Show
          ehatcher Erik Hatcher added a comment - I've now committed/pushed everything to the branch needed for the payload() function, including the dynamic field types for delimited payloading of terms. I think we could call this JIRA satisfied with what's committed so far, however I've got these pieces in addition locally: PayloadCheckQParserPlugin, PayloadScoreQParserPlugin.java, PayloadScoringSimilarityWrapper.java, and an improvement to xmlparser's <BoostingTermQuery> - may include these along with this issue, not sure yet.
          Hide
          ehatcher Erik Hatcher added a comment -

          I've completed both query parsers as well on the jira/SOLR-1485 branch.

          Any review takers before I merge this over master and then to 6.x? David Smiley? Erick Erickson? I wonder if there'll be any API issues between master and 6x?

          Show
          ehatcher Erik Hatcher added a comment - I've completed both query parsers as well on the jira/ SOLR-1485 branch. Any review takers before I merge this over master and then to 6.x? David Smiley ? Erick Erickson ? I wonder if there'll be any API issues between master and 6x?
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 6c565c001bb48af0c37a4d4909ba2f98d51e7fe6 in lucene-solr's branch refs/heads/master from Erik Hatcher
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6c565c0 ]

          SOLR-1485: Add payload support

          Show
          jira-bot ASF subversion and git services added a comment - Commit 6c565c001bb48af0c37a4d4909ba2f98d51e7fe6 in lucene-solr's branch refs/heads/master from Erik Hatcher [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6c565c0 ] SOLR-1485 : Add payload support
          Hide
          ehatcher Erik Hatcher added a comment -

          I'm not seeing the git message here, but I've committed to branch_6x now too, including a minor tweak to the similarity wrapper due to API change on master. https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a=shortlog;h=refs/heads/branch_6x

          Show
          ehatcher Erik Hatcher added a comment - I'm not seeing the git message here, but I've committed to branch_6x now too, including a minor tweak to the similarity wrapper due to API change on master. https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a=shortlog;h=refs/heads/branch_6x
          Hide
          ehatcher Erik Hatcher added a comment -

          Leaving this ticket open until the documentation is done, but otherwise, whew, DONE!

          Show
          ehatcher Erik Hatcher added a comment - Leaving this ticket open until the documentation is done, but otherwise, whew, DONE!
          Hide
          ehatcher Erik Hatcher added a comment -

          long time coming, thanks for the encouragement and patience. finally, payload support in Solr on the query/function side of things. This includes three pieces:

          • payload() value source function
          • payload_score query parser
          • payload_check query parser

          Docs for these have been added to the cwiki.

          Show
          ehatcher Erik Hatcher added a comment - long time coming, thanks for the encouragement and patience. finally, payload support in Solr on the query/function side of things. This includes three pieces: payload() value source function payload_score query parser payload_check query parser Docs for these have been added to the cwiki.
          Hide
          dsmiley David Smiley added a comment -

          Apologies for reviewing late.
          I'm looking at your documentation and what you committed.

          • It's not documented that PayloadScoreQuery parses the value as a SpanNearQuery (or SpanTermQuery if 1 term). Can you document this? Or if it's too much implementation detail then document the subset that you think is appropriate. (aside: it'd be nice if you let the user pick the query parser and query like other query parsers do see BoostQParserPlugin and how it calls subQuery).
          • PayloadUtils.createSpanQuery: TokenStream implements Closeable and I see that you called close(). But you should use try-with-resources to ensure it gets called. (Personally I think of this immediately I see code that calls close(); I ask myself why not try-with-resources). From experience with buggy analysis components (that I had a hand in writing , this is a serious issue since a forgotten close can actually put the entire Solr node into an inoperative state! (due to stringent checks in TokenStream and that analysis components get cached in ThreadLocal)
          Show
          dsmiley David Smiley added a comment - Apologies for reviewing late. I'm looking at your documentation and what you committed. It's not documented that PayloadScoreQuery parses the value as a SpanNearQuery (or SpanTermQuery if 1 term). Can you document this? Or if it's too much implementation detail then document the subset that you think is appropriate. (aside: it'd be nice if you let the user pick the query parser and query like other query parsers do see BoostQParserPlugin and how it calls subQuery). PayloadUtils.createSpanQuery: TokenStream implements Closeable and I see that you called close(). But you should use try-with-resources to ensure it gets called. (Personally I think of this immediately I see code that calls close(); I ask myself why not try-with-resources). From experience with buggy analysis components (that I had a hand in writing , this is a serious issue since a forgotten close can actually put the entire Solr node into an inoperative state! (due to stringent checks in TokenStream and that analysis components get cached in ThreadLocal)
          Hide
          ehatcher Erik Hatcher added a comment -

          David Smiley thanks again for your review - much appreciated.

          • Query parsing - yeah, that was a tricky one and I opted for the simplest thing that made sense given the available payloading of terms will be done by way of DelimitedPayloadTokenFilter, so rather than trying phrase, boolean, and other things just delegate to the analyzer and build a SpanTermQuery or SpanNearQuery with zero slop and ordered. I just documented this on the cwiki as well as a new upcoming commit that will add it to PayloadUtils.createSpanQuery(). On the parsing decision I looked around to find a way to create a SpanQuery conveniently (both PayloadScoreQuery and SpanPayloadCheckQuery require a SpanQuery, not just a Query). QueryBuilder.createSpanQuery() seemed the ticket, but alas it isn't public and also suffers from the troubles with TokenStream API hurdles that you mention - so I adapted it with the fixes that you spotted.
          • Thanks for the try-with-resources suggestion - implemented it locally (and running tests/precommit!) now but will commit that shortly.
          • Back to query parsing - I'm definitely open to anything that will work, but again a SpanQuery is the main constraint here. I don't quite see how subQuery parsing could be used here. Any suggestions/improvements along these lines are more than welcome. (my primary use case has been the payload() function so the query parsing beyond the quite useful as-is SpanTermQuery wasn't a big thrust)
          Show
          ehatcher Erik Hatcher added a comment - David Smiley thanks again for your review - much appreciated. Query parsing - yeah, that was a tricky one and I opted for the simplest thing that made sense given the available payloading of terms will be done by way of DelimitedPayloadTokenFilter, so rather than trying phrase, boolean, and other things just delegate to the analyzer and build a SpanTermQuery or SpanNearQuery with zero slop and ordered. I just documented this on the cwiki as well as a new upcoming commit that will add it to PayloadUtils.createSpanQuery(). On the parsing decision I looked around to find a way to create a SpanQuery conveniently (both PayloadScoreQuery and SpanPayloadCheckQuery require a SpanQuery , not just a Query). QueryBuilder.createSpanQuery() seemed the ticket, but alas it isn't public and also suffers from the troubles with TokenStream API hurdles that you mention - so I adapted it with the fixes that you spotted. Thanks for the try-with-resources suggestion - implemented it locally (and running tests/precommit!) now but will commit that shortly. Back to query parsing - I'm definitely open to anything that will work, but again a SpanQuery is the main constraint here. I don't quite see how subQuery parsing could be used here. Any suggestions/improvements along these lines are more than welcome. (my primary use case has been the payload() function so the query parsing beyond the quite useful as-is SpanTermQuery wasn't a big thrust)
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 5d42177b9290b61c658154e42223408944cd4bc1 in lucene-solr's branch refs/heads/master from Erik Hatcher
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5d42177 ]

          SOLR-1485: improve TokenStream API usage

          Show
          jira-bot ASF subversion and git services added a comment - Commit 5d42177b9290b61c658154e42223408944cd4bc1 in lucene-solr's branch refs/heads/master from Erik Hatcher [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5d42177 ] SOLR-1485 : improve TokenStream API usage
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit d496d17269f6acdc77fa06549ae3b4d91a05c321 in lucene-solr's branch refs/heads/branch_6x from Erik Hatcher
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d496d17 ]

          SOLR-1485: improve TokenStream API usage

          Show
          jira-bot ASF subversion and git services added a comment - Commit d496d17269f6acdc77fa06549ae3b4d91a05c321 in lucene-solr's branch refs/heads/branch_6x from Erik Hatcher [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d496d17 ] SOLR-1485 : improve TokenStream API usage
          Hide
          dsmiley David Smiley added a comment -

          The docs look good as well as your changes. Thanks Erik.
          I understand RE SpanTermQuery being sufficient for the use-cases at hand; and admittedly I'm not quite sure what use case would demand the flexibility of some sub-queryParser.

          Show
          dsmiley David Smiley added a comment - The docs look good as well as your changes. Thanks Erik. I understand RE SpanTermQuery being sufficient for the use-cases at hand; and admittedly I'm not quite sure what use case would demand the flexibility of some sub-queryParser.
          Hide
          ehatcher Erik Hatcher added a comment - - edited

          At the very least, if you want to do some lowercasing, or other token filtering, one could do that either before or after DelimitedPayloadTokenFilter, and the current implementation will at least match exact'ish phrases and payload score or check them - so it's actually not a bad first pass. (At first I implemented it with simple space splitting, as *_dps/*_dpf/*_dpi do, but then added analysis to add more configurability).

          Show
          ehatcher Erik Hatcher added a comment - - edited At the very least, if you want to do some lowercasing, or other token filtering, one could do that either before or after DelimitedPayloadTokenFilter, and the current implementation will at least match exact'ish phrases and payload score or check them - so it's actually not a bad first pass. (At first I implemented it with simple space splitting, as *_dps/*_dpf/*_dpi do, but then added analysis to add more configurability).

            People

            • Assignee:
              ehatcher Erik Hatcher
              Reporter:
              ehatcher Erik Hatcher
            • Votes:
              8 Vote for this issue
              Watchers:
              21 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development