SOLR-418

Editorial Query Boosting Component

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3
    • Component/s: search
    • Labels:
      None

      Description

      For a given query string, a human editor can specify which documents should be treated as important. This is related to a Lucene discussion:
      http://www.nabble.com/Forced-Top-Document-tf4682070.html#a13408965

      Ideally, the position could be determined explicitly by the editor - otherwise increasing the boost is probably sufficient.

      This patch uses the Search Component framework to inject custom document boosting into the standard SearchHandler.

      1. SOLR-418-QueryBoosting.patch
        41 kB
        Yonik Seeley
      2. SOLR-418-QueryBoosting.patch
        42 kB
        Ryan McKinley
      3. SOLR-418-QueryBoosting.patch
        41 kB
        Ryan McKinley
      4. SOLR-418-QueryBoosting.patch
        40 kB
        Ryan McKinley
      5. SOLR-418-QueryBoosting.patch
        28 kB
        Ryan McKinley
      6. SOLR-418-QueryBoosting.patch
        52 kB
        Ryan McKinley
      7. SOLR-418-QueryBoosting.patch
        53 kB
        Ryan McKinley
      8. SOLR-418-QueryBoosting.patch
        47 kB
        Ryan McKinley

          Activity

          Ryan McKinley added a comment - - edited

          Here is a first draft that includes recent changes to SOLR-281. This is incomplete and is posted to get early feedback and advice.

          This component loads a file and builds a map of queries to special documents. The format is:

          <boost>
           <query text="XXXX">
            <doc id="1" priority="1" />
           </query>
           <query text="YYYY">
            <doc id="1" priority="1" />
            <doc id="2" priority="3" />
           </query>
           <query text="ZZZZ">
            <doc id="1" priority="1" />
            <doc id="2" priority="3" />
            <doc id="3" priority="5" />
           </query>
          </boost>
          

          For the query "YYYY", document 1 should be in position 1 and document 2 in position 3.
          I considered a .csv style format:
          id,priority,phrase
          or
          phrase,[id,priority]+
          but I think the XML equivalent will be easier to edit/maintain.
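
As a rough illustration of how a file in this format could be turned into a lookup map, here is a minimal sketch. It is not the code from the patch: the class name BoostFileSketch and the method parse are hypothetical, and it uses only the JDK's DOM parser rather than Solr's config classes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of the patch: reads the draft boost.xml format.
public class BoostFileSketch {

  /** Returns query text -> (doc id -> priority), preserving the order in the file. */
  public static Map<String, Map<String, Integer>> parse(String xml) {
    try {
      Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      Map<String, Map<String, Integer>> boosts = new LinkedHashMap<>();
      NodeList queries = dom.getElementsByTagName("query");
      for (int i = 0; i < queries.getLength(); i++) {
        Element query = (Element) queries.item(i);
        Map<String, Integer> docs = new LinkedHashMap<>();
        NodeList docNodes = query.getElementsByTagName("doc");
        for (int j = 0; j < docNodes.getLength(); j++) {
          Element doc = (Element) docNodes.item(j);
          docs.put(doc.getAttribute("id"), Integer.parseInt(doc.getAttribute("priority")));
        }
        boosts.put(query.getAttribute("text"), docs);
      }
      return boosts;
    } catch (Exception e) {
      // Malformed XML or a non-numeric priority ends up here.
      throw new IllegalArgumentException("could not read boost file", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<boost><query text=\"YYYY\">"
        + "<doc id=\"1\" priority=\"1\"/><doc id=\"2\" priority=\"3\"/>"
        + "</query></boost>";
    System.out.println(parse(xml)); // prints {YYYY={1=1, 2=3}}
  }
}
```

A map keyed by the query text keeps the per-request lookup to a single get() call.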

          The search handler is configured with:

          <searchComponent name="boost" class="org.apache.solr.handler.component.QueryBoostingComponent">
            <str name="analyzer">string</str>
            <str name="boosts">boost.xml</str>
          </searchComponent>

          <requestHandler name="/boost" class="solr.SearchHandler">
            <arr name="last-components">
              <str>boost</str>
            </arr>
          </requestHandler>
          

          The <str name="analyzer">string</str> bit chooses a fieldType (from schema.xml) and uses that to normalize input strings. This lets us reuse existing lowercase/trim/pattern/etc filters.
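
To make the normalization idea concrete, here is a small sketch. It assumes a plain trim/collapse-whitespace/lowercase step as a stand-in for whatever analyzer the configured fieldType would supply; QueryNormalizerSketch and its methods are hypothetical names, not Solr API.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the fieldType's analyzer: trim, collapse whitespace, lowercase.
public class QueryNormalizerSketch {

  public static String normalize(String raw) {
    return raw.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
  }

  /** Looks up an incoming query in a map whose keys were normalized when the file was loaded. */
  public static <V> V lookup(Map<String, V> byNormalizedQuery, String incomingQuery) {
    return byNormalizedQuery.get(normalize(incomingQuery));
  }

  public static void main(String[] args) {
    Map<String, String> boosts = new HashMap<>();
    boosts.put(normalize("IPod Nano"), "doc-1");
    System.out.println(lookup(boosts, "  ipod   NANO ")); // prints doc-1
  }
}
```

The same normalization has to be applied both to the keys read from the file and to the incoming query string, otherwise the lookup silently misses.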

          For sorting, I think the best approach is to use a custom sort when sorting by score. (This isn't implemented yet)

          Currently for a matching query, this converts the query using:

                // Build a query to match the forced documents:
                // (id:1 id:2 id:3 id:4 id:5)^0
                BooleanQuery boosted = new BooleanQuery( true );
                for( Booster b : booster ) {
                  TermQuery tq = new TermQuery( new Term( idField, b.id ) );
                  boosted.add( tq, BooleanClause.Occur.SHOULD );
                }
                boosted.setBoost( 0 ); // don't affect the score
                
                // Change the query to insert forced documents
                BooleanQuery newq = new BooleanQuery( true );
                newq.add( query, BooleanClause.Occur.SHOULD );
                newq.add( boosted, BooleanClause.Occur.SHOULD );
                builder.setQuery( newq );
          

          For debugging, check:
          http://localhost:8983/solr/boost?q=ZZZZ&debugQuery=true

          Any feedback would be great!

          Ryan McKinley added a comment -

          Here is an updated patch that implements sorting. Rather than trying to mix boosted and normal results, this uses a custom sort to put the boosted results at the top. The boost.xml format is now:

           <query text="ZZZZ">
            <doc id="1" />
            <doc id="2" />
            <doc id="3" />
           </query>
          

          For the query "ZZZZ", documents 1, 2, and 3 will be the first docs returned, followed by anything normally matching "ZZZZ".

          If the query specifies a sort, it will be respected. Only SCORE sorts are modified to boost the configured documents.
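
The ordering described here can be sketched with an ordinary in-memory comparator. This is only an illustration of the intended result order, not the Lucene custom sort used in the patch; ElevatedSortSketch and Hit are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: configured docs first (in file order), then score descending.
public class ElevatedSortSketch {

  public static final class Hit {
    final String id;
    final float score;
    public Hit(String id, float score) { this.id = id; this.score = score; }
  }

  public static Comparator<Hit> elevatedFirst(List<String> elevatedIds) {
    Map<String, Integer> rank = new HashMap<>();
    for (int i = 0; i < elevatedIds.size(); i++) {
      rank.put(elevatedIds.get(i), i);
    }
    Comparator<Hit> byScoreDesc = (a, b) -> Float.compare(b.score, a.score);
    // Docs that are not configured share the same (largest) rank and fall back to score.
    return Comparator
        .comparingInt((Hit h) -> rank.getOrDefault(h.id, Integer.MAX_VALUE))
        .thenComparing(byScoreDesc);
  }

  public static List<String> sortedIds(List<Hit> hits, List<String> elevatedIds) {
    List<Hit> copy = new ArrayList<>(hits);
    copy.sort(elevatedFirst(elevatedIds));
    List<String> ids = new ArrayList<>();
    for (Hit h : copy) {
      ids.add(h.id);
    }
    return ids;
  }

  public static void main(String[] args) {
    List<Hit> hits = Arrays.asList(
        new Hit("9", 5.0f), new Hit("2", 0.1f), new Hit("1", 0.2f), new Hit("7", 9.0f));
    System.out.println(sortedIds(hits, Arrays.asList("1", "2"))); // prints [1, 2, 7, 9]
  }
}
```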

          Otis Gospodnetic added a comment -

          It seems like even this last bit would be great to make configurable:

          "If the query specifies a sort, it will be respected. Only SCORE sorts are modified to boost the configured documents."

          In other words, make it possible to force docs in boost.xml to show up in appropriate positions regardless of the sort type.

          Also, perhaps references to 'boost(s)' should now be renamed, so there is no confusion? Isn't the "industry standard" for this type of stuff "one box"?

          Ryan McKinley added a comment - - edited

          I agree with changing the name from "boosts" to something else... what is "one box"? (Google points me to their new search appliance.)

          Re always putting the 'boosted' docs first... I'm not against making this configurable, but it seems wrong.

          If you want to force the sort to have the boosted docs first, isn't that:

          <lst name="invariants">
            <str name="sort">score desc</str>
          </lst>
          

          Is there a real use case to have 'sort=date desc' put the boosted docs first?

          Yonik Seeley added a comment -

          It seems like the user should be in control of whether these docs are added & sorted first, regardless of what the regular sort is.

          Mike Klaas added a comment -

          I think this makes a lot of sense, though I wonder if it might make sense to uniquify queries based on more than the query string. Certainly the results for a given query would depend greatly on the match-affecting parameters, for instance, fq= of dismax. This seems part of the "intrinsic query" to me. Sort does too, but I don't use it much so I'm not sure if my intuition is to be trusted there.

          Ryan McKinley added a comment -

          To be clear, this respects filter queries. For:
          http://localhost:8983/solr/boost?q=ZZZZ&debugQuery=true&fq=id:2
          only id:2 is returned even though 1 and 3 are boosted.

          I suppose we could do something to make the intrinsic query include other fields. Perhaps

          <boost>
           <query>
            <param name="q">string</param>
            <param name="fq">another</param>
           </query>
           <docs>
            <doc id="1" />
            <doc id="2" />
            <doc id="3" />
           </docs>
          </boost>
          

          or

          <query params="q=string&amp;fq=another">
            <doc id="1" />
          </query>                    
          

          but I think this gets more complicated than necessary. For the cases I can think of where you would want different docs boosted, you could just register a different handler with different boosted docs / invariants. This kind of functionality only really makes sense with dismax style user queries rather than standard Lucene query syntax. That is, "dog" rather than "name:dog^3 content:dog^1".


          re terminology. Maybe using the word "boost" will get too confusing. Perhaps "elevate", "promote", "force top documents"?

          Rather than 'QueryBoostingComponent', this could be the DocumentElevationComponent:

          <elevate>
           <query phrase="XXXX">
            <doc id="1"/>
           </query>
           <query text="YYYY">
            <doc id="1" />
            <doc id="2" />
           </query>
          </elevate>
          

          The fastsearch glossary has a few terms that may be relevant?

          Absolute boosting

          Absolute boosting enables a document to be consistently displayed at a given position in the result set when a user searches with a specific query. It also prevents individual documents from being displayed when a user searches with a specific query.

          Under boosting, they have:

          Boosting may be applied in two ways:

          • Query independent (document boosting). This is used to boost high quality pages for all queries that match the document
          • Query dependant (query boosting). In this case specific documents may be boosted for given queries

          Their "Absolute boosting" description makes me wonder if we should add a flag to "bury" or "hide" a document for a given query. Maybe:

           <doc id="2" hide="true"/>
          
          Yonik Seeley added a comment -

          Is there a way to specify that the file is in the index directory (so it can be replicated out like the rest of the index?)

          Ryan McKinley added a comment -

          > Is there a way to specify that the file is in the index directory (so it can be replicated out like the rest of the index?)

          Do we do that anywhere else? Is there / should there be a standard way to do this? I remember you discussing this elsewhere, but I don't know where. external value sources?

          If you put config files in the index directory, how do you handle the empty new index case?

          You get a FileNotFoundException if you have /data/index/boosts.xml without an index in that directory.

          Hoss Man added a comment -

          > Is there a way to specify that the file is in the index directory (so it can be replicated
          > out like the rest of the index?)

          that definitely seems like a separate issue that we should attempt to solve on the whole for all type of config files down the road ... it also assumes that this component will reread the file on every newSearcher (i haven't read the patch, but i'm assuming it doesn't)

          Ryan McKinley added a comment -

          Updated to work with trunk. Added a 'forceBoosting="true"' argument to force boosting regardless of the requested sort.

          Unless we figure out a way to do absolute positioning, I think this component should be renamed 'DocumentElevationComponent'.

          Ryan McKinley added a comment -

          I would like to commit most of this patch under SOLR-281. I will leave out the QueryBoostingComponent stuff and just commit the changes to the component framework that make it possible to configure.

          Ryan McKinley added a comment -

          Updated patch for trunk. This also:
          1. renames the component to 'QueryElevationComponent' and uses the term 'elevate' rather than 'boost'
          2. implements an 'exclude' function

           <query text="ipod">
            <doc id="1" />
            <doc id="MA147LL/A" exclude="true" />
           </query>
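
One way to picture the combined effect of elevated and excluded entries on a result list is the sketch below. It is an assumption-laden illustration, not the patch's implementation: ElevateExcludeSketch and arrange are hypothetical names, and it assumes elevated docs are returned even when they do not match the query on their own, as in the earlier draft that adds them to the query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of the result order for one configured query.
public class ElevateExcludeSketch {

  /**
   * Elevated ids come first in configured order, followed by the normal matches
   * (already in their usual order) minus anything marked exclude="true".
   * An id listed as both elevated and excluded is not handled specially here.
   */
  public static List<String> arrange(
      List<String> matches, List<String> elevated, Set<String> excluded) {
    Set<String> ordered = new LinkedHashSet<>(elevated);
    for (String id : matches) {
      if (!excluded.contains(id)) {
        ordered.add(id); // no effect if the id was already elevated
      }
    }
    return new ArrayList<>(ordered);
  }

  public static void main(String[] args) {
    List<String> matches = Arrays.asList("MA147LL/A", "5", "1");
    List<String> result = arrange(
        matches, Arrays.asList("1"), Collections.singleton("MA147LL/A"));
    System.out.println(result); // prints [1, 5]
  }
}
```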
          
          Ryan McKinley added a comment -

          Here is an updated patch that allows you to put the configuration in the data directory and have it reload for each IndexReader.

          Assuming the component is initialized with:
          <str name="config-file">elevate.xml</str>

          If elevate.xml exists within the conf directory it will be loaded once at startup. If it exists within the 'data' directory, it will be reloaded after <commit/>.

          Check http://wiki.apache.org/solr/QueryElevationComponent for tentative docs.

          This also refactored the 'getLatestFile' logic out of o.a.s.search.function.FileFloatSource and put it in a new class: o.a.s.util.VersionedFile

          Ryan McKinley added a comment -

          Updated to accept a runtime query param "enableElevation" – this can disable elevation.

          Koji Sekiguchi added a comment -

          I'm interested in this feature and have a few comments:

          1. I was a bit confused by "analyzer" in solrconfig.xml. I think "fieldType" would be more straightforward.
          2. Pardon me if I'm wrong, but does elevationCache need to be synchronized in getElevationMap(), as it is called from prepare()?

          Ryan McKinley added a comment -

          Thanks Koji – here is an updated patch

          #1 - I changed "analyzer" to "queryFieldType" – this is the fieldType used to analyze the incoming query.

          #2 - I changed it to call synchronized( elevationCache ) when it checks a non-null entry. It does not need to be synchronized with a null key because in this case, the cache is only built on startup.

          To be safe, we could just use:

          final Map<IndexReader,Map<String, ElevationObj>> elevationCache = 
              Collections.synchronizedMap( new WeakHashMap<IndexReader, Map<String,ElevationObj>>() );
          

          but I'm not sure which is better.
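
For comparison, here is a sketch of the "hold one lock for every access" approach. It is illustrative only: ElevationCacheSketch, its type parameter R (standing in for IndexReader), and the loader function are hypothetical and not taken from the patch. Note that Collections.synchronizedMap on its own makes each individual call thread-safe but does not make a check-then-put sequence atomic, so a compound "load if missing" step still needs an explicit lock like the one below.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

// Hypothetical per-reader cache; R stands in for IndexReader.
public class ElevationCacheSketch<R> {

  private final Map<R, Map<String, String>> cache = new WeakHashMap<>();
  private final Function<R, Map<String, String>> loader;

  public ElevationCacheSketch(Function<R, Map<String, String>> loader) {
    this.loader = loader;
  }

  /** Every read and write of the map happens while holding the same lock. */
  public Map<String, String> get(R reader) {
    synchronized (cache) {
      Map<String, String> map = cache.get(reader);
      if (map == null) {
        map = loader.apply(reader); // e.g. re-read the config file for this reader
        cache.put(reader, map);
      }
      return map;
    }
  }

  public static void main(String[] args) {
    ElevationCacheSketch<Object> cache =
        new ElevationCacheSketch<>(r -> new HashMap<String, String>());
    Object reader = new Object();
    System.out.println(cache.get(reader) == cache.get(reader)); // prints true
  }
}
```

The trade-off in this sketch is that loading happens under the lock, so concurrent requests for a new reader wait for the first load instead of each rebuilding the map.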

          Yonik Seeley added a comment -

          Looks good Ryan!
          I reviewed, and changed a few minor things (new patch attached)

          • fixed a concurrency bug (access of map outside of sync can lead to concurrent modification exception or other errors, even if that key/value pair will never change)
          • changed the example example.xml a little, and switched the /elevate handler to load lazily
          • updated code/configs to reflect SearchHandler move
          • fixed (pre-existing) bugs in code moved to VersionedFile (multiple opens of same file)
          • dropped the seemingly unrelated changes in SolrServlet (part of another patch?)
          Ryan McKinley added a comment -

          Thanks for looking at this - and fixing it up

          > dropped the seemingly unrelated changes in SolrServlet (part of another patch?)

          not sure how that got in there.... it was part of an issue I had with resin loading servlets before filters and SOLR-350 initialization.


            People

            • Assignee:
              Ryan McKinley
              Reporter:
              Ryan McKinley