Solr / SOLR-1365

Add configurable Sweetspot Similarity factory

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.3
    • Fix Version/s: 4.2, 6.0
    • Component/s: None
    • Labels: None

      Description

      This is some code that I wrote a while back.

      Normally, if you use SweetSpotSimilarity, you make it do something useful by extending SweetSpotSimilarity. Instead, I made a factory class and a configurable SweetSpotSimilarity. There are two classes. SweetSpotSimilarityFactory reads the parameters from schema.xml. It then creates an instance of VariableSweetSpotSimilarity, which is my custom SweetSpotSimilarity class. In addition to the standard functions, it also handles dynamic fields.

      So, in schema.xml, you could have something like this:

      <similarity class="org.apache.solr.schema.SweetSpotSimilarityFactory">
      <bool name="useHyperbolicTf">true</bool>

      <float name="hyperbolicTfFactorsMin">1.0</float>
      <float name="hyperbolicTfFactorsMax">1.5</float>
      <float name="hyperbolicTfFactorsBase">1.3</float>
      <float name="hyperbolicTfFactorsXOffset">2.0</float>

      <int name="lengthNormFactorsMin">1</int>
      <int name="lengthNormFactorsMax">1</int>
      <float name="lengthNormFactorsSteepness">0.5</float>

      <int name="lengthNormFactorsMin_description">2</int>
      <int name="lengthNormFactorsMax_description">9</int>
      <float name="lengthNormFactorsSteepness_description">0.2</float>

      <int name="lengthNormFactorsMin_supplierDescription_*">2</int>
      <int name="lengthNormFactorsMax_supplierDescription_*">7</int>
      <float name="lengthNormFactorsSteepness_supplierDescription_*">0.4</float>
      </similarity>

      So, now everything is in a config file instead of having to create your own subclass.
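As a rough illustration of the lookup this configuration implies, the sketch below resolves a per-field setting by trying an exact field-name suffix first, then a trailing-`*` glob, then the unsuffixed default. The class `FieldParamResolver` and its precedence order are assumptions made for illustration; they are not taken from the attached patch. Only the parameter names come from the config above.

```java
import java.util.Map;

/** Hypothetical helper for illustration; not part of the SOLR-1365 patch. */
final class FieldParamResolver {
    private final Map<String, Number> params;

    FieldParamResolver(Map<String, Number> params) {
        this.params = params;
    }

    /** Resolve a setting such as "lengthNormFactorsMin" for a concrete field name. */
    Number resolve(String base, String fieldName) {
        // 1. Exact per-field override, e.g. lengthNormFactorsMin_description
        Number exact = params.get(base + "_" + fieldName);
        if (exact != null) {
            return exact;
        }
        // 2. Prefix glob ending in "*", e.g. lengthNormFactorsMin_supplierDescription_*
        String keyPrefix = base + "_";
        for (Map.Entry<String, Number> entry : params.entrySet()) {
            String key = entry.getKey();
            if (key.startsWith(keyPrefix) && key.endsWith("*")) {
                String fieldPrefix = key.substring(keyPrefix.length(), key.length() - 1);
                if (fieldName.startsWith(fieldPrefix)) {
                    return entry.getValue();
                }
            }
        }
        // 3. Global default, e.g. lengthNormFactorsMin (null if not configured)
        return params.get(base);
    }
}
```

With the values above, a field named `supplierDescription_en` would pick up the glob settings, `description` would pick up its own, and any other field would fall back to the unsuffixed defaults.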

      1. SOLR-1365.patch
        28 kB
        Hoss Man
      2. SOLR-1365.patch
        18 kB
        Hoss Man
      3. SOLR-1365.patch
        7 kB
        Kevin Osborn

        Activity

        Erik Hatcher added a comment -

        Sweet!

        Very nice use of the SimilarityFactory capability.

        I took a brief look at the patch, the only feedback I have is that I believe that the dynamic field handling might be able to leverage some of Solr's built-in logic in IndexSchema. But how can a SimilarityFactory get access to that? Hmmm....?

        Erik Hatcher added a comment -

        I took a brief look at the patch, the only feedback I have is that I believe that the dynamic field handling might be able to leverage some of Solr's built-in logic in IndexSchema. But how can a SimilarityFactory get access to that? Hmmm....?

        Why, by implementing SolrCoreAware, of course.

        Kevin Osborn added a comment -

        Thanks for the feedback. I looked at IndexSchema. It seems like the only useful function in my case is using isDynamicField vs. seeing if the field name ends with a "*".

        But also is SimilarityFactory allowed to implement SolrCoreAware? I'm not too familiar with this interface, but my initial research shows that only SolrRequestHandler, QueryResponseWriter, SearchComponent, or UpdateRequestProcessorFactory may implement SolrCoreAware. Is this correct?

        Erik Hatcher added a comment -

        Any class loaded by SolrResourceLoader (any custom plugin, basically) can implement SolrCoreAware.

        Grant Ingersoll added a comment -

        Needs tests. Not sure this will make 1.4, as we are trying to not add new features at this point.

        Hoss Man added a comment -

        FWIW: if a new feature doesn't have any impact on existing users, and has good tests, then i say we might as well commit it for 1.4

        (If we were talking about a new feature on an existing component, then i'd be hesitant because of how that feature might impact existing users of that component – but in this case even if it has bad performance or some small bug that slips through tests, people have to go out of their way to use it)

        But grant's right: needs tests before it's really a subject for debate.

        Kevin Osborn added a comment -

        Thanks for the comments. I'll make the changes for Erik's suggestions and come up with some tests. If it gets into 1.4, great. If not, then it is not a huge deal since this is already production code for us. But, if it could be put into the main code base, then even better.

        Kevin Osborn added a comment -

        I am finally getting back around to this, and I am having trouble implementing SolrCoreAware. SolrResourceLoader has a method called assertAwareCompatibility, which throws an exception if my class does not extend SolrRequestHandler, QueryResponseWriter, SearchComponent, or UpdateRequestProcessorFactory. Am I missing anything?

        Erik Hatcher added a comment -

        I'm not really sure why we have that constraint in SolrResourceLoader, and why any class we load can't simply implement SolrCoreAware. But at the very least, we can update this to support a SimilarityFactory for the sake of this issue. +1

        Hoss Man added a comment -

        The constraints on what can be SolrCoreAware exist for two main reasons:

        1. to ensure some sanity in initialization .. one of the main reasons the SolrCoreAware interface was needed in the first place was because some plugins wanted to use the SolrCore to get access to other plugins during their initialization – but those other components weren't necessarily initialized yet. with the inform(SolrCore) method SolrCoreAware plugins know that all other components have been initialized, but they haven't necessarily been informed about the SolrCore, so they might not be "ready" to deal with other plugins yet ... it's generally just a big initialization-cluster-fuck, so the fewer classes involved the better
        2. prevent too much pollution of the SolrCore API. having direct access to the SolrCore is "a big deal" – once you have a reference to the core, you can get to pretty much anything, which opens us (ie: Solr maintainers) up to a lot of crazy code paths to worry about – so the fewer plugin types that we need to consider when making changes to SolrCore the better.
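The initialization ordering described in point 1 can be sketched as a two-phase load: every plugin is initialized first, and only then are the core-aware ones informed. The types below (`Plugin`, `CoreAware`, `Core`, `Loader`, `LookupPlugin`) are simplified stand-ins invented for this sketch; they are not Solr's real interfaces or signatures.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-ins for illustration; NOT Solr's real interfaces. */
interface Plugin {
    void init();
}

interface CoreAware {
    void inform(Core core);
}

final class Core {
    private final List<Plugin> plugins = new ArrayList<>();

    void register(Plugin p) {
        plugins.add(p);
    }

    int pluginCount() {
        return plugins.size();
    }
}

final class Loader {
    /** Phase 1: init and register every plugin. Phase 2: inform the core-aware ones. */
    static void load(Core core, List<Plugin> plugins) {
        for (Plugin p : plugins) {
            p.init();
            core.register(p);
        }
        for (Plugin p : plugins) {
            if (p instanceof CoreAware) {
                ((CoreAware) p).inform(core);
            }
        }
    }
}

/** A plugin that records how many peers were registered when it was informed. */
final class LookupPlugin implements Plugin, CoreAware {
    boolean initialized;
    int peersSeenAtInform = -1;

    @Override
    public void init() {
        initialized = true;
    }

    @Override
    public void inform(Core core) {
        peersSeenAtInform = core.pluginCount();
    }
}
```

In this sketch every plugin sees the full set of initialized peers at inform time, but nothing guarantees that a peer has itself been informed yet, which is the ordering hazard the comment describes.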

        In the case of SimilarityFactory, i'm not entirely sure how i feel about making it SolrCoreAware(able) ... we have tried really, REALLY hard to make sure nothing initialized as part of the IndexSchema can be SolrCore aware because it opens up the possibility of plugin behavior being affected by SolrCore configuration which might be different between master and slave machines – which could provide disastrous results. a schema.xml needs to be internally consistent regardless of what solrconfig.xml might reference it.

        In this case the real issue isn't that we have a use case where SimilarityFactory needs access to SolrCore – what it wants access to is the IndexSchema, so it might make sense to just provide access to that in some way w/o having to expose the entire SolrCore.

        Practically speaking, after re-skimming the patch: I'm not even convinced that would really add anything. refactoring/reusing some of the code that IndexSchema uses to manage dynamicFields might be handy for the SweetSpotSimilarityFactory, but i don't actually see how being able to inspect the IndexSchema to get the list of dynamicFields (or find out if a field is dynamic) would make it any better or easier to use. We'd still want people to configure it with field names and field name globs directly because there won't necessarily be a one to one correspondence between what fields are dynamic in the schema and how you want the sweetspots defined ... you might have a generic "en_" dynamicField in your schema for english text, and an "fr_" dynamicField for french text, but that doesn't mean the sweetspot for all "fr_" fields will be the same ... you are just as likely to want some very specific field names to have their own sweetspot, or to have the sweetspot be suffix based (ie: "_title" could have one sweetspot even if the resulting field names are fr_title and en_title).

        I think the patch could be improved, and i think there is definitely some code reuse possibility for parsing the field name globs, but i don't know that it really needs run time access to the IndexSchema (and it definitely doesn't need access to the SolrCore)
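A minimal sketch of the kind of field name glob parsing mentioned here, assuming a glob has at most one `*` and that it sits at either the start or the end of the pattern. The class `FieldGlob` and the patterns `fr_*` and `*_title` used with it are illustrative assumptions, not code from the patch or from IndexSchema.

```java
/** Hypothetical glob matcher for illustration; not taken from Solr. */
final class FieldGlob {
    private final String pattern;

    FieldGlob(String pattern) {
        this.pattern = pattern;
    }

    /** True if fieldName matches a leading-"*" suffix glob, a trailing-"*" prefix glob, or the exact name. */
    boolean matches(String fieldName) {
        if (pattern.startsWith("*")) {
            // Suffix glob such as "*_title"
            return fieldName.endsWith(pattern.substring(1));
        }
        if (pattern.endsWith("*")) {
            // Prefix glob such as "fr_*"
            return fieldName.startsWith(pattern.substring(0, pattern.length() - 1));
        }
        // No wildcard: exact field name
        return fieldName.equals(pattern);
    }
}
```

A field such as `fr_title` matches both `fr_*` and `*_title`, so any configuration built on globs like these would also need a precedence rule. That is independent of which fields the schema declares as dynamic.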

        Hoss Man added a comment -

        Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email...

        http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

        Selection criteria was "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. email notifications were suppressed.

        A unique token for finding these 240 issues in the future: hossversioncleanup20100527

        Jan Høydahl added a comment -

        Just pinging this issue to see if anyone picks it up again. Looks useful!

        Robert Muir added a comment -

        In my opinion the issue should be much simpler to solve in trunk now, as all the per-field stuff is now removed from SweetSpotSimilarity.

        Instead, you would just have the configurable sweetspot similarity, and assign different configurations to different fields via SOLR-2338, which will be responsible for the indexschema integration.

        Kevin Osborn added a comment -

        I had almost forgotten about this issue. I should be able to wrap this up soon.

        Robert Muir added a comment -

        Bulk move 3.2 -> 3.3

        Robert Muir added a comment -

        3.4 -> 3.5

        Hoss Man added a comment -

        Bulk of fixVersion=3.6 -> fixVersion=4.0 for issues that have no assignee and have not been updated recently.

        email notification suppressed to prevent mass-spam
        pseudo-unique token identifying these issues: hoss20120321nofix36

        Hoss Man added a comment -

        Ok, here's an all new patch for the post SOLR-2338 world order.

        Example syntax...

            <!-- using baseline TF -->
            <fieldType name="text_baseline" class="solr.TextField"
                       indexed="true" stored="false">
              <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
              <similarity class="solr.SweetSpotSimilarityFactory">
                <!-- TF -->
                <float name="baselineTfMin">6.0</float>
                <float name="baselineTfBase">1.5</float>
                <!-- plateau norm -->
                <int name="lengthNormMin">3</int>
                <int name="lengthNormMax">5</int>
                <float name="lengthNormSteepness">0.5</float>
              </similarity>
            </fieldType>
           
            <!-- using hyperbolic TF -->
            <fieldType name="text_hyperbolic" class="solr.TextField"
                       indexed="true" stored="false" >
              <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
              <similarity class="solr.SweetSpotSimilarityFactory">
                <float name="hyperbolicTfMin">3.3</float>
                <float name="hyperbolicTfMax">7.7</float>
                <double name="hyperbolicTfBase">2.718281828459045</double> <!-- e -->
                <float name="hyperbolicTfOffset">5.0</float>
                <!-- plateau norm, shallower slope -->
                <int name="lengthNormMin">1</int>
                <int name="lengthNormMax">5</int>
                <float name="lengthNormSteepness">0.2</float>
              </similarity>
            </fieldType>
        

        (it automatically detects whether to use hyperbolic or baseline tf depending on which settings are used)
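For readers unfamiliar with the two TF shapes, here is a self-contained sketch modeled on the baseline and hyperbolic TF functions as I understand them from Lucene's SweetSpotSimilarity. The class `SweetSpotTfSketch`, its method signatures, and the detection rule in `usesHyperbolicTf` are assumptions for illustration; the committed factory may validate and select settings differently.

```java
import java.util.Map;

/** Illustrative sketch only; not the committed SweetSpotSimilarityFactory code. */
final class SweetSpotTfSketch {

    /** Baseline TF: flat at "base" up to "min" occurrences, then grows like a square root. */
    static float baselineTf(float freq, float min, float base) {
        if (freq == 0.0f) {
            return 0.0f;
        }
        return (freq <= min) ? base : (float) Math.sqrt(freq + (base * base) - min);
    }

    /** Hyperbolic TF: a tanh-like curve rising from "min" to "max", centered on "offset". */
    static float hyperbolicTf(float freq, float min, float max, double base, float offset) {
        if (freq == 0.0f) {
            return 0.0f;
        }
        double x = freq - offset;
        double up = Math.pow(base, x);
        double down = Math.pow(base, -x);
        float result = min + (float) ((max - min) / 2.0f * ((up - down) / (up + down) + 1.0));
        // Very large freq overflows to Infinity/Infinity = NaN; treat that as the plateau.
        return Float.isNaN(result) ? max : result;
    }

    /** Assumed rule: any "hyperbolicTf*" setting selects hyperbolic TF, otherwise baseline. */
    static boolean usesHyperbolicTf(Map<String, ?> params) {
        for (String key : params.keySet()) {
            if (key.startsWith("hyperbolicTf")) {
                return true;
            }
        }
        return false;
    }
}
```

With the `text_hyperbolic` values above, a term frequency equal to the offset (5.0) lands halfway between the min and max, while very high frequencies level off at the max.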

        Anyone have any concerns?
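The "plateau norm" settings in the example syntax can be read as follows: field lengths between `lengthNormMin` and `lengthNormMax` get a norm of 1.0, and the norm falls off on either side at a rate set by `lengthNormSteepness`. The sketch below is modeled on the length norm formula as I understand it from Lucene's SweetSpotSimilarity; the class `PlateauNorm` and its signature are hypothetical.

```java
/** Illustrative sketch of a plateau-shaped length norm; not the committed code. */
final class PlateauNorm {

    /**
     * Returns 1.0 for min <= numTerms <= max, and a smaller value the further
     * numTerms lies outside that range. Higher steepness means a faster drop.
     */
    static float lengthNorm(int numTerms, int min, int max, float steepness) {
        // Zero inside the plateau, twice the distance to the nearest edge outside it.
        float distance = Math.abs(numTerms - min) + Math.abs(numTerms - max) - (max - min);
        return (float) (1.0 / Math.sqrt(steepness * distance + 1.0));
    }
}
```

For example, with min 3, max 5, and steepness 0.5, fields of 3 to 5 terms are not penalized, and fields of 1 or 7 terms receive the same reduced norm because they are equally far from the plateau.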

        Hoss Man added a comment -

        Updated patch (Forgot to include some "bad" config tests)

        Robert Muir added a comment -

        +1

        Commit Tag Bot added a comment -

        [trunk commit] Chris M. Hostetter
        http://svn.apache.org/viewvc?view=revision&revision=1450577

        SOLR-1365: New SweetSpotSimilarityFactory allows customizable TF/IDF based Similarity when you know the optimal "Sweet Spot" of values for the field length and TF scoring factors

        Hoss Man added a comment -

        Committed revision 1450577.
        Committed revision 1450579.

        Commit Tag Bot added a comment -

        [branch_4x commit] Chris M. Hostetter
        http://svn.apache.org/viewvc?view=revision&revision=1450579

        SOLR-1365: New SweetSpotSimilarityFactory allows customizable TF/IDF based Similarity when you know the optimal "Sweet Spot" of values for the field length and TF scoring factors (merge r1450577)

        Uwe Schindler added a comment -

        Closed after release.


          People

          • Assignee: Hoss Man
          • Reporter: Kevin Osborn
          • Votes: 1
          • Watchers: 2
