Solr
  1. Solr
  2. SOLR-2754

create Solr similarity factories for new ranking algorithms

    Details

    • Type: New Feature New Feature
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      To make it easy to use some of the new ranking algorithms, we should add factories to solr:

      • for parametric models like LM and BM25 so that parameters can be set from schema.xml
      • for framework models like DFR and IB, so that different basic models/normalizations/lambdas can be chosen
      1. SOLR-2754.patch
        25 kB
        Robert Muir
      2. SOLR-2754.patch
        24 kB
        David Mark Nemeskey
      3. SOLR-2754.patch
        82 kB
        Robert Muir

        Activity

        Hide
        Robert Muir added a comment -

        untested patch, i'll work up tests tomorrow.

        Show
        Robert Muir added a comment - untested patch, i'll work up tests tomorrow.
        Hide
        David Mark Nemeskey added a comment -

        Robert, I've reviewed the patch. Even though I don't have any experience with Solr, the code is very clear, well documented and easy to understand. I have the following observations (or questions, more like):

        1. LMDirichletSimilarity has a mu-less constructor. Maybe we could avoid defining a constant in two places if we used that? E.g.

        mu = params.getFloat("mu");
        ...
        
        LMDirichletSimilarity sim = (mu != null) ? new LMDirichletSimilarity(mu)
                                                 : new LMDirichletSimilarity();
        

        Same goes for H3 and Z.

        2. I think it is a nice feature of the new framework that the user can create new basic models, normalizations, distributions, etc. and just plug them in to DFRSimilarity or IBSimilarity. However, these factories can only handle those that we have defined ourselves. Wouldn't it be good if we could instantiate custom classes via reflection? It could work similarily as in Terrier: keep the current code for core models, and use reflection if the user specifies a (fully specified) classname.

        3. I don't know the Lucene/Solr conventions for line length. There are some rather long lines in IB and DFR, but maybe its not a problem?

        Show
        David Mark Nemeskey added a comment - Robert, I've reviewed the patch. Even though I don't have any experience with Solr, the code is very clear, well documented and easy to understand. I have the following observations (or questions, more like): 1. LMDirichletSimilarity has a mu-less constructor. Maybe we could avoid defining a constant in two places if we used that? E.g. mu = params.getFloat( "mu" ); ... LMDirichletSimilarity sim = (mu != null ) ? new LMDirichletSimilarity(mu) : new LMDirichletSimilarity(); Same goes for H3 and Z. 2. I think it is a nice feature of the new framework that the user can create new basic models, normalizations, distributions, etc. and just plug them in to DFRSimilarity or IBSimilarity . However, these factories can only handle those that we have defined ourselves. Wouldn't it be good if we could instantiate custom classes via reflection? It could work similarily as in Terrier: keep the current code for core models, and use reflection if the user specifies a (fully specified) classname. 3. I don't know the Lucene/Solr conventions for line length. There are some rather long lines in IB and DFR, but maybe its not a problem?
        Hide
        Robert Muir added a comment -

        Thanks for the review David!

        LMDirichletSimilarity has a mu-less constructor. Maybe we could avoid defining a constant in two places if we used that? E.g.

        Same goes for H3 and Z.

        +1, I think it would be good (though probably unavoidable for e.g. BM25's (k1, b) to do it this way if we want to provide default parameters.

        Alternative, another idea would be for all 'parametric' models to require the parameter? Then in the "still-to-be-written" test config
        that tests all these sims, we would just have good default parameters specified? Part of me likes this solution: if you are using a parametric
        model then it requires you to think about it?

        Wouldn't it be good if we could instantiate custom classes via reflection?

        We could add this, e.g. if the parameter doesnt match any of the supplied names. But i started thinking about this, say I created NormalizationRob,
        and it wants a bunch of parameters... at the end of the day for practical purposes a user could just make their own simple factory that uses
        I(F)BRob(2.3, 4.5, 6.99) or whatever they wanted, because I think the intent here is to support all of lucene-core's capabilities?
        Its still pluggable in the sense someone can always make their own factory for their custom stuff.

        I don't know the Lucene/Solr conventions for line length. There are some rather long lines in IB and DFR, but maybe its not a problem?

        Yeah maybe especially for the javadocs, but we can probably re-arrange these.

        Show
        Robert Muir added a comment - Thanks for the review David! LMDirichletSimilarity has a mu-less constructor. Maybe we could avoid defining a constant in two places if we used that? E.g. Same goes for H3 and Z. +1, I think it would be good (though probably unavoidable for e.g. BM25's (k1, b) to do it this way if we want to provide default parameters. Alternative, another idea would be for all 'parametric' models to require the parameter? Then in the "still-to-be-written" test config that tests all these sims, we would just have good default parameters specified? Part of me likes this solution: if you are using a parametric model then it requires you to think about it? Wouldn't it be good if we could instantiate custom classes via reflection? We could add this, e.g. if the parameter doesnt match any of the supplied names. But i started thinking about this, say I created NormalizationRob, and it wants a bunch of parameters... at the end of the day for practical purposes a user could just make their own simple factory that uses I(F)BRob(2.3, 4.5, 6.99) or whatever they wanted, because I think the intent here is to support all of lucene-core's capabilities? Its still pluggable in the sense someone can always make their own factory for their custom stuff. I don't know the Lucene/Solr conventions for line length. There are some rather long lines in IB and DFR, but maybe its not a problem? Yeah maybe especially for the javadocs, but we can probably re-arrange these.
        Hide
        David Mark Nemeskey added a comment -

        Alternative, another idea would be for all 'parametric' models to require the parameter? ... Part of me likes this solution: if you are using a parametric model then it requires you to think about it?

        I can understand the reasoning behind this idea. On the other hand, for some models, the parameter has a value that's optimal in a wide range of cases. In such cases, I think it we could make the life of the user easier by falling back to this value. (Actually, that's why LMJelinekMercerSimilarity does not have a default constructor; there is no single parameter value that is kind-of-optimal in all cases).

        But i started thinking about this, say I created NormalizationRob, and it wants a bunch of parameters...

        Yes, I know, it'd be a bit difficult to support that... maybe if all Similarities and models had a constructor with a map as a parameter? I'm not sure we want that, though.

        I think the intent here is to support all of lucene-core's capabilities?

        In that case let's forget reflection for now.

        Show
        David Mark Nemeskey added a comment - Alternative, another idea would be for all 'parametric' models to require the parameter? ... Part of me likes this solution: if you are using a parametric model then it requires you to think about it? I can understand the reasoning behind this idea. On the other hand, for some models, the parameter has a value that's optimal in a wide range of cases. In such cases, I think it we could make the life of the user easier by falling back to this value. (Actually, that's why LMJelinekMercerSimilarity does not have a default constructor; there is no single parameter value that is kind-of-optimal in all cases). But i started thinking about this, say I created NormalizationRob, and it wants a bunch of parameters... Yes, I know, it'd be a bit difficult to support that... maybe if all Similarities and models had a constructor with a map as a parameter? I'm not sure we want that, though. I think the intent here is to support all of lucene-core's capabilities? In that case let's forget reflection for now.
        Hide
        Robert Muir added a comment -

        I can understand the reasoning behind this idea. On the other hand, for some models, the parameter has a value that's optimal in a wide range of cases. In such cases, I think it we could make the life of the user easier by falling back to this value. (Actually, that's why LMJelinekMercerSimilarity does not have a default constructor; there is no single parameter value that is kind-of-optimal in all cases).

        Well, we can do both: we can provide these basic parameters as default values to be friendly, but at the same time in the test or example xml configurations that use these, our examples can have the parameters set. Even in the JelinekMercer case, our example can also be set to 0.7, because thats the default for long queries and you typically don't use this smoothing for short queries (you would usually use Dirichlet instead), at least that was my reasoning with the default.

        Yes, I know, it'd be a bit difficult to support that... maybe if all Similarities and models had a constructor with a map as a parameter? I'm not sure we want that, though.

        Yeah, I think we want to have hard type-safe apis for the sims themselves, and part of my line of thinking is the case of "I'm going to plug in a custom normalization into DFR" is a pretty expert case for a Solr user at this moment, if you are that expert you could also write a 3 LOC sim factory that sets up your sim with your custom normalization method.

        Show
        Robert Muir added a comment - I can understand the reasoning behind this idea. On the other hand, for some models, the parameter has a value that's optimal in a wide range of cases. In such cases, I think it we could make the life of the user easier by falling back to this value. (Actually, that's why LMJelinekMercerSimilarity does not have a default constructor; there is no single parameter value that is kind-of-optimal in all cases). Well, we can do both: we can provide these basic parameters as default values to be friendly, but at the same time in the test or example xml configurations that use these, our examples can have the parameters set. Even in the JelinekMercer case, our example can also be set to 0.7, because thats the default for long queries and you typically don't use this smoothing for short queries (you would usually use Dirichlet instead), at least that was my reasoning with the default. Yes, I know, it'd be a bit difficult to support that... maybe if all Similarities and models had a constructor with a map as a parameter? I'm not sure we want that, though. Yeah, I think we want to have hard type-safe apis for the sims themselves, and part of my line of thinking is the case of "I'm going to plug in a custom normalization into DFR" is a pretty expert case for a Solr user at this moment, if you are that expert you could also write a 3 LOC sim factory that sets up your sim with your custom normalization method.
        Hide
        David Mark Nemeskey added a comment -

        Well, we can do both: we can provide these basic parameters as default values to be friendly, but at the same time in the test or example xml configurations that use these, our examples can have the parameters set.

        That's a good idea. I could modify the patch if you want to, and also break the long lines into two in the meantime.

        Show
        David Mark Nemeskey added a comment - Well, we can do both: we can provide these basic parameters as default values to be friendly, but at the same time in the test or example xml configurations that use these, our examples can have the parameters set. That's a good idea. I could modify the patch if you want to, and also break the long lines into two in the meantime.
        Hide
        Robert Muir added a comment -

        +1

        Show
        Robert Muir added a comment - +1
        Hide
        David Mark Nemeskey added a comment -

        Done.

        Show
        David Mark Nemeskey added a comment - Done.
        Hide
        Robert Muir added a comment -

        Thanks David, I took your patch and I'm adding tests right now, will upload a new patch soon.

        Show
        Robert Muir added a comment - Thanks David, I took your patch and I'm adding tests right now, will upload a new patch soon.
        Hide
        Robert Muir added a comment -

        i added tests for the new factories: i think its ready to commit.

        Show
        Robert Muir added a comment - i added tests for the new factories: i think its ready to commit.
        Hide
        Robert Muir added a comment -

        Thanks David!

        Show
        Robert Muir added a comment - Thanks David!

          People

          • Assignee:
            Robert Muir
            Reporter:
            Robert Muir
          • Votes:
            2 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development