LUCENE-2959: [GSoC] Implementing State of the Art Ranking for Lucene


      Description

      Lucene employs the Vector Space Model (VSM) to rank documents, which compares
      unfavorably to state-of-the-art algorithms such as BM25. Moreover, the architecture is
      tailored specifically to VSM, which makes the addition of new ranking functions a
      non-trivial task.

      This project aims to bring state of the art ranking methods to Lucene and to implement a
      query architecture with pluggable ranking functions.

      The wiki page for the project can be found at http://wiki.apache.org/lucene-java/SummerOfCode2011ProjectRanking.

      Attachments

      1. implementation_plan.pdf
        49 kB
        David Mark Nemeskey
      2. LUCENE-2959_mockdfr.patch
        8 kB
        Robert Muir
      3. LUCENE-2959_nocommits.patch
        22 kB
        Robert Muir
      4. LUCENE-2959.patch
        435 kB
        Robert Muir
      5. LUCENE-2959.patch
        539 kB
        Robert Muir
      6. proposal.pdf
        85 kB
        David Mark Nemeskey


          Activity

          Andrzej Bialecki added a comment -

          You are probably familiar with this paper and the code... just in case I'm adding a reference here: http://arxiv.org/abs/0911.5046

          David Mark Nemeskey added a comment -

          Andrzej: thanks! Indeed, I have read that paper, but have only skimmed through the code. I am also aware of at least one BM25 implementation for Lucene, which may or may not be what issue LUCENE-2091 is about. I need to have a look into it.

          David Mark Nemeskey added a comment -

          Added a link to the issue mentioned in my last comment (and yet another one). The relation between them is not entirely clear to me yet; I should consult with Robert first.

          Robert Muir added a comment -

          Hi David, to try to help get things moving, I created a branch: https://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring

          This is the in-progress work from LUCENE-2392, which separates the scoring calculations from the postings-list matching. In short, Similarity becomes very low-level, but the idea is that you extend it to present a higher-level API (for example TFIDFSimilarity: http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/java/org/apache/lucene/search/TFIDFSimilarity.java) that is user-friendly and allows users to adjust parameters in a way that makes sense to that scoring system.
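
          For illustration, a minimal sketch of what extending such a high-level API can look like (assuming the 3.x-style tf()/idf() hooks; DefaultSimilarity supplies the remaining plumbing, and the exact signatures on the branch may differ):

            import org.apache.lucene.search.DefaultSimilarity;

            // Subclass the high-level TF-IDF API: override only the formula
            // hooks and let the framework handle matching, norms and weights.
            public class MyTfIdfSimilarity extends DefaultSimilarity {
              @Override
              public float tf(float freq) {
                return (float) Math.sqrt(freq);      // classic Lucene tf
              }

              @Override
              public float idf(int docFreq, int numDocs) {
                return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
              }
            }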

          As a start I implemented some very very rough/basic models in src/test:
          BM25: http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/test/org/apache/lucene/search/MockBM25Similarity.java

          Dirichlet LM: http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/test/org/apache/lucene/search/MockLMSimilarity.java

          But these are in no way correct or extensible or nice. For example, the BM25 similarity is slow, because as implemented its "average document length" is "live" (e.g. if you add more segments it's immediately adjusted for each query)... there is no caching at all.

          For example in this case, to speed up BM25, it could be nice for the Similarity to pull this up-front, and create cached calculations. If a user wants to refresh their bm25 stats then they could do something in SimilarityProvider to call it to recalculate the caches.

          However, for a user that wants "super-realtime" view, it might be better for it to stay the way it is now, or alternatively for the Sim to up-front do the 256 calculations per query (ideally in weight, not per-segment in docscorer) to tableize the length normalizations.
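
          For illustration, a hedged sketch of the "tableize" idea (decodeDocLength() and the surrounding names are placeholders, not the branch API):

            // Per query, in the weight: precompute the BM25 length-norm factor
            // for each of the 256 possible norm bytes, so per-document scoring
            // becomes a single array lookup.
            static float[] buildNormCache(float k1, float b, float avgDocLength) {
              float[] cache = new float[256];
              for (int i = 0; i < 256; i++) {
                float docLen = decodeDocLength((byte) i);  // placeholder decoder
                cache[i] = k1 * ((1 - b) + b * docLen / avgDocLength);
              }
              return cache;
            }
            // Per document, in the doc scorer:
            //   score = idf * freq * (k1 + 1) / (freq + cache[normByte & 0xFF])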

          So these are the API challenges we need to consider if we want to provide actual implementations of these scoring systems: how to make them perform close to or as fast as lucene's current scoring model.

          Separately on the issue, I want to make Weight completely opaque to the sim; really it's just a way for a Similarity to compute things up front (such as IDF, but maybe things like these bm25 length norm caches too). Currently it can only hold a single float value (see my un-sqrt'ing and other hacks in the Mock sims), so this should be fixed.

          Additionally, another big TODO: just as Scorer was split (maybe we should rename it to Matcher now that the sim does the calcs?), the process of Explanations needs to be split too, so that a Sim is completely responsible for explaining itself.

          Another TODO I have is to write the norm summation into the norms file as a single vlong, rather than computing it across all byte[] in SegmentReader like I do now... I just implemented it this way so that we could play with scoring algorithms easily.

          So, the good news would be that scoring is a lot more flexible, but the "bad" news is that in order to support lucene's features, implementing a new ranking system on top of Similarity is really serious work, as you need to:

          1. implement the lower-level API efficiently, yet expose a nice high-level API such as TFIDFSimilarity's tf() and idf() hooks for users.
          2. implement explanations so that users can debug relevance issues.
          3. think about allowing users to balance the various performance tradeoffs, such as the performance gained by caching things versus using realtime statistics (some of this could be in my head; maybe computing 256 norm decoder caches up-front is really cheap and a non-issue).
          4. consider how to integrate Lucene's features into the ranking system: for example, how to estimate a reasonable "phrase IDF" for phrase/multiphrase/span queries; how to integrate index-time boosts (in my example BM25 etc., I just made the documents appear shorter to accomplish this); and, depending upon how the length normalization is being stored in the index, how to pick the best quantization (might not be SmallFloat352), etc.
          5. do all the relevance testing to ensure that things are correct (I found lots of bugs doing rough testing on my Mock ones, and there are probably more, but on the couple of test collections I tried they seemed reasonable).
          6. add good-quality documentation, such as what we have today in TFIDFSimilarity, that explains how the ranking system works and how you can tune it.
          Robert Muir added a comment -

          David, for your perusal here is another sim I tried to write: DFR I(F)L2.

          It's probably got bugs, but it demonstrates again the challenges here.

          If we want to support ranking systems like this, how can they be made fast?

          The one I wrote has no score caching, so it does a lot of per-document divisions, multiplications, etc., and this is no good.

          So it's going to be hard to make these have competitive performance with Lucene's current scoring, which for TF < 32 is an array lookup and a single multiplication.
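
          For comparison, that fast path looks roughly like this (paraphrased from TermScorer, not a verbatim copy):

            class ScoreSketch {
              private static final int SCORE_CACHE_SIZE = 32;
              private final float[] scoreCache = new float[SCORE_CACHE_SIZE];
              private final float[] normDecoder = new float[256]; // decoded norm bytes
              private final float weightValue;                    // idf * boost, etc.

              ScoreSketch(float weightValue) {
                this.weightValue = weightValue;
                for (int tf = 0; tf < SCORE_CACHE_SIZE; tf++) {
                  scoreCache[tf] = tf(tf) * weightValue;          // cache small tf values
                }
              }

              float tf(float freq) { return (float) Math.sqrt(freq); }

              float score(int tf, byte normByte) {
                float raw = (tf < SCORE_CACHE_SIZE)
                    ? scoreCache[tf]                              // array lookup
                    : tf(tf) * weightValue;                       // rare slow path
                return raw * normDecoder[normByte & 0xFF];        // one multiplication
              }
            }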

          It's more obvious to me how to eke good performance out of the language modelling formula, because you can re-arrange the log and boil it down to some addition; but we need to get creative thinking about how to make some of these other models fast, and it's more complicated if you want to make, say, a DFR "framework" that allows you to pick the basic model and the 2 normalizations, versus specializing the code for each possibility (and there are many).
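
          To make the "re-arrange the log" point concrete, a sketch for Dirichlet smoothing (notation mine, not from the patch):

            \log p(t \mid d) \;=\; \log\frac{tf_{t,d} + \mu\, p(t \mid C)}{|d| + \mu}
                            \;=\; \log\bigl(tf_{t,d} + \mu\, p(t \mid C)\bigr) \;-\; \log\bigl(|d| + \mu\bigr)

          The term \mu\, p(t \mid C) is a per-query constant, and the subtracted logarithm depends only on the document length, so it can come from a small (e.g. 256-entry) table; only the first logarithm varies with tf.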

          My advice to you for GSOC would be to just pick one of these (e.g. BM25) and figure out how to do it really well, good performance, good api and documentation, and good relevance testing to ensure its quality.

          I'm more than happy to help with the boring parts like refactoring Lucene's Explanations API.

          David Mark Nemeskey added a comment -

          Robert: thanks for all the info! It's nice to see so much work has already been done. I plan to delve into it after the selection, and try to get other things out of the way until then, so that I can concentrate on GSoC during the summer.

          I think the main point would be to make the addition of a new ranking function as easy as possible. At least a prototype implementation should be very straightforward, even at the expense of performance. Then, if the new method provides good results, the developer can go on to the lower level to squeeze more juice out of it. It's hard for me to discuss this without knowing the code, of course, but do you think it is possible?

          Even though I added a "Performance" section to my proposal (http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/davidnemeskey/1), I see now that it's probably more important than I believed it to be at first. I think I will follow your advice and concentrate on how to make BM25F fast. It may be a tougher nut to crack than DFR, as the latter has logarithms scattered all over it. However, the first thing that comes to mind is that the tf–BM25 curve becomes almost flat very quickly (less so for a high k1 value, though). So it may be possible to pre-compute a tf map or array for a query, as sketched below.
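
          A hedged sketch of that precomputation (the cutoff is an arbitrary illustrative choice and decodeDocLength() is a placeholder; since the denominator K varies per document through the norm byte, the table is indexed by both the norm byte and tf):

            // Per query: tabulate the BM25 tf saturation for small tf values,
            // for each of the 256 encoded document lengths. Past the cutoff,
            // fall back to the exact formula (the curve is nearly flat there).
            static final int CUTOFF = 64;                      // arbitrary

            static float[][] buildTfTable(float k1, float b, float avgdl) {
              float[][] table = new float[256][CUTOFF];
              for (int n = 0; n < 256; n++) {
                float K = k1 * ((1 - b) + b * decodeDocLength((byte) n) / avgdl);
                for (int tf = 0; tf < CUTOFF; tf++) {
                  table[n][tf] = tf * (k1 + 1) / (tf + K);
                }
              }
              return table;
            }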

          Robert Muir added a comment -

          I think the main point would be to make the addition of a new ranking function as easy as possible. At least a prototype implementation should be very straightforward, even at the expense of performance. Then, if the new method provides good results, the developer can go on to the lower level to squeeze more juice out of it. It's hard for me to discuss this without knowing the code, of course, but do you think it is possible?

          This sounds great! For example, you could extend the low-level API, gather every possible statistic that Lucene has, and present a high-level API that looks more like Terrier's scoring API (which I'm guessing is what researchers would prefer?), where they basically implement the scoring in one method with all the stats there.

          So someone would extend this API to do prototyping; it would make it easier to experiment. Something like the sketch below.
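
          A sketch of the shape such a prototype-friendly API could take (all names here are hypothetical; as it happens, this is roughly the form the SimilarityBase mentioned later in this thread ended up with):

            // One method, all the statistics: a researcher implements only the
            // formula and the framework gathers everything else.
            public abstract class EasySimilarity {
              public static class BasicStats {
                public long docFreq;             // docs containing the term
                public long totalTermFreq;       // term occurrences in the collection
                public long numberOfDocuments;   // collection size
                public float avgFieldLength;     // average field length
              }

              /** Score one (term, document) match from the gathered statistics. */
              protected abstract float score(BasicStats stats, float freq, float docLen);
            }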

          I think I will follow your advice and concentrate on how to make BM25F fast.

          Actually as far as BM25f, this one presents a few challenges (some already discussed on LUCENE-2091).

          To summarize:

          • for any field, Lucene has a per-field terms dictionary that contains that term's docFreq. Computing BM25F's IDF would be challenging, because it wants a docFreq "across all the fields". (It's not clear to me at a glance from the original paper whether this should be across only the fields in the query or across all the fields in the document, and whether a "static" schema is implied in this scoring system; in Lucene, document 1 can have 3 fields and document 2 can have 40 different ones, even with different properties.)
          • the same issue applies to length normalization: Lucene has a "field length" but really no concept of document length (see the BM25F formulation sketched below).
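
          For reference, one common formulation of BM25F (after Robertson and Zaragoza; notation mine), which shows where the cross-field statistics come in:

            \tilde{tf}(t,d) \;=\; \sum_{f} w_f \cdot \frac{tf(t,f,d)}{1 + b_f \left( \frac{l_f(d)}{\bar{l}_f} - 1 \right)}
            \qquad
            score(q,d) \;=\; \sum_{t \in q} \frac{\tilde{tf}(t,d)}{k_1 + \tilde{tf}(t,d)} \cdot idf(t)

          Here idf(t) is computed from a document-level docFreq – exactly the "across all the fields" statistic that Lucene's per-field terms dictionaries do not provide.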

          So I just wanted to mention that while it's possible here to apply a per-field TF boost before the non-linear TF saturation, it's not immediately clear how to adapt the BM25F formula to Lucene: how to combine these scores without using a (wasteful) "catch-all field" and some lying behind the scenes to force this catch-all field's length normalization and docFreq to be used.

          Too many questions arise for BM25F and how it would "fit" with Lucene, for example the fact that "multiple fields" can really mean anything, and having a field in Lucene doesn't mean at all that it was in your original document! For example, Solr users frequently use a "copyField" to take the content of one field and duplicate it to a different field (perhaps applying some processing). In terms of things like length normalization, it seems that a "document length" calculated as the sum across the fields would be wrong for many use cases.

          I only wanted to recommend against this one because of this rather serious challenge; it seems it's something we might want to table at the moment: Lucene is changing fast, and as new capabilities arise we might realize there is a more elegant way to address this... but at the moment I think I would recommend starting with BM25.

          David Mark Nemeskey added a comment -

          Robert,

          As for the problems with BM25F

          • for any field, Lucene has a per-field terms dictionary that contains that term's docFreq. Computing BM25F's IDF would be challenging, because it wants a docFreq "across all the fields".
          • the same issue applies to length normalization: Lucene has a "field length" but really no concept of document length.

          One thing that is not clear to me is why these limitations would not be a problem for BM25. As I see it, the difference between the two methods is that BM25 simply computes tfs, idfs and document length from the whole document – which, according to what you said, is not available in Lucene. That's why I figured that a variant of BM25F would actually be more straightforward to implement.

          (It's not clear to me at a glance from the original paper whether this should be across only the fields in the query or across all the fields in the document, and whether a "static" schema is implied in this scoring system; in Lucene, document 1 can have 3 fields and document 2 can have 40 different ones, even with different properties.)

          Actually, I am not sure there is a consensus on what BM25F actually is. For example, the BM25 formula can be applied to the weighted sum of field tfs, or alternatively, the per-field BM25 scores can be summed after normalization. I've seen both called (maybe incorrectly) BM25F; see the two formulations below.
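
          The two variants, side by side (notation mine):

            score_A(q,d) \;=\; \sum_{t \in q} \frac{\sum_f w_f\, tf(t,f,d)}{k_1 + \sum_f w_f\, tf(t,f,d)} \cdot idf(t)
            \qquad\text{vs.}\qquad
            score_B(q,d) \;=\; \sum_f w_f \cdot BM25(q, d_f)

          Variant A combines field frequencies before the non-linear saturation (often considered the canonical BM25F); variant B saturates per field and then combines, which is closer to what a per-field index supports directly.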

          If I understand correctly, the current scoring algorithm takes into account only the fields explicitly specified in the query. Is that right? If so, I see no reason why BM25 should behave otherwise. Which of course also means that we probably won't be able to save the summed doc length and IDF.

          Robert, would you be so kind as to have a look at my proposal? It can be found at http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/davidnemeskey/1. It's basically the same as what I sent to the mailing list. I wrote that I want to implement BM25, BM25F and DFR ("the framework", I meant, with one or two smoothing models), as well as to convert the original scoring to the new framework. In light of the thread here, I guess it would be better to modify these goals, perhaps by:

          • deleting the conversion part?
          • committing myself to BM25/BM25F only?
          • explicitly stating that I want a higher level API based on the low-level one?

          As for the last item, that only applies if I continue/join the work in LUCENE-2392. Since I guess nobody wants two ranking frameworks, of course I will; but then, in this part of the proposal, should I just concentrate on the higher-level API?

          Thanks!

          Robert Muir added a comment -

          One thing that is not clear to me is why these limitations would not be a problem for BM25. As I see it, the difference between the two methods is that BM25 simply computes tfs, idfs and document length from the whole document – which, according to what you said, is not available in Lucene. That's why I figured that a variant of BM25F would actually be more straightforward to implement.

          A variant sounds really interesting? I think you know better than me here, I just looked at the original paper and thought to myself that to implement this "by the book" might not be feasible for a while.

          Robert, would you be so kind to have a look at my proposal? It can be found at http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/davidnemeskey/1. It's basically the same as what I sent to the mailing list. I wrote that I want to implement BM25, BM25F and DFR ("the framework", I meant with one or two smoothing models), as well as to convert the original scoring to the new framework. In light of the thread here, I guess it would be better to modify these goals, perhaps by:

          deleting the conversion part?
          committing myself to BM25/BM25F only?
          explicitly stating that I want a higher level API based on the low-level one?

          I think you can decide what you want to do? Obviously I would love to see all of it done.

          But its your choice, I could see you going a couple different ways:

          • closer to your original proposal, you could still develop a flexible scoring API on top of Similarity. Hey, all I did was move stuff from Scorer to Similarity really, which does give flexibility, but it's probably not what an IR researcher would want (it's low-level and confusing). So you could make a "SimpleSimilarity" or "EasySimilarity" or something that presents a much simpler API (something closer to what Terrier/Indri present) on top of this, for easily implementing ranking functions? I think this would be extremely valuable long-term: who cares if we have a low-level flexible scoring API that only speed demons like, but IR practitioners find confusing and hideous? Someone who is trying to experiment with an enhancement to relevance likely doesn't care if their TREC run takes 30 seconds instead of 20 seconds, if the API is really easy and they aren't wasting time fighting with Lucene. If you go this route, you could implement BM25, DFR, etc. as you suggested, as examples of how to use this API, and there would be more of a focus on API quality and simplicity instead of performance.
          • or alternatively, you could refine your proposal to implement a really "production strength" version of one of these scoring systems on top of the low-level API, that would ideally have competitive performance/documentation/etc with Lucene's default scoring today. If you decide to do this, then yes, I would definitely suggest picking only one, because I think its a ton of work as I listed above, and I think there would be more focus on practical things (some probably being nuances of lucene) and performance.
          David Mark Nemeskey added a comment -

          I think you can decide what you want to do?

          Fair enough. I guess I'll stick with my original proposal then, though I might change a few things here and there; maybe change the focus from flexibility (as it seems to be already underway) to simplicity.

          Robert Muir added a comment -

          Setting myself as assignee as I'd like to mentor this one.

          David Mark Nemeskey added a comment -

          Thanks Robert, that would be terrific.

          David Mark Nemeskey added a comment -

          Robert: maybe we could resolve this issue as well? Once we decide what to do with LUCENE-3173 – perhaps a Won't Fix?

          Robert Muir added a comment -

          I think we can defer that one or just leave it open for consideration later.

          As far as this issue, let's keep it open until we merge the branch to trunk!

          Robert Muir added a comment -

          I rearranged the BM25 in the branch a little bit; it's now as fast as Lucene's ranking formula:

                          Task   QPS tfidf StdDev tfidf   QPS bm25 StdDev bm25      Pct diff
                      SpanNear        4.29        0.52        4.14        0.49  -24% -   22%
                        Phrase        3.97        0.25        3.89        0.25  -13% -   11%
                          Term       82.18        4.78       81.00        2.56   -9% -    7%
                TermBGroup1M1P       83.30        2.41       82.12        2.20   -6% -    4%
                  SloppyPhrase        8.03        0.31        7.93        0.43  -10% -    8%
                   AndHighHigh       19.38        0.59       19.16        0.71   -7% -    5%
                      PKLookup      175.49        4.33      173.67        4.20   -5% -    3%
                    AndHighMed       40.99        1.12       40.71        1.07   -5% -    4%
                   TermGroup1M       25.69        0.39       25.69        0.44   -3% -    3%
                        Fuzzy2       42.62        1.83       42.65        1.80   -8% -    8%
                        Fuzzy1       91.74        3.48       91.86        3.44   -7% -    7%
                       Respell       73.96        3.30       74.18        3.29   -8% -    9%
                      Wildcard       56.33        0.97       56.60        1.08   -3% -    4%
                       Prefix3       33.36        0.83       33.59        0.97   -4% -    6%
                  TermBGroup1M       55.58        1.03       56.17        0.88   -2% -    4%
                        IntNRQ       13.38        0.74       13.58        0.94  -10% -   14%
                     OrHighMed       11.71        1.18       11.94        0.97  -14% -   22%
                    OrHighHigh        8.91        0.74        9.13        0.63  -11% -   19%
          
          David Mark Nemeskey added a comment -

          Hi Robert,

          I would very much like to run this test on the other sims as well. How do I do that?

          David

          Robert Muir added a comment -

          There is a project here that lets you benchmark using wikipedia: http://code.google.com/a/apache-extras.org/p/luceneutil/

          You need this patch to benchmark Similarities: http://code.google.com/a/apache-extras.org/p/luceneutil/issues/detail?id=6

          (More information on how to get started here: http://code.google.com/a/apache-extras.org/p/luceneutil/source/browse/README.txt)

          Robert Muir added a comment -

          Patch removing all nocommits.

          For the fake IDF/phrase issue, I thought it best not to feed "fake" statistics to SimilarityBase, since the whole point is to make it simpler to implement/test ranking models.

          Instead it sums scores across terms (kind of like BooleanQuery).

          For DFR P and D, I don't think there are really any great practical ways out of the fundamental problem. I added notes to both of these.

          I think the workaround for Dirichlet is fine; I looked around and found another implementation of this smoothing by Hiemstra and it had the same workaround (http://mirex.sourceforge.net, trec.nist.gov/pubs/trec19/papers/univ.twente.web.rev.pdf).

          All the other similarities seem to work fine when randomly swapped into Lucene's tests.

          Robert Muir added a comment -

          Attached is a diff (made with dev-tools/scripts/diff-sources.py) between the branch and trunk.

          I think this is ready.

          Robert Muir added a comment -

          Oops, I had some stuff in solr/example/work that caused a lot of noise in the patch.

          Robert Muir added a comment -

          Committed to trunk: Thank you David for the fantastic work here!

          Michael McCandless added a comment -

          Thanks David and Robert!

          What an incredible step forward: now you can easily try out all sorts of pre-existing scoring models, or make your own. Yay

          hadas raviv added a comment -

          Hi,

          First of all, I would like to thank you for the great contribution you made by adding state-of-the-art ranking methods to Lucene. I was waiting for these features for a long time, since they enable an IR researcher like me to use Lucene, which is a powerful tool, for research purposes.

          I downloaded the latest version of Lucene trunk and played a little with the models you implemented. There is a question I have, and I would really appreciate your answer (my apologies in advance – I'm new to Lucene, so maybe this question is trivial for you):

          I saw that you didn't change the default implementation of lucene for coding the document length which is used for ranking in language models (one byte for coding the document length together with boosting). Why did you decide that? Is it possible to save the "real" document length coded in some other way (maybe with the new flexible index)? Is there any example for such an implementation? It is just that I'm concerned with the effect of using an inaccurate document length on results quality. Did you check this issue?

          In addition – do you know about intentions to implement some more advanced ranking models (such as relevance models, MRF) in the near future?

          Thanks in advance,
          Hadas

          Robert Muir added a comment -

          I saw that you didn't change the default implementation of lucene for coding the document length which is used for ranking in language models (one byte for coding the document length together with boosting). Why did you decide that?

          So that you can switch between ranking models without re-indexing.
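
          A small demonstration of why this works (SmallFloat and the floatToByte315/byte315ToFloat pair are the real org.apache.lucene.util helpers; the reinterpretation comment reflects my understanding of the branch's length-based models):

            import org.apache.lucene.util.SmallFloat;

            public class NormByteDemo {
              public static void main(String[] args) {
                int fieldLength = 100;                // tokens in the field
                float boost = 1.0f;
                // Classic norm: boost / sqrt(length), squeezed into one byte.
                byte norm = SmallFloat.floatToByte315(boost / (float) Math.sqrt(fieldLength));
                float decoded = SmallFloat.byte315ToFloat(norm);
                // A length-based model reinterprets the very same byte as an
                // (approximate) length, so no re-indexing is needed:
                float approxLength = 1f / (decoded * decoded); // ~ fieldLength
                System.out.println(norm + " " + decoded + " " + approxLength);
              }
            }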

          It is just that I'm concerned with the effect of using an inaccurate document length on results quality. Did you check this issue?

          I ran experiments on this a long time ago; the changes were not statistically significant.
          But there is an issue open to still switch norms to docvalues fields, for other reasons: LUCENE-3221.

          In addition - do you know about intentions to implement some more advanced ranking models (such as relevance models, mrf) in the near future?

          No, there won't be any additional work on this issue; GSoC is over.


            People

            • Assignee: Robert Muir
            • Reporter: David Mark Nemeskey
            • Votes: 1
            • Watchers: 5


                Time Tracking

                • Original Estimate: 343h
                • Remaining Estimate: 343h
                • Time Spent: Not Specified
