Details

• Type: New Feature
    • Status: Closed
• Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: modules/other
• Labels: None

      Description

      http://nlp.uned.es/~jperezi/Lucene-BM25/ describes an implementation of Okapi-BM25 scoring in the Lucene framework,
      as an alternative to the standard Lucene scoring (which is a version of mixed boolean/TFIDF).
      I have refactored this a bit, added unit tests and improved the runtime somewhat.
      I would like to contribute the code to Lucene under contrib.

Attachments

1. BM25SimilarityProvider.java (3 kB, Robert Muir)
2. LUCENE-2091.patch (127 kB, Yuval Feinstein)
3. persianlucene.jpg (18 kB, Robert Muir)

        Issue Links

          Activity

          Robert Muir added a comment -

          I am using a refactored version of this too, looking forward to seeing your implementation!

Otis Gospodnetic added a comment - edited

Has anyone compared this particular BM25 impl. to Lucene's current quasi-VSM approach in terms of:

          • any of the relevance eval methods
          • indexing performance
          • search performance
          • ...

          Aha, I found something:
          http://markmail.org/message/c2r4v7zj7mduzs5d

          Also, this issue is marked as contrib/*. Should this not go straight to core, so more people actually use this and provide feedback? Who knows, there is a chance (ha!) BM25 might turn out better than the current approach, and become the default.

Robert Muir added a comment - edited

Otis, attached is a graph I produced from the Hamshahri corpus, comparing 4 different combinations:

• Lucene SimpleAnalyzer
• Lucene SimpleAnalyzer + BM25
• Lucene PersianAnalyzer
• Lucene PersianAnalyzer + BM25

The Hamshahri corpus uses a standardized encoding of Persian (i.e. the normalization filter is a no-op), so any analyzer gain is strictly due to "stopwords", although in Persian I wouldn't really call some of these stopwords.

This was mostly to show that the analyzer is actually useful, i.e. the scoring system can't completely make up for a lack of language support like this.

By the way, you can play around with the OpenRelevance SVN and duplicate my experiments on this same corpus yourself if you want. There's an Indonesian corpus there too. I've also tested Hindi with this impl.

          Yuval Feinstein added a comment -

          Otis and Robert, Here's my (limited) experience with BM25:
          On a proprietary corpus (alas) I got a nice improvement, which was more pronounced in recall
          (hits that were previously not ranked as top ones, and therefore remained unseen, now appear in the top results).
          I have worked on lowering the BM25 run time to a reasonable level.
          I hope that once this gets into the hands of the Lucene community, BM25 performance
          will approach the current Lucene scoring's performance. This is a tall order,
          as the latter has been in the works for the last eight years or so.
As for use cases: in my use case BM25 helps, and I believe this may be true for other cases as well.

          Robert Muir added a comment -

Yuval, BM25 has been working nicely for me too.
On some collections it really helps, but I haven't yet found a case where it hurts (compared to Lucene's current scoring algorithm).

Thanks in advance for working on this!

Joaquin Perez-Iglesias added a comment - edited

          Hi Otis, Robert and Yuval.

I developed this add-on for Lucene in 2008 for some experiments I was doing, and I would like to share my impressions.

In my experience, and after reading a lot of papers, I have never found a case where the Lucene-VSM implementation outperforms BM25.
BM25 (with standard parameters) outperforms Lucene-VSM; moreover, there is further room for improvement if the parameters are tuned specifically for the collection. I published some results with the EuroGov collection some time ago.

I can show you some experiments with the TREC Disks 4&5 collection; these results were obtained with default parameters on the Robust track topics. As you can see, BM25 improves on the Lucene-VSM ranking function.

        MAP     P@5
VSM     0.2079  0.4096
BM25    0.2340  0.4578

This implementation is getting more popular and I know that some people are using it in their research, so it would be really nice if at some point it were included in the core.

The only concerns I have about it are:

• Only simple Boolean queries based on terms are supported (with the operators OR, AND, NOT). For instance, it does not support PhraseQuery.
• IDF cannot be calculated at a document level (this is important for BM25F).
• Another issue is computing the average document length, but this could easily be solved.

These issues are described in detail in the documentation that I made public on my website.

          Thanks to all for your interest and work.

          Joaquin Perez-Iglesias

          Robert Muir added a comment -

In my experience, and after reading a lot of papers, I have never found a case where the Lucene-VSM implementation outperforms BM25.
BM25 (with standard parameters) outperforms Lucene-VSM; moreover, there is room for improvement if the parameters are tuned specifically for the collection.

Just wanted to mention that in the results I have provided, I never changed the default parameters (b, k1) either.

          Yuval Feinstein added a comment -

          I have attached a patch for this case. This is Joaquin's BM25 and BM25F library. I refactored it a bit for speed. It should work with the trunk. I have tested it lightly, mainly through unit tests. Your feedback is more than welcome.

          Yuval Feinstein added a comment -

          So what is the next step? Robert, do you have time to look at this? Or should I assign this to someone else? And how?
          Sorry if these are trivial questions, but I'm a newbie...

Robert Muir added a comment - edited

          Hi Yuval, I see your patch, I can help with some relevance testing and comments.

I don't know if it should be assigned to me; maybe we can trick one of the devs who really knows the scoring system well into looking at it, especially regarding performance and things like that.

Here is the first thing I noticed, maybe I am completely stupid but I never understood this:

I don't understand why we need BM25Boolean.* and everything like that; they seem to be duplicates of BooleanQuery etc. that just sum up subscorers.

So in my usage I dropped them. I just have BM25TermQuery, BM25TermScorer, and BM25Parameters, and to use them I override a method in QueryParser.

          edit: by the way, I don't want to imply that what I am doing is "best" either, because I don't think it is, only that this would be one way to simplify the code a lot as a first step.
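
For reference, here is roughly the override I mean (a sketch only, not from the patch; I'm assuming BM25TermQuery can be constructed from a plain Term):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class BM25QueryParser extends QueryParser {
  public BM25QueryParser(Version matchVersion, String field, Analyzer analyzer) {
    super(matchVersion, field, analyzer);
  }

  // route plain term queries to the BM25 scorer instead of the default TermQuery
  @Override
  protected Query newTermQuery(Term term) {
    return new BM25TermQuery(term);
  }
}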

Uwe Schindler added a comment - edited

I was wondering about the separate BooleanQuery too, as it is almost simply a copy (of an old version of it). The question is more: why do we need the BM25 classes at all? Why should it not be possible to use normal term queries and other query types together with BM25 by just changing some scoring defaults? So replace Similarity and maybe have a switch inside the Scorers, so that TermQuery could be switched to BM25 mode and then use another Scorer, or something like that.

That was just my first impression: these additional classes do not look like a good public API to me. Query classes should be abstract wrappers for Weights and Scorers. The internal impl, like BM25 or conventional scoring, should be hidden from the user (with maybe properties, e.g. on the IndexSearcher, to enable BM25 scoring). This way it could also be used for other query types (not only TermQuery/BooleanQuery), but e.g. for function queries (to further change the score) or FuzzyQuery and whatever else.

          If what I said is complete nonsense, don't hurt me, I do not know much about BM25, but for me it is an implementation detail and not part of a public API.

          Robert Muir added a comment -

          Uwe, yes I agree it would be nicer to do a tighter integration.

I am suggesting we tackle it one step at a time: first we can answer this question, then we can talk about average document length and other tricky things like that.

For this I calculate it from norms, as Doug suggested here: https://issues.apache.org/jira/browse/LUCENE-965?focusedCommentId=12515803&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12515803 , but let's get to this later.

          Robert Muir added a comment -

Hi Yuval, I think we should also consider BM25F as the really relevant version for Lucene, since it is field-oriented (and consider dropping document-oriented BM25).

BM25F makes more sense for Lucene, in my opinion. If you are searching BODY for some keywords, who cares how long the AUTHOR field is!

I think narrowing the focus to BM25F would eliminate confusion about things such as "average document length"; then we work with just fields and "average field length", and more people will have ideas.

          Otis Gospodnetic added a comment -

          +1 for skipping BM25 and going straight to BM25F.

I think the answer to Uwe's question about why this can't just be a different Similarity or some such is that BM25 requires some data that Lucene currently doesn't collect. That's why there were some of those static methods in the examples on the author's site. I think what I'm saying is correct.

          Joaquin Perez-Iglesias added a comment -

          Hi everybody,

I'm going to try to answer some of your questions. When I started to develop this library I didn't want to modify the Lucene code; moreover, I tried to create a jar that could be added directly to the official Lucene distribution. That is the main reason why there are some duplicated classes.
So yes, a tighter integration would be better, and I believe we would get more support for different query types.

Regarding BM25 vs. BM25F: they are equivalent, BM25F is just the version for more than one field, so yes, go for BM25F.
What is really important is the way boost factors are applied; as you can see in the equation, these must be applied to the raw frequencies and not to normalized or saturated frequencies.
(Currently Lucene applies them after normalization and saturation of the frequencies, which in my opinion is not the best approach.)
A more detailed explanation of BM25F and this issue can be found in this paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.9.5255

The problem, as I said, comes from IDF. In the BM25 family of equations, IDF is always computed at the document level (that is why I recommend, as a heuristic, using the field with the most terms, or a special field that contains all the terms). As far as I know that is a problem because Lucene doesn't store the document frequency per document but per field.

Otis is right: as far as I know, just changing the Similarity is not enough. Some data is not available to TermScorer nor to Similarity, and TermScorer applies the values obtained from Similarity in a way that makes it incompatible with BM25.
It is really important to follow the steps in the order they appear in my explanation (a rough sketch follows at the end of this comment):

1. Normalize frequencies with the document/field length and the b factor.
2. Saturate the effect of the frequency with k1.
3. Compute the sum of the term weights.
4. Apply IDF.

          I really believe that this can be done (not sure how), so maybe we will need the suggestions of some 'scorer guru'.
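
As a rough sketch of those four steps in plain Java (just the bare per-term formula, not the library code; the variable names are illustrative):

// per-term BM25 contribution, following the four steps above
// tf: raw frequency of the term in the document/field
// docLen / avgDocLen: length of this document (or field) and the collection average
// b, k1: the usual BM25 parameters; n: number of documents; df: docs containing the term
float bm25TermScore(float tf, float docLen, float avgDocLen,
                    float b, float k1, int n, int df) {
  float normalizedTf = tf / ((1 - b) + b * (docLen / avgDocLen));   // step 1
  float saturatedTf  = normalizedTf / (normalizedTf + k1);          // step 2
  float idf = (float) Math.log((n - df + 0.5) / (df + 0.5));        // step 4
  return saturatedTf * idf;  // step 3 is summing this value over all query terms
}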

Uwe Schindler added a comment - edited

          Thanks for the explanation!

About the IDF: the problem with a per-document IDF in Lucene would be that most users add fields that are e.g. catch-all fields (which would give the per-doc IDF), but in addition they add special fields like numeric fields (which would not produce a good IDF, though at the moment this IDF is ignored). Some users also add fields simply for sorting. So an IDF for documents is impossible with Lucene. You can only use e.g. catch-all fields (which are always a good idea for non-fielded searches, because ORing all fields together is slower than just indexing the same terms a second time in a catch-all field); e.g. a "contents" field contains all terms from "title", "subject", and "mailtext", as an example for emails. But the IDF for BM25F could be taken from the "contents" field even when searching only the title.
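
Roughly what I mean at index time (a sketch only; the field names, the title/subject/mailtext strings, and the IndexWriter writer are just examples):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = new Document();
doc.add(new Field("title",    title,    Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("subject",  subject,  Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("mailtext", mailtext, Field.Store.NO,  Field.Index.ANALYZED));
// index the same text a second time into one catch-all field; its docFreq and
// length statistics can then stand in for document-level values
doc.add(new Field("contents", title + " " + subject + " " + mailtext,
                  Field.Store.NO, Field.Index.ANALYZED));
writer.addDocument(doc);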

          Otis Gospodnetic added a comment -

          Joaquin - could you please explain what you mean by "Saturate the effect of frequency with k1"? Thanks.

Joaquin Perez-Iglesias added a comment - edited

          Yes sorry.

Basically, what we are trying to do is constrain the effect of the raw frequency (saturate the frequency).
In Lucene this is carried out with the square root of the frequency; another classical approach is to use the log. With both approaches we avoid giving a linear 'importance' to the frequency.

BM25 is a bit trickier: it parametrises the 'saturation' of the frequency with a parameter k1, using the equation weight(t)/(weight(t)+k1). Usually k1 is fixed at 2, but it can be tuned per collection.
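
Just to illustrate the saturation: with k1 = 2, the value weight/(weight + k1) grows with strongly diminishing returns (a bare sketch of the formula, not library code):

// how k1 = 2 saturates the (normalized) frequency: w / (w + k1)
float k1 = 2f;
for (float w : new float[] {1, 2, 5, 10, 100}) {
  System.out.println(w + " -> " + (w / (w + k1)));
}
// prints roughly 0.33, 0.50, 0.71, 0.83, 0.98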

(Uwe) Regarding the IDF issue, I believe that the most correct approach (in theoretical terms) would be to use the docFreq on the fields where the user wants to search, but I don't think this can be done.
For example, if we have indexed 3 fields F1, F2, F3 and the user wants to search on F1 and F2, there is no way to compute the docFreq across both fields. With a catch-all field we have the docFreq over all fields.

So maybe the best available approach would be to use IDF per field. What do you think?

          Michael McCandless added a comment -

          Could someone summarize what's missing in Lucene's index format, to allow a query's scorer to efficiently compute BM25?

          It looks like these two are important for BM25F:

          • How many times does term T occur in all fields for doc D?
          • How many documents contain term T in any of their fields? (for computing document level IDF)

          Is there anything else?

          Only simple boolean queries based on terms are supported (with operators or, and, not). For instance it does not support PhraseQuery.

          This is concerning – is there no way to score a PhraseQuery in BM25F?

          Simon Willnauer added a comment -

          Joaquin,

          For example if we have indexed with 3 fields. F1, F2, F3, and the user want to search on F1, and F2 there is no way to compute docFreq in both fields. With a catch-all field we have docFreq for all fields.

Can you explain this in a little more detail? From my understanding, if I use a BooleanQuery a:x1 AND b:x2, TermWeight will calculate the IDF for Term(a, x1) and Term(b, x2). Am I missing something?

Joaquin Perez-Iglesias added a comment - edited

Yes, you are right; what I meant is related to multifield queries. If you search a:F1^F2, the right approach would be to compute IDF with docFreq(a, F1^F2), which in my understanding cannot be done.

          If I'm right Lucene does weight(a)*idf(a,F1) + weight(a)*idf(a,F2), and the correct approach would be weight(a)*idf(a,F1^F2).

That's the reason why Uwe (and I) suggested using IDF per field in the previous case and, if the query is executed on each field, using a kind of catch-all field to compute docFreq over all fields.

          (Michael)
In summary it would be nice to have:

1. docFreq at the document level, something like "int docFreq(term, doc_id)" returning the number of documents where the term occurs; but if that is not possible, a catch-all field will be enough.
2. The collection average document length and the collection average field length (for each field).

I don't think we need "How many times does term T occur in all fields for doc D"; frequency is needed per field, not per document.

I don't know much about the implementation of PhraseQuery, but I think it should be possible to implement BM25F for it (and any other query type), as long as the frequency and docFreq of the phrase/terms are available.

At this point it is not supported in the patch, and I don't see any reason why it couldn't be implemented, though I don't really know exactly how to do it.

          Michael McCandless added a comment -

          I think that should be possible to implement BM25F for it

          Ahh OK I just misunderstood – BM25 can score PhraseQuery; it's just that the current patch doesn't implement that.

          1. docFreq at document level, something like "int docFreq(term, doc_id)" and return the number of documents where term occurs, but if it is not possible a catch-all field will be enough.

          OK, catch all seems like an OK starting point. I wonder if we could enable storing terms dict but not postings... then we could store catch all just for the terms stats, so we wouldn't waste disk space. Though merging gets tricky, since we'd have to walk postings for all fields (or at least all involved in BM25F) in parallel, re-computing the catch-all stats.

          2. The Collection Average Document Length and Collection Average Field Length (per each field).

          Lucene doesn't store/compute this today... we can easily compute these stats for newly created segments, and record in the segments file, but then recomputing them during segment merging with deletions gets tricky. We could just take the linear approximate avg with deletions, but that may end up being too approximate, so we could instead make a dedicated posting list, which would be properly merged, but we'd then have to re-walk to compute the stats for the newly merged segment.

          Grant Ingersoll added a comment -

          I haven't looked at the patch yet, but...

          Should we take just a small step back and consider what it would take to actually make scoring more pluggable instead of just thinking about how best to integrate BM25? In other words, someone else has also donated an implementation of the Axiomatic Retr. Function. Much like BM25, I think it also requires avg. doc length, as does (I believe) language modeling and some other approaches. Of course, we need to do this in a way that doesn't hurt performance for the default case.

          I'm also curious if anyone has compared BM25 w/ a Lucene similarity that uses a different length normalization factor? I've seen many people use a different len. norm with good success, but it isn't necessarily for everyone.

          Robert Muir added a comment -

          In other words, someone else has also donated an implementation of the Axiomatic Retr. Function.

          I've never been able to get that scoring function to do anything more than be consistently worse than the default Lucene formula. I tried at least 3 test collections with it...

          I'm also curious if anyone has compared BM25 w/ a Lucene similarity that uses a different length normalization factor? I've seen many people use a different len. norm with good success, but it isn't necessarily for everyone.

          Yes in the image posted here, I tried modifying length normalization with SweetSpot etc as others have done in the past. For this corpus I was unable to improve it in this way.

          But maybe I made some mistakes in both these cases, so anyone feel free to try this themselves.

          Grant Ingersoll added a comment -

          Yes in the image posted here, I tried modifying length normalization with SweetSpot etc as others have done in the past. For this corpus I was unable to improve it in this way.

          Yeah, can't speak for SweetSpot, but there are other approaches too that don't favor shorter docs all the time.

          Robert Muir added a comment -

          Yeah, can't speak for SweetSpot, but there are other approaches too that don't favor shorter docs all the time.

          This is why it would be interesting if someone came up with a different modification (preferably one that isn't corpus-specific tuning) that actually works for that one corpus.

It's in the OpenRelevance SVN; anyone can try.

Robert Muir added a comment - edited

          Joaquin, have you seen this paper: http://doc.rero.ch/lm.php?url=1000,43,4,20091218142456-GY/Dolamic_Ljiljana_-_When_Stopword_Lists_Make_the_Difference_20091218.pdf

It's of interest how they modified BM25's idf formula slightly to improve results when no stopword list is used. I'm curious what you think about this, as it looks like a potential improvement for people not using stopwords (multilingual situations, etc.).

edit: here is the quote, for simplicity:

Using the original idf formula idf = log[(n - df_j + 0.5)/(df_j + 0.5)], we have noticed that when the underlying term t_j occurs in more than half of the documents (df_j > n/2), the resulting idf value would be negative, and the final document score also could be negative. As a means of estimating idf, we therefore suggest a new variant defined as idf = log{1 + [(n - df_j + 0.5)/(df_j + 0.5)]}.
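
In code the two variants are simply (plain Java, just the formulas from the quote; the method names are mine):

// original Robertson-style idf: goes negative when df > n/2
double idfOriginal(int n, int df) {
  return Math.log((n - df + 0.5) / (df + 0.5));
}

// variant suggested in the paper: the added 1 keeps the value positive
double idfNonNegative(int n, int df) {
  return Math.log(1.0 + (n - df + 0.5) / (df + 0.5));
}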
          
          Joaquin Perez-Iglesias added a comment -

It is a consequence of the logarithm: you can get negative numbers, and a negative score doesn't make much sense. As far as I know this version of IDF is pretty theoretical and based on the binary independence retrieval (BIR) model; it transforms the product of probabilities into a summation of logarithms. Anyway, it is quite usual to add 1 to the value before applying the logarithm to avoid situations like the previous one.

In my opinion it should be added to the patch. It doesn't hurt, and it helps.

This is clearly explained on Wikipedia: http://en.wikipedia.org/wiki/Okapi_BM25

Just a quote from Wikipedia:

          Please note that the above formula for IDF shows potentially major drawbacks when using it for terms appearing in more than half of the corpus documents. These terms' IDF is negative, so for any two almost-identical documents, one which contains the term and one which does not contain it, the latter will possibly get a larger score. This means that terms appearing in more than half of the corpus will provide negative contributions to the final document score. This is often an undesirable behavior, so many real-world applications would deal with this IDF formula in a different way:

          • Each summand can be given a floor of 0, to trim out common terms;
          • The IDF function can be given a floor of a constant ε, to avoid common terms being ignored at all;
          • The IDF function can be replaced with a similarly shaped one which is non-negative, or strictly positive to avoid terms being ignored at all.
          Vinay Setty added a comment -

I am trying to compile the BM25 classes for Lucene using the source code in http://nlp.uned.es/~jperezi/Lucene-BM25/jar/models.jar, but it won't compile with a later version of Lucene, i.e. lucene-3.0.1. I see that the interfaces org.apache.lucene.search.Query, org.apache.lucene.search.Scorer, and org.apache.lucene.search.Weight have all changed. Does anyone have a modified version of the BM25 classes that works with the latest version of Lucene?

          Yuval Feinstein added a comment -

          @Vinay - Please take the LUCENE-2091.patch file from this very JIRA issue, and try to apply it. I modified the code so that it works with Lucene 2.9.1, and therefore should be compatible with Lucene 3.0.1 as well. The only feature missing is the explain() function. If this requires further modifications to work with Lucene 3.0.1, please let me know and I will try to handle the changes.

          Katja Hofmann added a comment -

          Hi Yuval,

          I was trying to apply the patch to both lucene 2.9.2 and 3.0.1. In both cases it failed to compile. The problem is that it assumes a method decodeNormValue(byte) in Similarity. I wasn't able to find a version where this method exists. Do you have any suggestions on how to proceed?

          The compilation errors were:

          [javac] ...code/java/org/apache/lucene/bm25/BM25TermScorer.java:79: cannot find symbol
          [javac] symbol : method decodeNormValue(byte)
          [javac] location: class org.apache.lucene.search.Similarity
          [javac] float fieldNorm = this.getSimilarity().decodeNormValue(this.norm[this.docID()]);
          [javac] ^
          [javac] ...code/java/org/apache/lucene/bm25f/BM25FTermScorer.java:148: cannot find symbol
          [javac] symbol : method decodeNormValue(byte)
          [javac] location: class org.apache.lucene.search.Similarity
          [javac] float fieldNorm = this.getSimilarity().decodeNormValue(norms[i][this.docID()]);
          [javac]

          Yuval Feinstein added a comment -

          Hi Katja.
          I am adding this to my task list.
          I hope to get around to it in a few days.
          In the meantime, I suggest you try replacing the instance method call
          this.getSimilarity().decodeNormValue()
          by the static method
          Similarity.decodeNorm()

          Please let me know if it works; If it does, I will change the patch accordingly.
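
In other words, the failing lines in BM25TermScorer/BM25FTermScorer would change roughly like this (a sketch based on the compile errors above; I have not tested it yet):

// before (does not compile against Lucene 2.9.x / 3.0.x):
float fieldNorm = this.getSimilarity().decodeNormValue(this.norm[this.docID()]);

// after, using the static decoder that exists in those versions:
float fieldNorm = Similarity.decodeNorm(this.norm[this.docID()]);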

          Katja Hofmann added a comment -

          Yuval - thanks for your quick feedback!
          The change you suggested works; now it compiles without problems (I used lucene 3.0.1)

          Best regards,

          Katja

          Robert Muir added a comment -

          Hello, I think there might be a bug in the formula here.

While working on LUCENE-2392, I have implemented several formulas, including BM25 and I(Ne)C2/I(Ne)B2 from the divergence-from-randomness models.

          I first implemented DFR and after verifying the relevance was correct against published results, I noticed BM25 (with this patch) had much lower relevance scores. This didn't make a lot of sense to me.

          So I implemented BM25 from scratch, in terms of the stats framework of LUCENE-2392.

Here are some MAP scores on an English text collection (I have tested several and this is the consistent behavior):

          DefaultSimilarity 0.4333
          BM25 (LUCENE-2091) 0.4131
          BM25 (LUCENE-2392) 0.4395
          BM25* (LUCENE-2392) 0.4401
          DFR I(Ne)C2 0.4453
          DFR I(Ne)B2 0.4475

          Notes:

• BM25* is what is actually attached, as it uses the flooring for IDF, but I also tested without it (plain BM25).
• I used the same defaults as this patch for the k1 and b parameters.

I've attached my implementation for what it's worth; the APIs are pretty meaningless right now, but you can see how I implemented the equation.

          Vinay Setty added a comment -

@Joaquin, I want to use BM25 scoring for evaluating phrase queries. I have created a positional index in Lucene, but I have no clue how to use it for evaluating phrase queries with the BM25 scorer. I had a quick look at the code; by default the queries are Boolean, and I could not find an easy way to make them phrase queries. Any ideas?

          Yuval Feinstein added a comment -

          @Vinay - I have this suggestion. I am unsure whether it will work.
First, I would build a BM25BooleanQuery and use it to create a QueryWrapperFilter qwf.
(See http://lucene.apache.org/java/3_0_0/api/all/org/apache/lucene/search/QueryWrapperFilter.html)
Next, I would create a PhraseQuery and call search(phraseQuery, qwf, 50).
This way, the scorer will first look for matches for the BM25 query, and then look among them for matches for the phrase query.
          Hope this is understandable.
          – Yuval
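
Roughly like this (a sketch only, untested; bm25Query stands for a BM25BooleanQuery built from the user's terms, and searcher is an IndexSearcher):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TopDocs;

// wrap the BM25 query as a filter
QueryWrapperFilter qwf = new QueryWrapperFilter(bm25Query);

// build the phrase query over the same field
PhraseQuery phraseQuery = new PhraseQuery();
phraseQuery.add(new Term("body", "oil"));
phraseQuery.add(new Term("body", "car"));

// phrase matches are restricted to documents that also match the BM25 query
TopDocs hits = searcher.search(phraseQuery, qwf, 50);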

          Ian Holsman added a comment -

          Hi Rob.
Your attachment (BM25SimilarityProvider) seems to rely on some other code (Stats.DocFieldStats and AggregatesProvider), which I guess is part of your DFR patch. Can you provide a pointer to that?

TIA
Also, I'm guessing those rely on LUCENE-2392 and provide an alternate implementation to this one. Should we just close this as a duplicate of LUCENE-2392?

          Robert Muir added a comment -

          your attachment (BM25SimilarityProvider) seems to rely on some other code (Stats.DocFieldStats) & AggregatesProvider .. which I guess is part of your DFR patch.. can you provide a pointer to that.

Yeah, this is from LUCENE-2392. Unfortunately it won't work with the most recent patch there, but both patches are really just exploration to see how we can divide this into subtasks.

For an update, the JIRA issues aren't well linked, but we have actually made pretty good progress on some major portions (imo these are the most interesting):

• Collection term stats: LUCENE-2862
• Per-field similarity: LUCENE-2236
• TermState, to avoid redundant I/O for stats: LUCENE-2694
• Norms cleanup: LUCENE-2771, LUCENE-2846

The next big step is to separate scoring from matching (see the latest patch on LUCENE-2392) so that similarity has full responsibility for all calculations, and so we get full integration with all queries, etc.

          This isn't that complicated: however, in order to do this, we need to first refactor Explanations, so that a Similarity has the capability (and responsibility!) to fully explain its calculations. So I think this is the next issue to resolve before going any further.

          ian towey added a comment -

Not sure if I am using BM25BooleanQuery correctly; I'm getting variation in the number of hits when testing against QueryParser. Are there limitations on the query strings that BM25BooleanQuery can deal with, e.g. "gas OR ((oil AND car) NOT ship)"? The results returned by BM25BooleanQuery seem to be all the docs that don't contain the term "ship" (comparing BM25BooleanQuery vs. QueryParser).

          Erick Erickson added a comment -

Should this be closed as a duplicate of LUCENE-2959?

          Grant Ingersoll added a comment -

          BM25 has been added in another issue.


            People

• Assignee: Unassigned
• Reporter: Yuval Feinstein
• Votes: 3
• Watchers: 10


                Time Tracking

Estimated: 48h
Remaining: 48h
Logged: Not Specified
