Lucene - Core
LUCENE-5175

Add parameter to lower-bound TF normalization for BM25 (for long documents)

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      In the article "When Documents Are Very Long, BM25 Fails!" a fix for the problem is documented. There was a TODO note in BM25Similarity to add this fix. I will attach a patch that implements the fix shortly.
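For context, the fix from that article (the paper's "BM25L") shifts the length-normalized term frequency by a small delta so that terms in very long documents are not driven toward zero. A minimal sketch of the scoring math, assuming the paper's formulation (names are illustrative, not the patch's):

```python
def bm25l_tf(tf, doc_len, avg_len, k1=1.2, b=0.75, delta=0.5):
    """BM25L tf component: length-normalize tf, then lower-bound
    the normalized value with delta before applying saturation."""
    ctd = tf / (1 - b + b * doc_len / avg_len)  # length-normalized tf
    return (k1 + 1) * (ctd + delta) / (k1 + ctd + delta)

# A term occurring once in a very long document keeps a usable score...
score_long = bm25l_tf(tf=1, doc_len=50_000, avg_len=500)
# ...instead of being driven toward zero, as happens with delta = 0:
score_plain = bm25l_tf(tf=1, doc_len=50_000, avg_len=500, delta=0)
score_short = bm25l_tf(tf=1, doc_len=500, avg_len=500)
```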

      1. LUCENE-5175.patch
        16 kB
        Tom Burton-West

        Activity

        Tom Burton-West created issue -
        Hide
        Tom Burton-West added a comment -

        Patch adds optional parameter delta to lower-bound tf normalization. Attached also are unit tests.

        Still need to add tests of the explanation/scoring for cases 1) no norms, and 2) no delta

        If no delta parameter is supplied, the math works out to the equivalent of the regular BM25 formula as far as the score, but I think there is an extra step or two to get there. I'll see if I can get some benchmarks running to see if there is any significant performance issue.
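That delta = 0 equivalence can be checked numerically: with delta = 0 the lower-bounded formula reduces algebraically to plain BM25's tf saturation, just reached through the extra normalize-then-divide step Tom mentions (function names here are illustrative, not from the patch):

```python
def bm25_tf(tf, doc_len, avg_len, k1=1.2, b=0.75):
    # standard BM25 tf saturation
    norm = 1 - b + b * doc_len / avg_len
    return tf * (k1 + 1) / (tf + k1 * norm)

def bm25_delta_tf(tf, doc_len, avg_len, k1=1.2, b=0.75, delta=0.0):
    # lower-bounded variant; with delta = 0 the two agree,
    # via the extra division to form the normalized tf first
    ctd = tf / (1 - b + b * doc_len / avg_len)
    return (k1 + 1) * (ctd + delta) / (k1 + ctd + delta)

for tf in (1, 3, 10):
    for dl in (100, 500, 5000):
        assert abs(bm25_tf(tf, dl, 500) - bm25_delta_tf(tf, dl, 500)) < 1e-9
```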

        Tom Burton-West made changes -
        Attachment: LUCENE-5175.patch [ 12597752 ]
        Robert Muir added a comment -

        I can benchmark your patch with luceneutil Tom.
        I know this thing is sensitive for some reason.

        Really if there is a performance issue, worst case we can just call it BM25L or something?

        Thanks for doing this!

        Tom Burton-West added a comment -

        Thanks Robert,

        In the article, they claim that the change doesn't have a performance impact. On the other hand, I'm not familiar enough with Java performance to be able to eyeball it, and it looks to me like we added one or more floating point operations, so it would be good to benchmark, especially since the scoring alg gets run against every hit, and we might have millions of hits for a poorly chosen query. (And if we switch to page-level indexing we could have hundreds of millions of hits).

        I was actually considering making it a subclass instead of just modifying BM25Similarity, so that it would be easy to benchmark, and if it turns out that there is a significant perf difference, that users could choose which implementation to use. I saw that computeWeight in BM25Similarity was final and decided I didn't know enough about why this is final to either refactor to create a base class, or change the method in order to subclass.

        Is luceneutil the same as lucene benchmark? I've been wanting to learn how to use lucene benchmark for some time.

        Tom

        Robert Muir added a comment -

        Hi Tom:

        I know for a fact I tried to remove the crazy "cache" (I created the monster) that this thing creates, and removing it always hurt performance, for example.

        But I don't think we need to worry too much because:

        1. We should benchmark it the way you have it first and just see what we are dealing with.
        2. If there is a problem, we could try to open it up to subclassing better; maybe it even improves the API.
        3. There is also the option of just having specialized SimScorers for the delta=0 case.

        So I am confident we will find a good solution.
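The third option above can be sketched as a factory that hands back a specialized scorer when delta == 0, so the common case pays no extra floating point work (a Python sketch of the dispatch idea only; the real code would be SimScorer subclasses inside BM25Similarity):

```python
def make_scorer(k1, b, delta):
    """Return a per-hit scoring function; the delta == 0 case gets
    the plain BM25 path with no extra arithmetic."""
    if delta == 0:
        def score(freq, dl_norm):  # dl_norm = 1 - b + b * dl/avgdl
            return freq * (k1 + 1) / (freq + k1 * dl_norm)
    else:
        def score(freq, dl_norm):
            ctd = freq / dl_norm
            return (k1 + 1) * (ctd + delta) / (k1 + ctd + delta)
    return score
```

In Java this would more likely be two SimScorer implementations chosen once when the scorer is created, rather than a per-call branch.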

        As far as luceneutil we tried creating a README (http://code.google.com/a/apache-extras.org/p/luceneutil/source/browse/README.txt) to get started.

        The basic idea is you pull down 2 different checkouts of lucene-trunk and set up a "competition" between the two. Two options are important here: one sets the similarity for each competitor, the other can disable score comparisons (I haven't yet examined the patch to tell if the scores might differ slightly, e.g. from the order of floating point ops and such).

        But that's typically how I benchmark two Sim impls against each other.

        Tom Burton-West added a comment -

        I wondered about that "crazy cache", in that it makes the implementation dependent on the norms implementation.

        BTW: It looks to me with Lucene's default norms that there are only about 130 or so distinct "document lengths". If there is no boosting going on, the byte value has to get to 124 for a doclength = 1, so there are only 255 - 124 = 131 possible different lengths.

        i=124 norm=1.0,doclen=1.0
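Those numbers line up with a re-implementation of Lucene's SmallFloat.byte315ToFloat decoder (3 mantissa bits, exponent zero-point 15), ported here to Python for illustration: byte 124 decodes to norm 1.0, and with the default norm = 1/sqrt(numTerms) that means doclen = 1/norm^2 = 1.

```python
import struct

def byte315_to_float(b):
    """Port of Lucene's SmallFloat.byte315ToFloat: 3 mantissa bits,
    5 exponent bits, exponent zero-point 15."""
    if b == 0:
        return 0.0
    bits = (b & 0xFF) << (24 - 3)
    bits += (63 - 15) << 24
    return struct.unpack('>f', struct.pack('>I', bits))[0]

norm = byte315_to_float(124)        # 1.0
doclen = 1.0 / (norm * norm)        # default norm = 1/sqrt(numTerms)
```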

        Robert Muir added a comment -

        Yes, unfortunately the crazy cache is currently what makes it as fast as DefaultSimilarity; otherwise it's 25% slower.

        From time to time I definitely upgrade my JVM and run luceneutil to see if these caches can be removed!
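For readers following along: the cache being discussed precomputes the length-normalization denominator once per possible norm byte (256 entries), so per-hit scoring skips the norm decode entirely. A hedged sketch of the idea, not the exact Lucene code:

```python
def build_norm_cache(k1, b, avgdl, decode_len):
    """Precompute k1 * (1 - b + b * dl/avgdl) for each of the 256
    possible norm bytes; decode_len maps a norm byte to a doc length."""
    return [k1 * (1 - b + b * decode_len(i) / avgdl) for i in range(256)]

def score(freq, norm_byte, cache, k1):
    # Per-hit work is one table lookup, one add, one divide.
    return freq * (k1 + 1) / (freq + cache[norm_byte])

# Stand-in decoder for illustration (the real one is SmallFloat-based):
cache = build_norm_cache(1.2, 0.75, avgdl=10.0,
                         decode_len=lambda i: float(i) if i else 1.0)
```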

        As far as the norms, all the "provided" implementations were just set up to be compatible with DefaultSimilarity, so you can test these things out without reindexing and still have general support for index-time boosts and things like that.

        If you don't care about that, you can tweak the similarity to better meet your specific needs (and even choose the other direction, too: to compress them to use < 1 byte/doc: LUCENE-5077)

        Tom Burton-West made changes -
        Comment [ I downloaded the luceneutils benchmark suite and the enwiki data and tried to run the out-of-the-box demo. I need to ask our sysadmins to upgrade python versions on our dev machines.

        Tom ]
        Tom Burton-West added a comment -

        Hi Robert,

        I tried running luceneutils with the default wikimedium10m collection and tasks. I ran it first on the DefaultSimilarity, which shouldn't be affected by the patch to BM25Similarity and it showed about -2.3% difference. I'm guessing there is some inaccuracy in the tests. When I changed DEFAULT_SIMILARITY to BM25Similarity, the worst change was a difference of -8.8%.

        Is there a separate mailing list for questions about luceneutils or should I write to the java-dev list? or directly to Mike or you?

        Tom

        Robert Muir added a comment -

        Tom: there is some variation on the test.

        In my localconstants.py I have the following, which I found reduces variation significantly:

        JAVA_COMMAND = 'java -Xms4g -Xmx4g -server'
        SEARCH_NUM_THREADS = 1
        

        As far as questions, just send them to the dev@lucene list, I think? A lot of committers use this thing, so it's probably your best bet.


  People

  • Assignee: Unassigned
  • Reporter: Tom Burton-West
  • Votes: 0
  • Watchers: 2