[LUCENE-6687] MLT term frequency calculation bug - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 5.2.1, 6.0
Fix Version/s: 5.2.2, 8.1, 9.0
Component/s: core/query/scoring, core/queryparser
Labels: None
Environment: OS X v10.10.4; Solr 5.2.1
Lucene Fields: New, Patch Available

Description

In org.apache.lucene.queries.mlt.MoreLikeThis, the private method retrieveTerms receives a Map of fields, which is essentially a document, although it doesn't have to be an existing doc.

The method contains two for loops, one nested inside the other, and both iterate over the same set of fields.
It therefore goes over the list of fields twice and accumulates the term frequencies from all fields into termFreqMap on each pass.
That effectively doubles the term frequency of every term from the fields that we provide in the MLT QP qf parameter.

The private method retrieveTerms is called from only one public method: the overload of like that receives a Map. That overload always derives the private class member fieldNames from the same fields argument it passes on to retrieveTerms.

Put more simply: by the time retrieveTerms gets called, its parameter fields and the private member fieldNames always contain the same list of fields.
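The effect of the nested loops can be reproduced with a minimal sketch. This is not the Lucene source: the class TermFreqDoublingSketch, its methods, the whitespace tokenizer, and the sample text are all hypothetical, chosen only so that the term accumulator occurs 7 times across title_mlt and pagetext_mlt, as in the report.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative stand-in only: NOT the Lucene source. All names are hypothetical.
class TermFreqDoublingSketch {

    // Made-up document in which "accumulator" occurs 7 times across both fields.
    static Map<String, List<String>> sampleDocument() {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("title_mlt", Arrays.asList("accumulator glass"));
        fields.put("pagetext_mlt", Arrays.asList(
                "accumulator accumulator accumulator how",
                "accumulator accumulator accumulator angry"));
        return fields;
    }

    // Naive whitespace tokenizer standing in for a real analyzer.
    static void addTermFrequencies(String text, Map<String, Integer> termFreqMap) {
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) {
                termFreqMap.merge(token, 1, Integer::sum);
            }
        }
    }

    // Shape described in the report: the outer loop walks the field names while
    // the inner loop walks every field again, so each field is counted once
    // per field name.
    static Map<String, Integer> nestedLoops(Map<String, List<String>> fields) {
        Map<String, Integer> termFreqMap = new TreeMap<>();
        for (String fieldName : fields.keySet()) {         // stands in for fieldNames
            for (List<String> values : fields.values()) {  // the same fields again
                for (String value : values) {
                    addTermFrequencies(value, termFreqMap);
                }
            }
        }
        return termFreqMap;
    }

    // A single pass over the fields counts each occurrence exactly once.
    static Map<String, Integer> singleLoop(Map<String, List<String>> fields) {
        Map<String, Integer> termFreqMap = new TreeMap<>();
        for (List<String> values : fields.values()) {
            for (String value : values) {
                addTermFrequencies(value, termFreqMap);
            }
        }
        return termFreqMap;
    }
}
```

Calling nestedLoops(sampleDocument()) gives accumulator a frequency of 14, while singleLoop(sampleDocument()) gives 7. In this sketch the inflation factor equals the number of field names, which is two here, matching the doubling described above.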

Here's the proof:
These are the final results of the calculation:

And this is the actual thread_id:TID0009 document from which those values were derived (fields title_mlt and pagetext_mlt):

Now, let's further test this hypothesis by seeing MLT QP in action from the AdminUI.
Let's try to find docs that are More Like doc TID0009.
Here's the interesting part, the query:

q={!mlt qf=pagetext_mlt,title_mlt mintf=14 mindf=2 minwl=3 maxwl=15}TID0009

We just saw, in the last image above, that the term accumulator appears 7 times in the TID0009 doc, but its TF was calculated as 14.
By using mintf=14, we say that, when calculating similarity, we don't want to consider terms that appear fewer than 14 times in TID0009 (counting the terms from fields title_mlt and pagetext_mlt together).
I added the term accumulator to only one other document (TID0004), where it appears only once, in the field title_mlt.
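The threshold arithmetic behind this test can be sketched as follows, assuming that mintf keeps a term only when its frequency is at least the threshold, which is how the paragraph above describes it. The class MinTermFreqSketch and the helper passesMinTf are hypothetical names, not part of Lucene or Solr.

```java
// Hypothetical helper illustrating the mintf cut-off; not Lucene or Solr code.
class MinTermFreqSketch {

    // A term is considered only if it occurs at least minTf times.
    static boolean passesMinTf(int termFreq, int minTf) {
        return termFreq >= minTf;
    }
}
```

With the inflated frequency, passesMinTf(14, 14) is true, so accumulator is still considered at mintf=14. With the real count, passesMinTf(7, 14) would be false, and passesMinTf(14, 15) is false as well, which is why mintf=15 is the natural next value to try.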

Let's see what happens when we use mintf=15:

I should probably mention that multiple fields (qf) work only because I applied the patch from SOLR-7143.

Bug, no?

Attachments

- buggy-method-usage.png, 20/Jul/15 09:14, 381 kB, Marko Bonaci
- LUCENE-6687.patch, 28/Jan/19 10:02, 14 kB, Alessandro Benedetti
- LUCENE-6687.patch, 27/Jan/19 16:18, 14 kB, Alessandro Benedetti
- LUCENE-6687.patch, 31/May/18 15:43, 3 kB, Alessandro Benedetti
- LUCENE-6687.patch, 20/Jul/15 09:43, 2 kB, Marko Bonaci
- solr-mlt-tf-doubling-bug.png, 20/Jul/15 09:14, 414 kB, Marko Bonaci
- solr-mlt-tf-doubling-bug-results.png, 20/Jul/15 09:14, 272 kB, Marko Bonaci
- solr-mlt-tf-doubling-bug-verify-accumulator-mintf14.png, 20/Jul/15 09:14, 498 kB, Marko Bonaci
- solr-mlt-tf-doubling-bug-verify-accumulator-mintf15.png, 20/Jul/15 09:49, 336 kB, Marko Bonaci
- terms-accumulator.png, 20/Jul/15 09:14, 101 kB, Marko Bonaci
- terms-angry.png, 20/Jul/15 09:14, 99 kB, Marko Bonaci
- terms-glass.png, 20/Jul/15 09:14, 97 kB, Marko Bonaci
- terms-how.png, 20/Jul/15 09:14, 97 kB, Marko Bonaci

Issue Links

links to

GitHub Pull Request #389

People

Assignee: Tommaso Teofili

Reporter: Marko Bonaci

Votes: 2

Watchers: 8

Dates

Created: 20/Jul/15 09:10

Updated: 28/Aug/22 14:39

Resolved: 10/May/19 10:25

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 1.5h