Issue Details (XML | Word | Printable)

Key: LUCENE-443
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Yonik Seeley
Reporter: Abdul Chaudhry
Votes: 7
Watchers: 1
Operations

If you were logged in you would be able to see more operations.
Lucene - Java

ConjunctionScorer tune-up

Created: 02/Oct/05 02:24 PM   Updated: 27/Feb/07 06:10 PM
Return to search
Component/s: Search
Affects Version/s: 1.9
Fix Version/s: 2.1

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works Conjunction20060921.patch 2006-09-21 07:17 AM Paul Elschot 5 kB
Java Source File Licensed for inclusion in ASF works ConjunctionScorer.java 2005-10-03 04:22 AM Paul Elschot 4 kB
Java Source File Licensed for inclusion in ASF works ConjunctionScorer.java 2005-10-02 02:32 PM Abdul Chaudhry 4 kB
Environment: Linux, Java 1.5, Large Index with 4 million items and some heavily nested boolean queries

Resolution Date: 17/Oct/06 08:51 PM


 Description  « Hide
I just recently ran a load test on the latest code from lucene , which is using a new BooleanScore and noticed the ConjunctionScorer was crunching through objects , especially while sorting as part of the skipTo call. It turns a linked list into an array, sorts the array, then converts the array back to a linked list for further processing by the scoring engines below.

'm not sure if anyone else is experiencing this as I have a very large index (> 4 million items) and I am issuing some heavily nested queries

Anyway, I decide to change the link list into an array and use a first and last marker to "simulate" a linked list.

This scaled much better during my load test as the java gargbage collector was less - umm - virulent



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Abdul Chaudhry added a comment - 02/Oct/05 02:32 PM
Updated a version of the ConjunctionScorer which uses an array rather than a linked list.
The first(), last() and doc() method calls should probably be in-lined - maybe.

Abdul Chaudhry made changes - 02/Oct/05 02:32 PM
Field Original Value New Value
Attachment ConjunctionScorer.java [ 12314671 ]
Paul Elschot added a comment - 03/Oct/05 12:08 AM
I like performance, so I'm all for this.
One thing about the code: it has a few probably
superfluous casts to (Scorer) from Scorer array elements.

Regards,
Paul Elschot


Paul Elschot added a comment - 03/Oct/05 04:22 AM
Simplified version, removed most casts, inlined first() and last().
Passes all tests here.

Regards,
Paul Elschot


Paul Elschot made changes - 03/Oct/05 04:22 AM
Attachment ConjunctionScorer.java [ 12314681 ]
Abdul Chaudhry added a comment - 04/Oct/05 07:40 AM
Thanks Paul, re-ran load tests here, everything looks good and it looks easier to read as well.
I also checked Arrays.sort, it's a merge sort - guaranteed nlogn.
This all looks squeaky clean.
Cheers,
– Abdul

Paul Elschot added a comment - 11/Oct/05 03:15 AM
The score() method can be simplified to use only the 'i' index. There is no need
to offset the index into the scorers with the 'pos' variable, because all scorers match.

Regards,
Paul Elschot


Abdul Chaudhry added a comment - 11/Oct/05 03:36 PM
ok, this makes sense as the scoring engine runs something like this

while (scorer.next()) {
int doc = scorer.doc();
float scorer = scorer.score();
collector.collect(doc, score);
}

That is, next() will have ordered everything, so that by the time we call the scorer.score() method , everything should be in-order.

Thanks, ill give that a go.

The impression I have with lucene, and correct me if Im wrong, is that complex queries with many terms and clauses have their bottleneck in terms of performance in the ordering phase, that is scorer.next() requires everything to be in-document order and all the scorer sub-engines must comply. Collection is a moot point as you probably have small numbers of hits. However, on the other end of the scale, for queries with one or two terms that have a very high frequency the bottleneck is really in collection, that is the priority queue in collector.collect(),
Essentially this is a sorting issue, somewhat masked and manipulated at various stages.
This looks to me like lucene needs a "Query Plan".


Paul Elschot added a comment - 11/Oct/05 04:44 PM
Once the ConjunctionScorer matches the order of the subqueries/subscorers does not matter
because they all match and a sum score needs to be formed.

Scorer.next() can only tolerate documents not to be in order at top level, where hit collection is done.
At lower levels in nested scorers, Lucene works a document at a time, and there Scorer.skipTo(docNr)
requires that all document numbers are in order. Such skipping is needed for conjunctions.
Since the score value of a document for a query depends on the score value of the subqueries,
at some point the association on a single document must be done.
For conjunctions, skipTo() is used, but for disjunctions this association is done by
a priority queue in the trunk, and a distribution like method in 1.4.3. This distribution method works
somewhat loosely on document order, and is therefore incompatible with skipping.

Regards,
Paul Elschot


Yonik Seeley added a comment - 28/Dec/05 04:48 AM
It seems like more cruft could be removed if support for incremental adding of subscorers was removed.
Is there a reason we can't simplify things and just pass Collection<Scorer> in the constructor?

Grant Ingersoll added a comment - 21/Sep/06 01:09 AM
Yonik, Paul, do either of you know the status on this one? From the looks of it, it hasn't been implemented. It also has the highest number of votes in JIRA, so I thought I would take a look at it. One downside is it is not in patch form, but it also doesn't look to hard to extract the changes, either.

One issue I have with these performance issues is that we don't have a reliable benchmarking suite. I am not a lawyer, but might we be able to use something like http://trec.nist.gov/data/reuters/reuters.html to build a sample benchmark suite? This corpus, plus 100 or so queries could work nicely. Of course, we would have to figure out some way for those interested to get their hands on the data. What do others do for benchmarking?


Paul Elschot added a comment - 21/Sep/06 07:17 AM
Iirc the orginal performance problem was caused by creation of objects in the tight loop
doing skipTo() on al the scorers.

This patch is against current trunk and based on the earlier posted versions of ConjunctionScorer.
which was based (by the first poster) on an existing ConjunctionScorer with an ASL notice,
which is why I could grant the licence to ASF.


Paul Elschot made changes - 21/Sep/06 07:17 AM
Attachment Conjunction20060921.patch [ 12341256 ]
Paul Elschot added a comment - 21/Sep/06 07:20 AM
I just overlooked the grant by Abdul to the ASF.

Repository Revision Date User Message
ASF #465043 Tue Oct 17 20:50:52 UTC 2006 yonik ConjunctionScorer performance increase: LUCENE-443
Files Changed
MODIFY /lucene/java/trunk/src/java/org/apache/lucene/search/ConjunctionScorer.java
MODIFY /lucene/java/trunk/CHANGES.txt

Yonik Seeley added a comment - 17/Oct/06 08:51 PM
Thanks! I just committed this.

Yonik Seeley made changes - 17/Oct/06 08:51 PM
Status Open [ 1 ] Resolved [ 5 ]
Resolution Fixed [ 1 ]
Fix Version/s 2.1 [ 12310854 ]
Assignee Yonik Seeley [ yseeley@gmail.com ]
Michael McCandless added a comment - 27/Feb/07 06:10 PM
Closing all issues that were resolved for 2.1.

Michael McCandless made changes - 27/Feb/07 06:10 PM
Status Resolved [ 5 ] Closed [ 6 ]