[LUCENE-5637] Scaling scale function - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: 4.8
Component/s: None
Labels:
- patch
- performance

Lucene Fields:

New, Patch Available

Description

The existing scale() function scores every document in the index in order to calculate its scale constant. This performs poorly in Solr on very large indexes or with costly scoring mechanisms such as geo distance.

I have developed a patch that lets the scale function score only the documents that match the given filters, which improves its performance.
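
The idea of computing the scale constant only over filtered documents can be sketched roughly as follows. This is a minimal illustration, not the patch itself: the class FilteredScaleSketch, its method names, the java.util.BitSet used as the filter, and the IntToDoubleFunction standing in for a costly scorer are all assumptions made for the example.

```java
import java.util.BitSet;
import java.util.function.IntToDoubleFunction;

/** Hypothetical sketch: the source min/max come only from documents accepted by a filter. */
final class FilteredScaleSketch {

    /** Scores only the documents whose bit is set in matching and returns {min, max}. */
    static double[] minMax(IntToDoubleFunction costlyScorer, BitSet matching) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        // Walk the set bits only, so non-matching documents are never scored.
        for (int doc = matching.nextSetBit(0); doc >= 0; doc = matching.nextSetBit(doc + 1)) {
            double score = costlyScorer.applyAsDouble(doc);
            min = Math.min(min, score);
            max = Math.max(max, score);
        }
        return new double[] {min, max};
    }

    /** Maps raw linearly from [srcMin, srcMax] onto [targetMin, targetMax]. */
    static double scale(double raw, double srcMin, double srcMax,
                        double targetMin, double targetMax) {
        if (srcMax == srcMin) {
            return targetMin; // assumption: a degenerate source range collapses to targetMin
        }
        return (raw - srcMin) / (srcMax - srcMin) * (targetMax - targetMin) + targetMin;
    }
}
```

With a filter that matches only a small fraction of the index, the costly scorer in this sketch runs once per matching document instead of once per document in the index.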

For test queries involving two scale operations, one scaling the result of keyword scoring and the other scaling the result of geo distance scoring, on an index with ~2 million documents, query time improved from ~400 ms with the vanilla scale to ~190 ms with the new scale. A similar query using no scaling ran in ~90 ms. (Each enhanced scale function added to the query appeared to add about 50 ms of processing.)
e.g. scaled query: q = scale(keywords, 0, 90) and scale(geo, 0, 10)
e.g. unscaled query: q = keywords and geo
In both cases fq includes keywords and geo.

In order to accomplish this goal I had to introduce a couple of changes:
1) In the IndexSearcher.search method, scorers are created and then used to score on a per-AtomicReaderContext basis. I changed this so that all scorers are created before any scoring is done, which gives the scale function an opportunity to observe the entire index before it is asked to score anything.
2) Introduced a new property on the Bits interface that indicates whether or not the bits provide constant-time access. The reason is given in 3).
3) FilterSet used to return null when asked for its bits because it did not have any; it had only an iterator. This was a problem when trying to make scale score only the documents matching the filter. A new Bits implementation, LazyIteratorBackedBits, was therefore added to expose an iterator as Bits. It advances the iterator on demand when asked about a document and uses an OpenBitSet to keep track of what it has advanced beyond. Once the iterator is exhausted, it provides constant-time answers like any other Bits.
4) Introduced a function on the ValueSource interface that allows a Bits to be passed in for filtering purposes.
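
The behaviour described in 2) and 3) can be sketched without any Lucene dependency. The class below is a hypothetical stand-in, not the code from the patch: a PrimitiveIterator.OfInt over ascending document ids plays the role of the filter's iterator, java.util.BitSet replaces OpenBitSet, and the method name isConstantTime() is an assumed name for the new constant-time property.

```java
import java.util.BitSet;
import java.util.PrimitiveIterator;

/** Hypothetical stand-in for LazyIteratorBackedBits; names and types are assumptions. */
final class LazyIteratorBackedBitsSketch {
    private final PrimitiveIterator.OfInt docs; // must yield doc ids in ascending order
    private final BitSet seen = new BitSet();   // doc ids the iterator has returned so far
    private final int length;
    private int advancedTo = -1;                // highest doc id returned by the iterator
    private boolean exhausted = false;

    LazyIteratorBackedBitsSketch(PrimitiveIterator.OfInt ascendingDocs, int maxDoc) {
        this.docs = ascendingDocs;
        this.length = maxDoc;
    }

    /** Answers whether index matches, advancing the iterator only as far as needed. */
    boolean get(int index) {
        while (!exhausted && advancedTo < index) {
            if (docs.hasNext()) {
                advancedTo = docs.nextInt();
                seen.set(advancedTo);
            } else {
                exhausted = true;
            }
        }
        // Everything at or below advancedTo is already recorded in seen.
        return seen.get(index);
    }

    int length() {
        return length;
    }

    /** True once the iterator is exhausted and every answer is a plain bit lookup. */
    boolean isConstantTime() {
        return exhausted;
    }
}
```

In this sketch, a lookup behind the current iterator position is a direct bit test, while a lookup ahead of it costs as many iterator steps as are needed to reach or pass the requested document.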

This was originally developed against Solr 4.2, but I have ported it to Solr 4.8. There is one failing unit test, AnalyzingInfixSuggesterTest.testRandomNRT, which is related to code that has been added in the interim. I have not been able to figure out why this test fails. All other tests pass.

In relation to implementation detail 1) above, the introduction of LeafCollectors in trunk (LUCENE-5527) has caused something of an issue. It no longer seems possible to create multiple scorers without immediately scoring on that LeafCollector. This may be related to the encapsulation of the Collector.setNextReader() method, which was very useful for this purpose.

Attachments

- Lucene-5637.patch, 01/May/14 20:34, 34 kB, Chris Russell

People

Assignee: Unassigned
Reporter: Chris Russell
Votes: 4
Watchers: 5

Dates

Created: 01/May/14 20:26
Updated: 28/Aug/22 14:06