Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-5637

Scaling scale function

Details

    • Improvement
    • Status: Open
    • Minor
    • Resolution: Unresolved
    • None
    • 4.8
    • None
    • New, Patch Available

    Description

      The existing scale() function examines the scores of all documents in the index in order to calculate its scale constant. This does not perform well in solr on very large indexes or with costly scoring mechanisms such as geo distance.

      I have developed a patch that allows the scale function to only score documents that match the given filters, thus improving performance of the scale function.

      For test queries involving two scale operations where one was scaling the result of keyword scoring and the other was scaling the result of geo distance scoring on an index with ~2 million documents, query time was improved from ~400 ms with vanilla scale to ~190 ms with new scale. A similar query using no scaling ran in ~90 ms. (Each enhanced scale function added to the query appeared to add about 50 ms of processing)
      e.g. scaled query - q = scale(keywords, 0, 90) and scale(geo, 0, 10)
      e.g. unscaled query - q = keywords and geo
      In both cases fq includes keywords and geo.

      In order to accomplish this goal I had to introduce a couple of changes:
      1) In the indexsearcher.search method where scorers are created and then used to score on a per-atomicreadercontext basis I had to make it so that all scorers would be created before any scoring was done. This was so that the scale function would have an opportunity to observe the entire index before being asked to score something.
      2) Introduced a new property to the Bits interface that indicates whether or not the bits provide constant-time access. Why? Read on.
      3) FilterSet used to return Null when asked for its bits because it did not have any, it had an iterator. This was an issue when trying to make it so that scale would only score documents matching the filter. Thus a new bits implementation was added (LazyIteratorBackedBits) that could expose an iterator as a Bits implementation. It advances the iterator on-demand when asked about a document and uses an OpenBitSet to keep track of what it has advanced beyond. Thus once the iterator is exhausted it provides constant-time answers like any other Bits.
      4) Introduced a function on the ValueSource interface to allow a Bits to be passed in for filtering purposes.

      This was originally developed against Solr 4.2 but I have ported it to Solr 4.8. There is one failing unit test related to code that has been added in the interim, AnalyzingInfixSuggesterTest.testRandomNRT. I have not been able to figure out why this test fails. All other tests pass.

      In relation to implementation detail 1) above, the introduction of LeafCollectors in trunk has caused somewhat of an issue. ( LUCENE-5527 ) It seems to no longer be possible to create multiple scorers without immediately scoring on that LeafCollector. This may be related to the encapsulation of the Collector.setNextReader() method which was very useful for this purpose.

      Attachments

        1. Lucene-5637.patch
          34 kB
          Chris Russell

        Activity

          People

            Unassigned Unassigned
            selah Chris Russell
            Votes:
            4 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated: