Lucene - Core
LUCENE-687

Performance improvement: Lazy skipping on proximity file

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.1
    • Component/s: core/index
    • Labels:
      None

      Description

      Hello,

      I'm proposing a patch here that changes org.apache.lucene.index.SegmentTermPositions to avoid unnecessary skips and reads on the proximity stream. Currently, a call to next() or seek() that moves to a document in the freq file also moves the prox pointer to the posting list of that document. But this is only necessary if actual positions have to be retrieved for that particular document.

      Consider, for example, a phrase query with two terms: the freq pointer for term 1 has to move to document x to answer the question of whether the term occurs in that document. But only if term 2 also matches document x do the positions have to be read, to figure out whether term 1 and term 2 appear next to each other in document x and thus satisfy the query.

      A move to the posting list of a document can be quite expensive. The stream has to be skipped to the last skip point before that document, and then the documents between the skip point and the desired document have to be scanned, which means that the VInts of all positions of those documents have to be read and decoded.

      An improvement is to move the prox pointer lazily to a document only if nextPosition() is called. This will become even more important in the future when the size of the proximity file increases (e.g. by adding payloads to the posting lists).

      My patch implements this lazy skipping. All unit tests pass.
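      The lazy-skipping idea can be sketched in plain Java. This is an illustrative model only, not the actual SegmentTermPositions code; all class, field, and method names here are hypothetical, and the pointer assignment stands in for the real seek-and-scan on the prox stream.

      ```java
      // Sketch of lazy prox skipping: next()/skipTo() only records where the
      // prox stream *should* be; the expensive move happens on nextPosition().
      final class LazyProxSkipper {
          private long targetProxPointer = -1;   // deferred target in the prox file
          private long currentProxPointer = -1;  // where the stream actually is
          private int skipsPerformed = 0;        // counts real (expensive) moves

          /** Called for every candidate doc in the freq file: cheap, no prox I/O. */
          void moveTo(long proxPointer) {
              targetProxPointer = proxPointer;
          }

          /** Called only when positions are actually needed: do the deferred skip. */
          long nextPosition() {
              if (currentProxPointer != targetProxPointer) {
                  currentProxPointer = targetProxPointer;  // stands in for seek + scan
                  skipsPerformed++;
              }
              return currentProxPointer;
          }

          int skipsPerformed() {
              return skipsPerformed;
          }
      }
      ```

      With 500 candidate documents of which only a handful need positions, only those few expensive moves occur; every other document costs just a field assignment.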

      I also attach a new unit test that works as follows:
      Using a RAMDirectory, an index is created and test docs are added. Then the index is optimized to make sure it has only a single segment. This is important, because IndexReader.open() returns an instance of SegmentReader if there is only one segment in the index. The proxStream instance of SegmentReader is package protected, so it is possible to set proxStream to a different object. I am using a class called SeeksCountingStream that extends IndexInput so that it can count the number of invocations of seek().

      The testcase then searches the index using the PhraseQuery "term1 term2". It is known how many documents match that query, so the testcase can verify that seek() on the proxStream is not called more often than the number of search hits.
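      The seek-counting wrapper idea looks roughly like this. It is a simplified sketch: it does not extend Lucene's IndexInput, and the interface and class names are hypothetical stand-ins modeled on the SeeksCountingStream described above.

      ```java
      // A wrapper that counts seek() invocations on a delegate stream, in the
      // spirit of the SeeksCountingStream used by the test.
      interface PositionedInput {
          void seek(long pos);
          long position();
      }

      final class SeekCountingInput implements PositionedInput {
          private final PositionedInput delegate;
          private int seekCount = 0;

          SeekCountingInput(PositionedInput delegate) {
              this.delegate = delegate;
          }

          @Override
          public void seek(long pos) {
              seekCount++;          // count every invocation before delegating
              delegate.seek(pos);
          }

          @Override
          public long position() {
              return delegate.position();
          }

          int seekCount() {
              return seekCount;
          }
      }

      /** Trivial in-memory delegate, enough to demonstrate the wrapper. */
      final class InMemoryInput implements PositionedInput {
          private long pos;

          @Override
          public void seek(long p) {
              pos = p;
          }

          @Override
          public long position() {
              return pos;
          }
      }
      ```

      A test can then run a query against the wrapped stream and assert that seekCount() never exceeds the number of hits.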

      Example:
      Number of docs in the index: 500
      Number of docs that match the query "term1 term2": 5

      Invocations of seek on prox stream (old code): 29
      Invocations of seek on prox stream (patched version): 5

      • Michael
      1. lazy_prox_skipping.patch
        9 kB
        Michael Busch
      2. lazy_prox_skipping2.patch
        2 kB
        Michael Busch

        Activity

        Yonik Seeley added a comment -

        Awesome stuff Michael!

        This looks like it should speed up most phrase searches. Do you have any performance numbers?
        Is the slowdown measurable in the worst case of a phrase query with both terms in every document?

        I'll try and review it more in-depth soon.

        Michael Busch added a comment -

        Hi Yonik,

        thanks for the quick reply! I'm going to do performance tests and will give you some numbers soon.

        Michael Busch added a comment -

        I ran some performance experiments with phrase searches. For every query I measured the search time as well as the number of invocations of IndexInput.readVInt(). While the search time is very specific to my test environment, the number of VInts should tell us how much I/O we save, independent of the actual hardware.

        Test environment:
        2x Intel Xeon CPU 2.80GHz
        4GB RAM
        280GB SCSI HD
        Windows Server 2003

        My index contains about 460,000 documents from different websites like www.cnn.com, www.joelonsoftware.com, ..., as well as German sites like www.heise.de.
        The search times below are average values over 10 searches.

        Tests:

        • Query: "the der"
          This should be a hard query, because "the" appears in almost every English document whereas "der" appears in almost every German document.
          Number of hits: 15
          Execution time in ms (old, new, improvement): 651.3, 472.1, 28%
          Number of VInts read (old, new, improvement): 25786474, 16007634, 38%
        • Query: "joel on software"
          "joel" only appears in a small subset of documents, "on" appears in all English documents, and "software" can appear in both German and English documents.
          Number of hits: 1613
          Execution time in ms (old, new, improvement): 33.0, 20.1, 39%
          Number of VInts read (old, new, improvement): 950243, 256481, 73%
        • Query: "much much more"
          This is an example where we don't save as much I/O, because all three terms are likely to appear in most English documents.
          Number of hits: 76
          Execution time in ms (old, new, improvement): 228.2, 215.5, 6%
          Number of VInts read (old, new, improvement): 3580288, 3067693, 14%

        I think these numbers look pretty good. In the future, when we hopefully have proximity scoring, every search would benefit from this improvement. (I actually have a running version of proximity scoring in my local code and verified this).

        Did you have some time to look into the patch yet, Yonik? Let me know if you need more information, please.

        Yonik Seeley added a comment -

        I did some quick synthetic tests of my own.... Java5 -server

        100,000 documents, 25 different terms, prob of term in doc = 1/term_number, sloppy phrase on 2 random terms
        Improvement: -3.1%

        Change term probability to 1/(term_number**2)
        Improvement: 10.0%

        Test only terms 24,25 with term probability=1/term_number (so probabilities were 0.5 and 1.0)
        Improvement: -8.4%

        I think your patch is still worth applying because of the likely distribution of terms, but I just wish that the performance hit were smaller in the worst-case scenario.

        Yonik Seeley added a comment -

        Oh, those synthetic tests were done on a RAMDirectory, so that also reduces the benefits of your patch.

        Yonik Seeley added a comment -

        Reviewed and committed. Thanks Michael!

        Michael Busch added a comment -

        Yonik,

        thanks for committing the patch!
        About the performance tests you did: I think even on a RAMDirectory the patch should not slow down any queries significantly. I did some profiling, and I think the frequent calls to lazySkip() are quite expensive. I attach a small patch that uses a boolean variable needsSkip to check whether lazySkip() actually needs to be called. Could you rerun your experiments with this version?

        All unit tests pass with this patch.
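        The guard described above might look roughly like this. It is an illustrative sketch only; the names follow the discussion rather than the actual patch, and lazySkip() here is a placeholder for the real deferred seek.

        ```java
        // The needsSkip flag replaces a pointer comparison with a single
        // boolean check on the hot nextPosition() path.
        final class GuardedProxSkipper {
            private boolean needsSkip = false;
            private long proxPointer = 0;      // current position in the prox stream
            private long pendingPointer = 0;   // deferred target

            void moveTo(long pointer) {
                pendingPointer = pointer;
                needsSkip = true;              // a deferred skip is now outstanding
            }

            long nextPosition() {
                if (needsSkip) {               // cheap flag test instead of lazySkip()
                    lazySkip();
                    needsSkip = false;
                }
                return proxPointer;
            }

            private void lazySkip() {
                proxPointer = pendingPointer;  // stands in for the expensive seek
            }
        }
        ```

        After the first nextPosition() for a document, subsequent calls take the cheap path until the enumeration moves on to another document.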

        Yonik Seeley added a comment -

        I tried with the boolean, but it made no difference with 1.5 or 1.6 server JVMs. The overhead is probably due to the increased state in the object as well as an unpredictable branch in the inner loop.
        I'm not too worried though... my artificial test isn't really representative of the common use cases of phrase queries on full-text fields, which have more varied term frequencies and more positions per document.

        Michael McCandless added a comment -

        Closing all issues that were resolved for 2.1.


          People

          • Assignee:
            Yonik Seeley
            Reporter:
            Michael Busch
          • Votes:
            1
            Watchers:
            1

            Dates

            • Created:
              Updated:
              Resolved:
