Lucene - Core
LUCENE-893

Increase buffer sizes used during searching

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 2.1
    • Fix Version/s: None
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Spinoff of LUCENE-888.

      In LUCENE-888 we increased buffer sizes that impact indexing and found
      substantial (10-18%) overall performance gains.

      It's very likely that we can also gain some search performance by
      increasing the read buffers in the BufferedIndexInput instances used
      during searching.

      We need to test the performance impact to verify this and then pick a
      good overall default buffer size, being careful not to add too much
      overall heap RAM usage, because a potentially very large number of
      BufferedIndexInput instances are created during searching
      (# segments X # index files per segment).
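
      For context, here is a minimal sketch of the buffering pattern under
      discussion. It is illustrative only and does not mirror the real
      BufferedIndexInput API; it just shows the trade-off: a bigger buffer
      means fewer seek-and-read calls per stream, but more heap per open
      (or cloned) input.

          import java.io.EOFException;
          import java.io.IOException;
          import java.io.RandomAccessFile;

          // Simplified stand-in for a buffered input with a configurable
          // buffer size. Sizes like 1024 or 2048 are the values compared
          // in this issue, not hard-coded Lucene defaults.
          class SimpleBufferedInput {
            private final RandomAccessFile file;  // underlying index file
            private final byte[] buffer;          // one allocation per instance
            private long bufferStart = 0;         // file offset of buffer[0]
            private int bufferLength = 0;         // valid bytes in buffer
            private int bufferPos = 0;            // read position in buffer

            SimpleBufferedInput(RandomAccessFile file, int bufferSize) {
              this.file = file;
              this.buffer = new byte[bufferSize]; // e.g. 1024, 2048, 4096
            }

            byte readByte() throws IOException {
              if (bufferPos >= bufferLength) {
                refill();                         // one seek+read per buffer
              }
              return buffer[bufferPos++];
            }

            private void refill() throws IOException {
              bufferStart += bufferLength;
              file.seek(bufferStart);
              int n = file.read(buffer, 0, buffer.length);
              if (n <= 0) {
                throw new EOFException("read past end of file");
              }
              bufferLength = n;
              bufferPos = 0;
            }
          }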

        Activity

        Doug Cutting added a comment -

        > large number of BufferedIndexInput instances are created during searching
        > (# segments X # index files per segment)

        The biggest factor is frequently the number of terms in the query.

        Michael McCandless added a comment -

        > > large number of BufferedIndexInput instances are created during searching
        > > (# segments X # index files per segment)
        >
        > The biggest factor is frequently the number of terms in the query.

        Ahh because we clone the freq/prox streams per term in the query... OK.

        Michael Busch added a comment -

        I ran some performance tests with the same setup I used for LUCENE-866:

        • 1.2 GB index, optimized, compound format, documents from Wikipedia
        • 50,000 queries, each query has 3 AND terms, each term has a df>100,
          each query has one or more hits
        • 2.8 GHz Xeon, 4 GB RAM, SCSI HD, Windows Server 2003

        My test simply executes all 50k queries in a row and measures the
        overall time. I used the current trunk version patched with LUCENE-888
        and LUCENE-866 and varied the buffer size of the cfs reader.
        Here are the results:

        1 KB: Time: 51703 ms.
        2 KB: Time: 50672 ms.
        4 KB: Time: 50969 ms.
        8 KB: Time: 57047 ms.
        16 KB: Time: 64547 ms.

        It seems that it doesn't really matter if the buffer size is 1 KB, 2 KB,
        or 4 KB. Above 4 KB the performance decreases significantly.

        Now the same test with a cfs reader buffer of 1 KB and varying buffer
        sizes for the freq stream in SegmentTermDocs:

        1 KB: Time: 51875 ms.
        2 KB: Time: 46828 ms.
        4 KB: Time: 44500 ms.
        8 KB: Time: 50953 ms.
        16 KB: Time: 64485 ms.

        With 4 KB there is a performance improvement of 14%! But considering
        the fact that this stream is cloned for every query term, I think
        that 2 KB is the better choice, still a 10% improvement.

        Now I simply vary the readBufferSize for all buffered inputs:

        1 KB: Time: 51778 ms.
        2 KB: Time: 46172 ms.
        4 KB: Time: 49000 ms.
        8 KB: Time: 52187 ms.
        16 KB: Time: 69562 ms.

        Now the same test with 50k disjunction queries, 3 terms per query:

        1 KB: Time: 288422 ms.
        2 KB: Time: 259672 ms.
        4 KB: Time: 279563 ms.

        2 KB for all input buffers seems to be a good compromise. It's about
        10% faster than 1 KB for both types of queries.

        Questions are:

        • Can we afford the increased memory consumption? (A rough estimate
          is sketched below.)
        • Is 2 KB also the best choice on other systems?
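
        As a rough, back-of-the-envelope illustration of the memory question,
        the snippet below multiplies an assumed number of open inputs by the
        candidate buffer sizes. All of the counts (segments, files per
        segment, query terms, cloned streams per term) are made up for the
        example and are not taken from the benchmark above.

          public class BufferHeapEstimate {
            public static void main(String[] args) {
              int segments = 20;              // assumed open segments
              int filesPerSegment = 8;        // assumed index files per segment
              int queryTerms = 10;            // assumed terms in a query
              int clonedStreamsPerTerm = 2;   // e.g. freq + prox clones per term

              long baseInputs = (long) segments * filesPerSegment;
              long clonedInputs = (long) segments * queryTerms * clonedStreamsPerTerm;
              long totalInputs = baseInputs + clonedInputs;

              for (int bufferSize : new int[] {1024, 2048, 4096}) {
                long heapKb = totalInputs * bufferSize / 1024;
                System.out.println(totalInputs + " inputs x " + bufferSize
                    + " B buffers = " + heapKb + " KB of read buffers");
              }
            }
          }

        With these made-up numbers, going from 1 KB to 2 KB buffers roughly
        doubles the buffer footprint from ~560 KB to ~1.1 MB per open
        searcher, which is the kind of cost/benefit the questions above are
        asking about.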
        Eks Dev added a comment -

        Michael,
        how many documents were in the index?
        I assume that where postings are longer, a bigger buffer brings more
        benefit, especially for queries on dense terms.

        I am bringing this up because optimizing for the "general case" is
        hard or impossible due to too many scenarios and influences. Having
        said that:

        1. Maybe we constrain things a bit, e.g. set the number of docs in the
        index to 10 million as a "golden standard" target for optimization
        (a high number makes sense, since smaller indexes are already fast
        enough; 10% faster than stunningly fast is just not all that
        relevant)... maybe there are other ideas about what could be
        considered.
        2. Make it configurable, so everybody can tweak it; you never know
        what HW or file system one uses...

        Yonik Seeley added a comment -

        Do you have any results for non-compound format?

        Perhaps CFS may have bigger gains because all CSIndexInput share the same base FSIndexInput, hence CFS needs to call file.seek() more often, even for sequential reads on a particular CSIndexInput?

        Michael Busch added a comment -

        > how many documents were in the index?

        ~1.1 million

        > I assume that where postings are longer, a bigger buffer brings
        > more benefit, especially for queries on dense terms.

        Yes and no... of course for very short posting lists bigger buffers
        don't speed up things. But on the other hand, on long posting
        lists it is more likely that more skips are performed, so it is also
        questionable if increasing the buffer size always helps here.

        > I am bringing this up because optimizing for the "general case" is
        > hard or impossible due to too many scenarios and influences.

        Yes, I definitely agree. I know that my tests are very specific
        to my hardware, OS, queries, documents... Here 2 KB is the magic
        number, maybe on other systems it's different. It'd be good if
        others could run some tests too on other systems.

        > 1. Maybe we constrain things a bit, e.g. set the number of docs in
        > the index to 10 million as a "golden standard" target for optimization

        Having constraints is desirable. But maybe it might prevent others
        from running tests? It takes time to get 10M docs, build indexes
        with different settings, get good queries...

        > 2. Make it configurable, so everybody can tweak it; you never know
        > what HW or file system one uses...

        Yes, we could think about adding static getters/setters for
        readBufferSize and writeBufferSize.
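
        Purely as a sketch of what those setters could look like (this is not
        an existing Lucene API; the names and the 1024 defaults are
        placeholders, with 1 KB being the read-buffer baseline used in the
        measurements above):

          // Hypothetical global configuration for the buffer sizes discussed
          // in this issue; names and defaults are placeholders only.
          public final class BufferSettings {
            private static volatile int readBufferSize = 1024;
            private static volatile int writeBufferSize = 1024;

            private BufferSettings() {}

            public static int getReadBufferSize() { return readBufferSize; }

            public static void setReadBufferSize(int size) {
              if (size <= 0) {
                throw new IllegalArgumentException("buffer size must be positive: " + size);
              }
              readBufferSize = size;
            }

            public static int getWriteBufferSize() { return writeBufferSize; }

            public static void setWriteBufferSize(int size) {
              if (size <= 0) {
                throw new IllegalArgumentException("buffer size must be positive: " + size);
              }
              writeBufferSize = size;
            }
          }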

        Michael Busch added a comment -

        > Do you have any results for non-compound format?

        I converted the index to non-cfs format and reran the same test,
        varying the readBufferSize (i.e., the buffer size for all input
        streams).

        Results for conjunction queries:

        1 KB: Time: 52422 ms.
        2 KB: Time: 47718 ms.
        4 KB: Time: 50078 ms.
        8 KB: Time: 54719 ms.

        Results for disjunction queries:

        1 KB: Time: 291000 ms.
        2 KB: Time: 266110 ms.
        4 KB: Time: 258844 ms.
        8 KB: Time: 279812 ms.

        The results look similar to compound format here. Again 2 KB looks
        like the best value.

        robert engels added a comment -

        Some food for thought:

        A couple of runs of XBench on hardware that differs radically in
        terms of raw performance show a nearly 4x performance improvement
        using 256k blocks during sequential access. For random reads the
        numbers are closer to 20x.

        The trick is determining how much sequential data is (or should be)
        read - the locality of data for the current query along with future
        queries - since even if Lucene reads extra unneeded data in this run,
        there is a chance that the data will be needed by future queries
        (and will thus already be in the cache).

        It would seem that these numbers show the ideal solution would be to
        vary the buffer size when the engine determines that it is going to
        read a lot of sequential data (e.g. a wide-open range query), and use
        smaller buffer sizes when it expects only a few results.

        Maybe this might shove Lucene down the path where the index is
        optimized so that common query terms are always put in a separate
        segment/index, providing a high degree of locality to optimize
        reading. Maybe there is some academic research in this area?

        Disk Test 81.23
        Sequential 81.55
        Uncached Write 80.69 33.63 MB/sec [4K blocks]
        Uncached Write 80.94 33.15 MB/sec [256K blocks]
        Uncached Read 77.68 12.30 MB/sec [4K blocks]
        Uncached Read 87.48 35.35 MB/sec [256K blocks]
        Random 80.92
        Uncached Write 62.67 0.94 MB/sec [4K blocks]
        Uncached Write 89.93 20.28 MB/sec [256K blocks]
        Uncached Read 89.01 0.59 MB/sec [4K blocks]
        Uncached Read 89.93 18.51 MB/sec [256K blocks]

        Disk Test 48.34
        Sequential 47.83
        Uncached Write 39.10 16.30 MB/sec [4K blocks]
        Uncached Write 59.73 24.46 MB/sec [256K blocks]
        Uncached Read 38.72 6.13 MB/sec [4K blocks]
        Uncached Read 64.56 26.08 MB/sec [256K blocks]
        Random 48.87
        Uncached Write 35.51 0.53 MB/sec [4K blocks]
        Uncached Write 46.00 10.37 MB/sec [256K blocks]
        Uncached Read 66.61 0.44 MB/sec [4K blocks]
        Uncached Read 59.06 12.15 MB/sec [256K blocks]

        Paul Elschot added a comment -

        I think the different results of 26 May 2007 for conjunction queries and disjunction queries may be caused by the use of TermScorer.skipTo() in conjunctions and TermScorer.next() in disjunctions.

        That points to different optimal buffer sizes for conjunctions (smaller because of the skipping) and for disjunctions (larger because all postings are going to be needed).

        LUCENE-430 is about reducing term buffer size for the case when the buffer is not going to be used completely because of the small number of documents containing the term.

        In all, I think it makes sense to allow the (conjunction/disjunction)Scorer to choose the maximum buffer size for the term, and let the term itself choose a lower value when it needs less than that.

        Another way to promote sequential reading for disjunction queries is to process all their terms sequentially, i.e. one term at a time. In Lucene this is currently done by Filters for prefix queries and ranges. Unfortunately this cannot be done when the combined frequency of the terms in each document is needed. In that case DisjunctionSumScorer could be used, with larger buffers on the terms that are contained in many documents.
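
        A minimal sketch of that idea, assuming a hypothetical helper (the
        bytes-per-posting figure and the method names are invented for
        illustration, not taken from Lucene):

          // The scorer passes the largest buffer it is willing to pay for
          // (small for conjunctions that skip a lot, large for disjunctions
          // that read every posting); the term then caps it by what its
          // posting list can actually use, in the spirit of LUCENE-430.
          final class TermBufferPolicy {
            private static final int ASSUMED_BYTES_PER_POSTING = 2; // rough guess
            private static final int MIN_BUFFER_SIZE = 128;         // arbitrary floor

            static int chooseBufferSize(int maxBufferSize, int docFreq) {
              long estimatedBytes = (long) docFreq * ASSUMED_BYTES_PER_POSTING;
              long capped = Math.min((long) maxBufferSize,
                                     Math.max(MIN_BUFFER_SIZE, estimatedBytes));
              return (int) capped;
            }
          }

        For example, a term with df=50 would get a tiny buffer even if the
        scorer allows 4 KB, while a dense term in a disjunction would get the
        full maximum.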

        Paul Elschot added a comment -

        The last case is also the one in which scoring docs out of order (when that is allowed) using BooleanScorer is faster than using DisjunctionSumScorer. This option is already available, but it could have a bigger impact when term buffer sizes are chosen closer to optimal.

        Trejkaz added a comment -

        Someone else on my team did some benchmarks for a query which was slow for us.

        We have a tree of items (one item per document), and given N items, we walk the path from the root to the item until finding an item with certain properties (the expected average is around 3 queries per item). All the queries are against the same field (our unique ID field).

        Timings of this process for 60,000 items:

        Buffer size (bytes)   Time
        1024                  1015 s
        2048                   540 s
        4096                   335 s
        8196                   281 s
        16384                  278 s
        32768                  284 s
        65536                  322 s

        Erick Erickson added a comment -

        SPRING_CLEANING_2013 We can reopen if necessary. Think this code has been extensively re-worked anyway.


          People

          • Assignee: Unassigned
          • Reporter: Michael McCandless
          • Votes: 2
          • Watchers: 2
