Solr
  1. Solr
  2. SOLR-5894

Speed up high-cardinality facets with sparse counters

    Details

      Description

      Multiple performance enhancements to Solr String faceting.

      • Sparse counters, switching the constant time overhead of extracting top-X terms with time overhead linear to result set size
      • Counter re-use for reduced garbage collection and lower per-call overhead
      • Optional counter packing, trading speed for space
      • Improved distribution count logic, greatly improving the performance of distributed faceting
      • In-segment threaded faceting
      • Regexp based white- and black-listing of facet terms
      • Heuristic faceting for large result sets

      Currently implemented for Solr 4.10. Source, detailed description and directly usable WAR at http://tokee.github.io/lucene-solr/

      This project has grown beyond a simple patch and will require a fair amount of co-operation with a committer to get into Solr. Splitting into smaller issues is a possibility.

      1. SOLR-5894.patch
        103 kB
        Toke Eskildsen
      2. SOLR-5894.patch
        102 kB
        Toke Eskildsen
      3. SOLR-5894_test.zip
        53 kB
        Toke Eskildsen
      4. SOLR-5894.patch
        97 kB
        Toke Eskildsen
      5. sparse_2000000docs_fc_cutoff_20140403-145412.png
        11 kB
        Toke Eskildsen
      6. SOLR-5894_test.zip
        52 kB
        Toke Eskildsen
      7. SOLR-5894.patch
        108 kB
        Toke Eskildsen
      8. sparse_5000000docs_20140331-151918_single.png
        13 kB
        Toke Eskildsen
      9. sparse_5000000docs_20140331-151918_multi.png
        13 kB
        Toke Eskildsen
      10. SOLR-5894_test.zip
        48 kB
        Toke Eskildsen
      11. SOLR-5894.patch
        97 kB
        Toke Eskildsen
      12. SOLR-5894_test.zip
        45 kB
        Toke Eskildsen
      13. sparse_50510000docs_20140328-152807.png
        13 kB
        Toke Eskildsen
      14. SOLR-5894.patch
        75 kB
        Toke Eskildsen
      15. SOLR-5894_test.zip
        45 kB
        Toke Eskildsen
      16. SOLR-5894.patch
        72 kB
        Toke Eskildsen
      17. SOLR-5894.patch
        17 kB
        Toke Eskildsen
      18. author_7M_tags_1852_logged_queries_warmed.png
        10 kB
        Toke Eskildsen
      19. SOLR-5894.patch
        15 kB
        Toke Eskildsen

        Activity

        Hide
        Manuel Lenormand added a comment -

        I'll be pleased to be updated about the 4.10.x migration, I'll be watching the issue. We have a 40 shards collection, 3TB/100M docs. As you can notice from JIRA issues I've opened, scalability and performance is our main concern and it's nice seeing others dealing with harder use cases than ours and writing about it.

        Show
        Manuel Lenormand added a comment - I'll be pleased to be updated about the 4.10.x migration, I'll be watching the issue. We have a 40 shards collection, 3TB/100M docs. As you can notice from JIRA issues I've opened, scalability and performance is our main concern and it's nice seeing others dealing with harder use cases than ours and writing about it.
        Hide
        Toke Eskildsen added a comment -

        Manuel: Thank you. We use it in out main search system (single shard, 50GB/15M docs) as well as our net archive index (25 shard, 22TB/7b docs). It has worked very well for us without any discovered (or at least reported) problems.

        After a hiatus (read: Other projects had higher priority), I have begun hacking on the project again. I am untangling the sparse code from the vanilla Solr faceting code and plan to migrate to 4.10.x after that. 5.0/5.1 is another matter as we do not use that in production anywhere.

        Show
        Toke Eskildsen added a comment - Manuel: Thank you. We use it in out main search system (single shard, 50GB/15M docs) as well as our net archive index (25 shard, 22TB/7b docs). It has worked very well for us without any discovered (or at least reported) problems. After a hiatus (read: Other projects had higher priority), I have begun hacking on the project again. I am untangling the sparse code from the vanilla Solr faceting code and plan to migrate to 4.10.x after that. 5.0/5.1 is another matter as we do not use that in production anywhere.
        Hide
        Andrew Jackson added a comment -

        I'm out of the office with intermittent email access, but I will be back in the office on Monday the 13th of April. For urgent technical matter please contact Roger Coram or Gil Hoggarth, and for urgent non-technical matters, Helen Hockx-Yu.

        ******************************************************************************************************************
        Experience the British Library online at www.bl.uk<http://www.bl.uk/>
        The British Library’s latest Annual Report and Accounts : www.bl.uk/aboutus/annrep/index.html<http://www.bl.uk/aboutus/annrep/index.html>
        Help the British Library conserve the world's knowledge. Adopt a Book. www.bl.uk/adoptabook<http://www.bl.uk/adoptabook>
        The Library's St Pancras site is WiFi - enabled
        *****************************************************************************************************************
        The information contained in this e-mail is confidential and may be legally privileged. It is intended for the addressee(s) only. If you are not the intended recipient, please delete this e-mail and notify the postmaster@bl.uk<postmaster@bl.uk> : The contents of this e-mail must not be disclosed or copied without the sender's consent.
        The statements and opinions expressed in this message are those of the author and do not necessarily reflect those of the British Library. The British Library does not take any responsibility for the views of the author.
        *****************************************************************************************************************
        Think before you print

        Show
        Andrew Jackson added a comment - I'm out of the office with intermittent email access, but I will be back in the office on Monday the 13th of April. For urgent technical matter please contact Roger Coram or Gil Hoggarth, and for urgent non-technical matters, Helen Hockx-Yu. ****************************************************************************************************************** Experience the British Library online at www.bl.uk< http://www.bl.uk/ > The British Library’s latest Annual Report and Accounts : www.bl.uk/aboutus/annrep/index.html< http://www.bl.uk/aboutus/annrep/index.html > Help the British Library conserve the world's knowledge. Adopt a Book. www.bl.uk/adoptabook< http://www.bl.uk/adoptabook > The Library's St Pancras site is WiFi - enabled ***************************************************************************************************************** The information contained in this e-mail is confidential and may be legally privileged. It is intended for the addressee(s) only. If you are not the intended recipient, please delete this e-mail and notify the postmaster@bl.uk< postmaster@bl.uk > : The contents of this e-mail must not be disclosed or copied without the sender's consent. The statements and opinions expressed in this message are those of the author and do not necessarily reflect those of the British Library. The British Library does not take any responsibility for the views of the author. ***************************************************************************************************************** Think before you print
        Hide
        Manuel Lenormand added a comment -

        Very impressive work. The idea behind it and implementation.

        The feature is still running on your production? Any other insights since? Have you ported it to 4.10.x? It seems I have a similar issue, benchmarking with this patch looks interesting

        Show
        Manuel Lenormand added a comment - Very impressive work. The idea behind it and implementation. The feature is still running on your production? Any other insights since? Have you ported it to 4.10.x? It seems I have a similar issue, benchmarking with this patch looks interesting
        Hide
        Toke Eskildsen added a comment -

        Status update

        The code seems to be in working state for Solr 4.8 and 4.9 for Field Cache faceting (single- and multi-value) and DocValue faceting (single- and multi-value) for Strings. It really needs testing by someone else than me, so that the validity of the responses and the speed upc can be verified. I am willing to port this to trunk if a committer is willing to work on getting the patch into Solr, but until then I will stick to stable versions as that is what we use locally.

        Sparse faceting introduces some architectural changes that needs to be addressed.

        1) The core sparse counting itself is surprisingly non-invasive. It could be a standalone patch. However, this only works really well in a non-distributed setting and has less effect in a distributed one.

        2) Avoiding re-allocation of counter structures by maintaining a pool of structures lowers the minimum faceting time and lowers the need for GC. It is lowered quite a lot, I might add, as faceting is normally one of the big GC-triggers, so I would strongly recommend this feature.
        Such a pool is very much like a cache and as such must play nice with index updates and other caching structures in Solr. The current state of the counter pool is that it works well, but is implemented independently of the usual caches. It is probably better to latch on to the standard caching mechanisms in Solr, but I have no experience with that part of the code base. The code for the pool itself is fairly simple and can be implemented as an experimental feature by setting the default pool size to 0.

        3) Fast distributed faceting is about speeding up the secondary phase of distrubuted faceting: The fine count for specific terms. This is achieved by replacing the normal search-for-each-requested-term approach by a full faceted count approach, where the counts for all terms are calculated but only the counts for the specified terms are returned. This requires either a separation of counting and extraction in stock Solr code or an extension to the count-extract code block so that is can handle both the case where top-X terms are wanted and where specific terms are requested. Currently the sparse code use the latter approach, but the former is a much cleaner architecture. Both solutions does require some re-factoring and are a bit tricky to get right.

        4) A further speed up for distributed faceting can be achieved by upgrading the sparse pools to real caches, so that they contain a mix of empty counters (to avoid GC) and filled counters (caching of counts). As the counting part of the second phase is always exactly the same as for the first phase, cache hit-rate should be very high with distribution. This is not currently implemented and is a bit tricky to get to work well.
        I recommend that this improvement is postponed, in order to keep the current patch manageable.

        The sparse principle would also work for Field Cache per Segment faceting of Strings, but no code has been written for this yet. I recommend that this is also postponed, to get things moving.

        Show
        Toke Eskildsen added a comment - Status update The code seems to be in working state for Solr 4.8 and 4.9 for Field Cache faceting (single- and multi-value) and DocValue faceting (single- and multi-value) for Strings. It really needs testing by someone else than me, so that the validity of the responses and the speed upc can be verified. I am willing to port this to trunk if a committer is willing to work on getting the patch into Solr, but until then I will stick to stable versions as that is what we use locally. Sparse faceting introduces some architectural changes that needs to be addressed. 1) The core sparse counting itself is surprisingly non-invasive. It could be a standalone patch. However, this only works really well in a non-distributed setting and has less effect in a distributed one. 2) Avoiding re-allocation of counter structures by maintaining a pool of structures lowers the minimum faceting time and lowers the need for GC. It is lowered quite a lot, I might add, as faceting is normally one of the big GC-triggers, so I would strongly recommend this feature. Such a pool is very much like a cache and as such must play nice with index updates and other caching structures in Solr. The current state of the counter pool is that it works well, but is implemented independently of the usual caches. It is probably better to latch on to the standard caching mechanisms in Solr, but I have no experience with that part of the code base. The code for the pool itself is fairly simple and can be implemented as an experimental feature by setting the default pool size to 0. 3) Fast distributed faceting is about speeding up the secondary phase of distrubuted faceting: The fine count for specific terms. This is achieved by replacing the normal search-for-each-requested-term approach by a full faceted count approach, where the counts for all terms are calculated but only the counts for the specified terms are returned. This requires either a separation of counting and extraction in stock Solr code or an extension to the count-extract code block so that is can handle both the case where top-X terms are wanted and where specific terms are requested. Currently the sparse code use the latter approach, but the former is a much cleaner architecture. Both solutions does require some re-factoring and are a bit tricky to get right. 4) A further speed up for distributed faceting can be achieved by upgrading the sparse pools to real caches, so that they contain a mix of empty counters (to avoid GC) and filled counters (caching of counts). As the counting part of the second phase is always exactly the same as for the first phase, cache hit-rate should be very high with distribution. This is not currently implemented and is a bit tricky to get to work well. I recommend that this improvement is postponed, in order to keep the current patch manageable. The sparse principle would also work for Field Cache per Segment faceting of Strings, but no code has been written for this yet. I recommend that this is also postponed, to get things moving.
        Hide
        Toke Eskildsen added a comment -

        Forked Lucene/Solr on GitHub for better project management: https://github.com/tokee/lucene-solr . Experimental branches for sparse faceting are lucene_solr_4_8_sparse and lucene_solr_4_9_sparse and will contain the latest code. The patch above for 4.7.1 is a little behind the GitHub branches, as it does not have speed up for second phase (getting counts for specific terms) of distributed faceting.

        Show
        Toke Eskildsen added a comment - Forked Lucene/Solr on GitHub for better project management: https://github.com/tokee/lucene-solr . Experimental branches for sparse faceting are lucene_solr_4_8_sparse and lucene_solr_4_9_sparse and will contain the latest code. The patch above for 4.7.1 is a little behind the GitHub branches, as it does not have speed up for second phase (getting counts for specific terms) of distributed faceting.
        Hide
        Toke Eskildsen added a comment -

        Changed the caching of counters to be field-specific, which works a lot better for multi-field faceting requests. Introduced facet.sparse.maxtracked to limit the maximum count for any facet value: This limits memory consumption at the cost of accuracy.

        Patch is for Solr 4.7.1.

        Show
        Toke Eskildsen added a comment - Changed the caching of counters to be field-specific, which works a lot better for multi-field faceting requests. Introduced facet.sparse.maxtracked to limit the maximum count for any facet value: This limits memory consumption at the cost of accuracy. Patch is for Solr 4.7.1.
        Hide
        Toke Eskildsen added a comment -

        Changed mincount in distributed faceting to not force mincount=0, resulting in sparse faceting also bringing speedup to distributed faceting.

        Added support for maxTracked, optionally setting a ceiling for the count for any term in a given facet, thereby lowering memory requirements and increasing speed, at the cost of accuracy.

        Still on Solr 4.7.1.

        Show
        Toke Eskildsen added a comment - Changed mincount in distributed faceting to not force mincount=0, resulting in sparse faceting also bringing speedup to distributed faceting. Added support for maxTracked, optionally setting a ceiling for the count for any term in a given facet, thereby lowering memory requirements and increasing speed, at the cost of accuracy. Still on Solr 4.7.1.
        Hide
        Toke Eskildsen added a comment -

        Updated patch to solr 4.7.1 and added optional use of PackedInts for counters (saves memory but affects performance).

        Show
        Toke Eskildsen added a comment - Updated patch to solr 4.7.1 and added optional use of PackedInts for counters (saves memory but affects performance).
        Hide
        Toke Eskildsen added a comment -

        Updated patch to support adjustable heuristic guessing of result size and selection of implementation based on the guess. Conservative guesses means early fallback to standard Solr faceting speed, overly optimistic guesses means a performance penalty on wrong guess. Correct guesses means that faceting with sparse counting is either faster or the same speed as standard Solr faceting. See attached graph sparse_2000000docs_fc_cutoff_20140403-145412.png for that scenario.

        Show
        Toke Eskildsen added a comment - Updated patch to support adjustable heuristic guessing of result size and selection of implementation based on the guess. Conservative guesses means early fallback to standard Solr faceting speed, overly optimistic guesses means a performance penalty on wrong guess. Correct guesses means that faceting with sparse counting is either faster or the same speed as standard Solr faceting. See attached graph sparse_2000000docs_fc_cutoff_20140403-145412.png for that scenario.
        Hide
        Toke Eskildsen added a comment -

        Patch-update: Sparse counting is now possible for field cache, with and without DocValues, single- as well as multi-field. Segment based field cache faceting support is not implemented yet.

        See the attached graphs sparse_5000000docs_20140331-151918_multi.png and sparse_5000000docs_20140331-151918_single.png for experiments with 5M values, multi-value (3 values/doc) and single-value. Note that the y-axis is now logarithmic. It seems DocValues benefits nearly as much from sparse counters as non-DocValues.

        Also note that these measurements are from an artificially large sparse tracker (100% overhead) and as such are not representative for a realistic setup.

        Show
        Toke Eskildsen added a comment - Patch-update: Sparse counting is now possible for field cache, with and without DocValues, single- as well as multi-field. Segment based field cache faceting support is not implemented yet. See the attached graphs sparse_5000000docs_20140331-151918_multi.png and sparse_5000000docs_20140331-151918_single.png for experiments with 5M values, multi-value (3 values/doc) and single-value. Note that the y-axis is now logarithmic. It seems DocValues benefits nearly as much from sparse counters as non-DocValues. Also note that these measurements are from an artificially large sparse tracker (100% overhead) and as such are not representative for a realistic setup.
        Hide
        Toke Eskildsen added a comment -

        Patch with several bugfixes and updated test scripts.

        The chart sparse_50510000docs_20140328-152807.png shows results form a test run on 50M documents / 150M unique tags. The black line is standard Solr, the red is a sparse counter with 100% overhead, the cyan is an attempt of compromising with 8% sparse overhead.

        The performance hit for passing the 8% cut-off is huge. One way to avoid it could be to make a numhits-based guess of the expected number of unique tags, when choosing facet implementation. The consequence of a wrong guess would "just" be poorer performance, with the worst-case being the max of the black and the cyan line for any given number of unique tags in the search result.

        Show
        Toke Eskildsen added a comment - Patch with several bugfixes and updated test scripts. The chart sparse_50510000docs_20140328-152807.png shows results form a test run on 50M documents / 150M unique tags. The black line is standard Solr, the red is a sparse counter with 100% overhead, the cyan is an attempt of compromising with 8% sparse overhead. The performance hit for passing the 8% cut-off is huge. One way to avoid it could be to make a numhits-based guess of the expected number of unique tags, when choosing facet implementation. The consequence of a wrong guess would "just" be poorer performance, with the worst-case being the max of the black and the cyan line for any given number of unique tags in the search result.
        Hide
        Toke Eskildsen added a comment -

        Bugfix and further instrumentation of the sparse implementation. The attached ZIP contains scripts for artificial testing of performance (requires bash, curl & gnuplot).

        Show
        Toke Eskildsen added a comment - Bugfix and further instrumentation of the sparse implementation. The attached ZIP contains scripts for artificial testing of performance (requires bash, curl & gnuplot).
        Hide
        Toke Eskildsen added a comment -

        Slightly updated patch where statistics are returned as part of the facet result if facet.sparse.stats=true. Scripts for easy testing of performance will follow later.

        Show
        Toke Eskildsen added a comment - Slightly updated patch where statistics are returned as part of the facet result if facet.sparse.stats=true. Scripts for easy testing of performance will follow later.
        Hide
        David Smiley added a comment -

        Awesome work Toke!

        Show
        David Smiley added a comment - Awesome work Toke!
        Hide
        Toke Eskildsen added a comment -

        Results from a small scale test of 2 threads sending logged queries against our ~50GB, 11M document library index, faceting on 'author' with 7M unique values. The test was executed multiple times and the last (and thus fully warmed) results were used. Only searches with facet results were counted.

        Results for this particular test were median response times of 36ms for standard facet counting and 24ms for sparse. See the attached PNG for graphs. Your Mileage Will Vary.

        Show
        Toke Eskildsen added a comment - Results from a small scale test of 2 threads sending logged queries against our ~50GB, 11M document library index, faceting on 'author' with 7M unique values. The test was executed multiple times and the last (and thus fully warmed) results were used. Only searches with facet results were counted. Results for this particular test were median response times of 36ms for standard facet counting and 24ms for sparse. See the attached PNG for graphs. Your Mileage Will Vary.
        Hide
        Toke Eskildsen added a comment -

        Simple proof of concept patch for sparse counting for field-based faceting on high-cardinality multi-value fields.

        Apply patch, build Solr, open an index with a proper high-cardinality field (> 100.000 unique values) such as author or title. Standard field based faceting works as always.

        Activate sparse counting by appending facet.sparse=true to the request. Control the amount of extra space used for the sparse structure with facet.sparse.fraction (default is 0.025, which seems to work well).

        Tested against Solr 4.6.1. Not tested against 4.7.

        Show
        Toke Eskildsen added a comment - Simple proof of concept patch for sparse counting for field-based faceting on high-cardinality multi-value fields. Apply patch, build Solr, open an index with a proper high-cardinality field (> 100.000 unique values) such as author or title. Standard field based faceting works as always. Activate sparse counting by appending facet.sparse=true to the request. Control the amount of extra space used for the sparse structure with facet.sparse.fraction (default is 0.025, which seems to work well). Tested against Solr 4.6.1. Not tested against 4.7.

          People

          • Assignee:
            Unassigned
            Reporter:
            Toke Eskildsen
          • Votes:
            9 Vote for this issue
            Watchers:
            25 Start watching this issue

            Dates

            • Created:
              Updated:

              Development