We have a 6-node Solr setup (release 4.4.0) with 12 billion "small" documents loaded across 3 collections. The documents have the following fields:
- a_dlng_doc_sto (docvalue long)
- b_dlng_doc_sto (docvalue long)
- c_dstr_doc_sto (docvalue string)
- timestamp_lng_ind_sto (indexed long)
- d_lng_ind_sto (indexed long)
<dynamicField name="*_dstr_doc_sto" type="dstring" indexed="false" stored="true" required="true" docValues="true"/>
<dynamicField name="*_lng_ind_sto" type="long" indexed="true" stored="true"/>
<dynamicField name="*_dlng_doc_sto" type="dlng" indexed="false" stored="true" required="true" docValues="true"/>
...
<fieldType name="dstring" class="solr.StrField" sortMissingLast="true" docValuesFormat="Disk"/>
<fieldType name="dlng" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0" docValuesFormat="Disk"/>
The value of timestamp_lng_ind_sto decides which collection a document goes into.
We execute queries of the following form:
- q=timestamp_lng_ind_sto:[x TO y] AND d_lng_ind_sto:(a OR b OR ... OR n)
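For readers who want to reproduce the query shape, here is a minimal sketch of how such a q parameter could be assembled. The class and method names are hypothetical, the values are made up, and the sketch assumes non-negative long values so that no query escaping is needed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Solr: builds the q parameter in the form shown above.
class FacetQueryBuilder {

    // Assumes non-negative values, so nothing has to be escaped.
    static String buildQuery(long from, long to, List<Long> dValues) {
        String ors = dValues.stream()
                .map(String::valueOf)
                .collect(Collectors.joining(" OR "));
        return "timestamp_lng_ind_sto:[" + from + " TO " + to + "]"
                + " AND d_lng_ind_sto:(" + ors + ")";
    }

    public static void main(String[] args) {
        // Illustrative values only; the real x, y and a..n are not given in this report.
        System.out.println(buildQuery(1000L, 2000L, Arrays.asList(7L, 8L, 9L)));
    }
}
```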
We see very slow response times when a query hits a large number of rows spanning lots of facets, but we only ask for "a few" of those facets.
Here is a concrete example query, to have some concrete numbers to look at, with x and y plus a, b ... n set to values so that:
- The timestamp_lng_ind_sto:[x TO y] part of the search criteria alone hits about 1.7 billion documents (all of them in one of the three collections, which contains 4.5 billion docs, but that is not important)
- The d_lng_ind_sto:(a OR b OR ... OR n) part of the search criteria alone hits about 500000 documents
- The combined search criteria (timestamp_lng_ind_sto AND'ed with d_lng_ind_sto) hit about 200000 documents
The following graph shows response time as a function of <asked-for-facets> (in the query).
Note that response time is high even for "low" <asked-for-facets>, and that it increases fast (but linearly) with <asked-for-facets> up to a point somewhere between 5000 (where response time is close to 1000 secs) and 10000 (where response time is about 5 secs). For values of <asked-for-facets> above 10000, response time stays "low" at 1-10 secs.
From looking at the code and profiling, it is clear that the change to better response time occurs when SimpleFacets.getFacetFieldCounts changes from using getListedTermCounts to using getTermCounts.
The following image shows profiling information during a request with <asked-for-facets> at about 2000.
- SimpleFacets.getListedTermCounts is used (green box)
- 91% of the time spent performing the query is spent in the DocSetCollector constructor (red box). During this concrete query 125000 DocSetCollector objects are created, taking 710 secs in total. Additional investigation shows that the time is spent allocating huge int-arrays for the "scratch" int-array. Several thousand of those DocSetCollector constructors create int-arrays of size above 1 million. That takes time, and also leaves a nice little job for the GC afterwards.
- The actual search-part of the query takes only 0.5% (4 secs) of the combined time executing the query (blue box)
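To put the allocation numbers above in perspective, here is a rough back-of-the-envelope sketch. The 125000 constructor calls, the 710 secs and the array size of 1 million come from the profiling above; the count of 3000 large arrays is only an assumed stand-in for "several thousand", and array headers are ignored.

```java
// Hypothetical helper for rough arithmetic only; not part of Solr.
class ScratchAllocationEstimate {

    // Payload size of an int[] of the given length (object header ignored).
    static long intArrayBytes(long length) {
        return length * Integer.BYTES;
    }

    // Average wall-clock time per constructor call, in milliseconds.
    static double averageMillis(double totalSeconds, long calls) {
        return totalSeconds * 1000.0 / calls;
    }

    public static void main(String[] args) {
        long perArray = intArrayBytes(1_000_000L); // one scratch array of 1 million ints
        long assumedLargeArrays = 3_000L;          // assumption standing in for "several thousand"
        System.out.println("bytes per 1M-int scratch array: " + perArray);
        System.out.println("bytes for the assumed large arrays: " + perArray * assumedLargeArrays);
        System.out.println("avg ms per constructor: " + averageMillis(710.0, 125_000L));
    }
}
```

Each 1-million-entry scratch array is about 4 MB, so a few thousand of them already add up to gigabytes of short-lived garbage within a single request.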
The following image shows profiling information during a request with <asked-for-facets> at about 10000
- SimpleFacets.getTermCounts is used (green box)
- The actual search-part of the query now takes 70% (11 secs) of the combined time executing the query (blue box)
What to do about this?
- I am not sure why there are two paths that SimpleFacets.getFacetFieldCounts can take (getListedTermCounts or getTermCounts), but I am pretty sure there is a good reason. It seems that getListedTermCounts is used when <asked-for-facets> is noticeably lower than the total number of facets hit (I believe it is when <asked-for-facets> * 1.5 + 10 is below the actual number of facets hit)
- One solution could be to just drop the getListedTermCounts path and always use getTermCounts, but that is probably not a good idea, because getListedTermCounts is probably there for a performance reason (in other scenarios)
- The comment above DocSetCollector.scratch says
  // in case there aren't that many hits, we may not want a very sparse
  // bit array. Optimistically collect the first few docs in an array
  // in case there are only a few.
  final int[] scratch;
The comment seems reasonable. But when we look at what values are passed as "smallSetSize" to the DocSetCollector constructor, it is always "maxDoc >> 6" (basically dividing by 64). This value depends on maxDoc and will be high if maxDoc is high. In my case maxDoc is often 50+ million, resulting in "smallSetSize" values of around 1 million or more (that is not "a few"). I very much doubt that you want "smallSetSize" to increase as maxDoc increases. Why not always use a low (fixed or similar) value for "smallSetSize"? Is it ever a good idea to have huge int-arrays for the "scratch" array?
- Another solution would be to never create "scratch"-arrays with size above e.g. 50
- There are probably several other potential solutions
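As an illustration of the capped-scratch idea from the list above, here is a minimal sketch. It is a hypothetical stand-in, not Solr's DocSetCollector: the class name, the java.util.BitSet fallback and the lazy allocation are all assumptions made for the example, and the cap of 50 is simply the value floated above.

```java
import java.util.BitSet;

// Hypothetical stand-in for the collector discussed above; NOT Solr's actual class.
class CappedScratchCollector {
    static final int MAX_SCRATCH = 50; // the cap suggested in the list above

    private final int maxDoc;
    private final int[] scratch;
    private BitSet bits; // allocated lazily, only once the scratch array overflows
    private int count;

    CappedScratchCollector(int maxDoc) {
        this.maxDoc = maxDoc;
        // The report describes the current size as maxDoc >> 6; here it is capped.
        this.scratch = new int[Math.min(maxDoc >> 6, MAX_SCRATCH)];
    }

    void collect(int doc) {
        if (count < scratch.length) {
            scratch[count] = doc;        // cheap path: only a few hits so far
        } else {
            if (bits == null) {
                bits = new BitSet(maxDoc); // fall back to a bit set for larger results
            }
            bits.set(doc);
        }
        count++;
    }

    int size() { return count; }
    int scratchLength() { return scratch.length; }
    boolean usesBitSet() { return bits != null; }

    public static void main(String[] args) {
        int maxDoc = 64_000_000; // illustrative value, not taken from the report
        System.out.println("uncapped smallSetSize (maxDoc >> 6): " + (maxDoc >> 6));
        CappedScratchCollector c = new CappedScratchCollector(maxDoc);
        for (int i = 0; i < 60; i++) {
            c.collect(i * 1000);
        }
        System.out.println("scratch length: " + c.scratchLength()
                + ", hits: " + c.size() + ", bit set used: " + c.usesBitSet());
    }
}
```

One trade-off to keep in mind: an int array of maxDoc >> 6 entries takes maxDoc / 16 bytes, half the maxDoc / 8 bytes of a full bit set, which may be why the size scales with maxDoc. With a small cap, any result larger than the cap pays for the full bit set, so mid-sized results could become more expensive.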
I would really like your opinion on which solution to choose, so that I do not unintentionally break good performance optimizations just because I missed the reasons why the code is as it is today.
Note: I have filed this as a 4.4 issue, because that is the platform I use for my tests etc. But I am sure the problem also exists on 4.5.1 (or whatever the latest 4.x release is).
- is related to SOLR-8922: DocSetCollector can allocate massive garbage on large indexes