SOLR-221
faceting memory and performance improvement

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2
    • Component/s: None
    • Labels:
      None

      Description

      1) compare minimum count currently needed to the term df and avoid unnecessary intersection count
      2) set a minimum term df in order to use the filterCache, otherwise iterate over TermDocs
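The two changes can be sketched as follows. This is a hypothetical Python illustration, not Solr's actual Java code in SimpleFacets; the names `min_count_needed` (the smallest count a term must reach to enter the current top-N) and `min_df` (the filterCache threshold) are invented for the sketch:

```python
def facet_term_count(doc_freq, min_count_needed, min_df,
                     count_via_filter_cache, count_via_term_docs):
    """Decide how (and whether) to count a facet term's intersection."""
    # 1) The intersection count can never exceed the term's df, so if the
    #    df is already below the minimum count needed, skip the
    #    intersection entirely.
    if doc_freq < min_count_needed:
        return 0
    # 2) Only terms with a sufficiently high df go through the filterCache;
    #    rarer terms iterate over TermDocs directly, keeping the cache small.
    if doc_freq >= min_df:
        return count_via_filter_cache()
    return count_via_term_docs()
```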

      Attachments

      1. facet.patch
        12 kB
        Yonik Seeley
      2. facet.patch
        3 kB
        Yonik Seeley

        Activity

        Hoss Man added a comment -

        This bug was modified as part of a bulk update using the criteria...

        • Marked ("Resolved" or "Closed") and "Fixed"
        • Had no "Fix Version" versions
        • Was listed in the CHANGES.txt for 1.2

        The Fix Version for all 39 issues found was set to 1.2, email notification
        was suppressed to prevent excessive email.

        For a list of all the issues modified, search jira comments for this
        (hopefully) unique string: 20080415hossman2

        Yonik Seeley added a comment -

        TODO: after committed, document warming tips due to change #1 "compare minimum count currently needed to the term df and avoid unnecessary intersection count"

        Yonik Seeley added a comment -

        Changed config to a SolrParam facet.enum.cache.minDf
        and added some tests.
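For reference, since it is a SolrParam it can be supplied per request like any other parameter (or per field via the f.&lt;field&gt;. prefix). A hedged sketch of building such a request URL; the field name `cat` and the host/port are illustrative only:

```python
from urllib.parse import urlencode

params = {
    "q": "*:*",
    "facet": "true",
    "facet.field": "cat",            # hypothetical field name
    "facet.enum.cache.minDf": "30",  # terms with df below 30 skip the filterCache
}
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```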

        Yonik Seeley added a comment -

        Yeah, facet.enum.cache.minDf seems reasonable (if a bit long).

        > Might it not be useful to turn off term enum caching when the number of terms was above a certain maximum [...] trade cycles for memory

        If one expects a really high number of terms, I think the right approach is to pick a minDf to cut down the cache size (and trade cycles for memory). Also, Solr doesn't currently know the number of terms in a field unless it's traversed all of them.

        J.J. Larrea added a comment -

        Clearly Solr is going to end up with more than 2 algorithms for computing facets, and there's no reason to think they won't be able to happily coexist in SimpleFacets. And we will surely need additional control parameters even for the 2.5 (with your patch) algorithms now in place. So I think we should establish a convention for separating algorithm-specific parameters so we don't end up with a jumble of top-level parameters.

        So rather than facet.minDfFilterCache, how about:
        facet.enum.cache.minDF (enable term enum cache for terms with docFreq > minDF)
        f.<field>.facet.enum.cache.minDF

        Might it not be useful to turn off term enum caching when the number of terms was above a certain maximum, even if the minDF criterion is met, to trade cycles for memory when neither the field cache nor filter cache is practicable? In that case, it could be:
        facet.enum.cache.maxTerm (enable term enum cache for fields where numTerms <= maxTerm)

        Yonik Seeley added a comment -

        So for configuration, how about a SolrParam of
        facet.minDfFilterCache (can anyone think of a better name?), probably per-field.
        We can defer more complex configuration in order to fit this into Solr 1.2, as long as we don't think this single parameter is a mistake.

        Yonik Seeley added a comment -

        > it might be worth trying to clarify if the performance cliff really results from being optimized or if it's just a result of one of the two traits of an optimized index: being a single segment, having no deletions.

        The large performance differences even when TermDocs weren't used (all fieldCache) strongly suggest that it's a SegmentReader vs MultiReader issue more than deleted docs, since I doubt the larger maxDoc would account for much time. It would be nice to know for sure, though.

        > Minor point: if we're going to add facet config options, I'd prefer they stay as standard SolrParams

        I think we may need to look at it on a per-option basis (but I agree, these look like candidates). Once 1.2 gets out the door, I'll probably get back to my facet cache work, and that will have some parameters that don't make sense to tune per-request (or wouldn't even be possible to).

        Hoss Man added a comment -

        it might be worth trying to clarify if the performance cliff really results from being optimized or if it's just a result of one of the two traits of an optimized index: being a single segment, having no deletions.

        tuning the behavior based on either of those traits is just as easy as tuning based on both traits.

        Minor point: if we're going to add facet config options, I'd prefer they stay as standard SolrParams that can be defaulted in the handler config (and theoretically overridden per request). It just seems cleaner to have all options in one place, and there's not a lot of reason not to when dealing with options that don't *need* to be identical for every request.

        Yonik Seeley added a comment -

        Perhaps minDfFilterCache could be automatically tuned depending on if the index is optimized or not...
        Could allow specification of both in solrconfig.xml perhaps...

        <faceting>
        <minDfFilterCache>0</minDfFilterCache> <!-- always use filterCache in non-optimized index -->
        <minDfFilterCache index="optimized">50</minDfFilterCache> <!-- if optimized, only use filterCache when df>=50 -->
        </faceting>

        Yonik Seeley added a comment -

        The results are slightly surprising.

        I made up an index in which each document contained 4 random numbers between 1 and 500,000.
        This is not the distribution one would expect to see in a real index, but we can still learn much.

        The synthetic index:
        maxDoc=500,000
        numDocs=393,566
        number of segments = 5
        number of unique facet terms = 490,903
        filterCache max size = 1,000,000 entries (more than enough)
        JVM=1.5.0_09 -server -Xmx200M
        System=WinXP, 3GHz P4, hyperthreaded, 1GB dual channel RAM
        facet type = facet.field, facet.sort=true, facet.limit=10
        maximum df of any term = 15
        warming times were not included... queries were run many times and the lowest time recorded.

        Number of documents that match test "base" queries (for example, base query #1 matches 175K docs):
        1) 175000,
        2) 43000
        3) 8682
        4) 2179
        5) 422
        6) 1

        WITHOUT PATCH (milliseconds to facet each base query):
        1578, 1578, 1547, 1485, 1484, 1422

        WITH PATCH (min df comparison w/ term df, minDfFilterCache=0) (all field cache)
        984, 1203, 1391, 1437, 1484, 1420

        WITH PATCH (min df comp, minDfFilterCache=30) (no fieldCache at all)
        1406, 2344, 3125, 3015, 3172, 3172

        CONCLUSION1: the min-df comparison increases faceting speed by 60% when the base query matches many documents. With a real term distribution, the gain could be even greater.

        CONCLUSION2: opting not to use the fieldCache for smaller-df terms can save a lot of memory, but it hurts performance by up to 200% for our non-optimized index.

        CONCLUSION3: using the field cache less can significantly speed up warming time (times not shown, but a full warming of the fieldCache took 33 sec)

        ======== now the same index, but optimized ===========
        WITH PATCH (optimized, min df comparison w/ term df, minDfFilterCache=0) (all field cache)
        172, 312, 485, 578, 610, 656

        WITH PATCH (optimized, min df comp, minDfFilterCache=30) (no fieldCache at all)
        265, 344, 422, 468, 500, 484

        CONCLUSION4: An optimized index increased performance by 200-500%.

        CONCLUSION5: The fact that the all-fieldCache option was significantly faster on an optimized index probably cannot be explained entirely by accurate dfs (no deleted documents to inflate the term df values); it means that just iterating over the terms is much faster in an optimized index (a potential Lucene area to look into).
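The percentages quoted in the conclusions can be checked directly from the timing tables above; a quick sketch (timings copied from the non-optimized and optimized runs):

```python
# Milliseconds to facet each of the 6 base queries, from the tables above.
without_patch  = [1578, 1578, 1547, 1485, 1484, 1422]  # no patch
all_fieldcache = [984, 1203, 1391, 1437, 1484, 1420]   # patch, minDfFilterCache=0
no_fieldcache  = [1406, 2344, 3125, 3015, 3172, 3172]  # patch, minDfFilterCache=30
optimized_fc   = [172, 312, 485, 578, 610, 656]        # optimized, all field cache

# CONCLUSION1: ~60% faster on the largest base query (175K matching docs).
speedup = without_patch[0] / all_fieldcache[0] - 1.0   # ~0.60

# CONCLUSION2: skipping the fieldCache can more than double faceting time.
worst_slowdown = max(n / f for n, f in zip(no_fieldcache, all_fieldcache))

# Optimized vs non-optimized, both all-fieldCache: roughly 2x-6x faster.
opt_ratios = [a / b for a, b in zip(all_fieldcache, optimized_fc)]
```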


          People

          • Assignee: Yonik Seeley
          • Reporter: Yonik Seeley
          • Votes: 0
          • Watchers: 0
