SOLR-2855: Large number of range facets causes server to lock up


      Description

      (NOTE: I'm aware that this is not exactly a BUG in the sense that something is broken, but it is still behavior that might be worth changing.)

      A bug in our code triggered an integer overflow, which caused a very large range facet to be requested on a TrieIntField
      (e.g. facet.range.start=-2147483648&facet.range.end=1000&facet.range.gap=1)
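
      (For illustration, a minimal Java sketch of one way client code might arrive at such a value; the variable names are hypothetical. Negating or taking Math.abs of Integer.MIN_VALUE silently overflows back to Integer.MIN_VALUE, so a "normalize to positive" step can emit the most negative int.)

          int price = Integer.MIN_VALUE;  // e.g. a corrupt value from upstream
          int start = Math.abs(price);    // overflows: still -2147483648
          // interpolating start into facet.range.start yields the request above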

      This caused the Solr server to allocate huge amounts of memory, quickly filling the entire 18 GB available to the JVM; the garbage collector wasn't able to free it (fast enough?) and was thus taking up nearly all of the CPU, which finally led to a complete denial of service.

      While the cause of this behavior obviously lies in the buggy code, it might still be desirable for Solr not to lock up on "wrong" parameters but to throw an error instead.
      Throwing an exception if more ranges would be generated for this field than the facet.limit value (or a similar, new parameter) allows might be a good solution.
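
      (A rough sketch of such a guard in Java; the method name and threshold parameter are illustrative, not an existing Solr API. The idea is to compute the expected bucket count in 64-bit arithmetic before allocating anything, and fail fast if it is absurd.)

          // Hypothetical guard: reject range facets that would generate more
          // buckets than some configured limit.
          static void assertSaneRangeFacet(long start, long end, long gap, long maxBuckets) {
              if (gap <= 0) throw new IllegalArgumentException("gap must be positive");
              long buckets = (end - start + gap - 1) / gap;  // ceil((end - start) / gap)
              if (buckets > maxBuckets) {
                  throw new IllegalArgumentException("range facet would generate "
                      + buckets + " buckets (limit " + maxBuckets + ")");
              }
          }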

        Activity

        Hoss Man added a comment -

        > The problem with start+gap+end is that they are "easy" to get wrong compared to, let's say, start+end+number_of_gaps.

        Hmmm... I'm not sure why that would be true. If your client code can accidentally compute & request facet.range.start=$EXTREMELY_LOW_NUM, then couldn't someone else's client code just as easily compute & request facet.range.numOfGaps=$EXTREMELY_BIG_NUM?

        > I guess the latter is a more common use case anyway, and would save some calculation that can go haywire on overflows. Using static or stored numbers is always less error-prone than calculated ones.

        My experience is quite the opposite: most use cases I've seen for range faceting wouldn't benefit from a static number-of-gaps request – particularly when start & end are specified at run time, because that usually makes the size of the gaps very arbitrary and not very meaningful to people (i.e. if I'm searching for xmas gifts between $20 and $75, I'd rather see a price range facet in $10 increments (with the last one being $5) than see 10 even-sized ranges of $5.50 each). Besides which: specifying the gap (instead of the number of gaps) is the only way date faceting is of any use at all.

        (I'd love to hear about more use cases where specifying start+end+numBuckets would be advantageous, but that really belongs on the solr-user list, or in a new feature-request issue.)

        If you really want direct control over the number of gaps, you may want to track SOLR-2366 – there's been a lot of hashing out what the API should look like, but at a high level it's aiming for a feature where you specify the exact endpoints of the ranges you want in a succinct manner.

        Bernhard Frauendienst added a comment -

        I'm afraid I have to agree with you.

        The only reason I filed a bug for this is the huge pain this issue has caused us. It was very hard to pin down, and it took us ages to reproduce it in a testing environment. It didn't throw any errors, and there was nothing in the logs (since the query never completed, and in the rare cases it did, by that time all queries had insanely high QTimes). The only thing we saw was the servers locking up.

        The problem with start+gap+end is that they are "easy" to get wrong compared to, let's say, start+end+number_of_gaps.
        I guess the latter is a more common use case anyway, and would save some calculation that can go haywire on overflows. Using static or stored numbers is always less error-prone than calculated ones.
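
        (A hypothetical sketch of that parameterization in Java; nothing like this exists in Solr's API. With a fixed bucket count the client derives the gap in long arithmetic, so no client-side subtraction can silently wrap around.)

            // Hypothetical helper: derive the gap from a fixed number of buckets.
            static long gapFor(int start, int end, int numBuckets) {
                long span = (long) end - (long) start;  // safe even for extreme ints
                return Math.max(1L, span / numBuckets);
            }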

        wrt timeAllowed: ah, yes, I should have known that one.

        Hoss Man added a comment -

        > Throwing an exception if more ranges would be generated for this field than the facet.limit value (or a similar, new parameter) allows might be a good solution.

        There are already params that specify a limit on the number of ranges computed: facet.range.start + facet.range.end + facet.range.gap. If those values are computed programmatically by a client, adding a new "facet.range.max.expected.ranges" param wouldn't make this problem go away, since the client could just as easily compute an incorrect value for that new param as well.

        If you'd like to submit a patch to add a new sanity-checking param like this for range faceting, you're free to do so – but personally I don't see a lot of value. The general problem description is "client code sent a request that asked Solr to do more work than the client really wanted", and range faceting is a single example of that – other examples would be really big values for the rows, facet.limit, or timeAllowed params, or using "fl=*" on an index with thousands of stored fields... at a certain point we have to just assume the client app really does want what it asked for.

        FWIW...

        > b) the query isn't aborted by overstepping the timeAllowed value

        As documented, timeAllowed is only applied to the search for documents matching the query, not to total request processing (i.e. it doesn't affect request parsing, query parsing, faceting, highlighting, response writing, etc.):

        https://wiki.apache.org/solr/CommonQueryParameters#timeAllowed
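
        (For reference, a hedged SolrJ sketch of what timeAllowed does and does not bound; the params are the standard ones, but the field name and values are made up.)

            // org.apache.solr.client.solrj.SolrQuery
            SolrQuery q = new SolrQuery("*:*");
            q.setTimeAllowed(1000);         // ms; bounds only collecting matching docs
            q.set("facet", "true");
            q.set("facet.range", "price");
            q.set("facet.range.start", "0");
            q.set("facet.range.end", "1000");
            q.set("facet.range.gap", "1");  // the facet computation itself is NOT bounded by timeAllowed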

        Bernhard Frauendienst added a comment -

        I would like to add that
        a) the server doesn't run out of memory (in the sense of an OOM), but allocates so many objects in such a short time (several gigabytes per second in our setup) that the GC can't keep up and ends up taking all the CPU time;
        b) the query isn't aborted by overstepping the timeAllowed value.


          People

          Assignee: Unassigned
          Reporter: Bernhard Frauendienst