Lucene - Core
  1. Lucene - Core
  2. LUCENE-504

FuzzyQuery produces a "java.lang.NegativeArraySizeException" in PriorityQueue.initialize if I use Integer.MAX_VALUE as BooleanQuery.MaxClauseCount

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 1.9
    • Fix Version/s: 3.0
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      Patch Available

      Description

      PriorityQueue creates an "java.lang.NegativeArraySizeException" when initialized with Integer.MAX_VALUE, because Integer overflows. I think this could be a general problem with PriorityQueue. The Error occured when I set BooleanQuery.MaxClauseCount to Integer.MAX_VALUE and user a FuzzyQuery for searching.

      1. BooleanQuery.java.diff
        0.8 kB
        Doron Cohen
      2. fuzzyquery.patch
        5 kB
        Nadav Har'El
      3. LUCENE-504.patch
        6 kB
        Uwe Schindler
      4. LUCENE-504.patch
        3 kB
        Uwe Schindler
      5. PriorityQueue.java.diff
        0.9 kB
        Doron Cohen
      6. TestFuzzyQueryError.java
        2 kB
        Joerg Henss

        Activity

        Hide
        Joerg Henss added a comment -

        Simple test showing the error

        Show
        Joerg Henss added a comment - Simple test showing the error
        Hide
        Paul Borgermans added a comment -

        Here the same problem, used a MultiFieldQueryParser in combination with fuzzy search which triggered the exception

        Show
        Paul Borgermans added a comment - Here the same problem, used a MultiFieldQueryParser in combination with fuzzy search which triggered the exception
        Hide
        Doron Cohen added a comment -

        LuceneFAQ item "Why am I getting a TooManyClauses exception?"
        suggests: "use BooleanQuery.setMaxClauseCount(Integer.MAX_VALUE)".

        This would cause PriorityQueue to create an array of size maxint:

        • SUN JRE throws an out-of-memory error.
        • IBM JRE throws a "too large allocation" runtime error.

        Seems that at least two fixes are required:
        + BooleanQuery - hard limit on value of maxClauseCount
        + PriorityQueue - hard limit on size of queue, since it is stored in memory.

        The attached would set 1,000,000 hard limit - defined in PriorityQueu - I think the most intensive use of
        priority queue is for docs/hits during search, and 1M docs seems sufficient, unless there are uses that I am not aware of.

        Show
        Doron Cohen added a comment - LuceneFAQ item "Why am I getting a TooManyClauses exception?" suggests: "use BooleanQuery.setMaxClauseCount(Integer.MAX_VALUE)". This would cause PriorityQueue to create an array of size maxint: SUN JRE throws an out-of-memory error. IBM JRE throws a "too large allocation" runtime error. Seems that at least two fixes are required: + BooleanQuery - hard limit on value of maxClauseCount + PriorityQueue - hard limit on size of queue, since it is stored in memory. The attached would set 1,000,000 hard limit - defined in PriorityQueu - I think the most intensive use of priority queue is for docs/hits during search, and 1M docs seems sufficient, unless there are uses that I am not aware of.
        Hide
        Otis Gospodnetic added a comment -

        I imagine the limit will depend on how big of a heap you allow, no? What happens if you increase the heap size drammatically with -XmxNNNm?

        Show
        Otis Gospodnetic added a comment - I imagine the limit will depend on how big of a heap you allow, no? What happens if you increase the heap size drammatically with -XmxNNNm?
        Hide
        Doron Cohen added a comment -

        Yes this is correct - e.g. on a win32 machine with 2GB RAM, SUN 1.5 JRE would accept up to Xmx1470m and in that case you could set the limit on the queue size to 355,638,512 - 17% of maxint, before getting an out of mem error.

        For allowing the caller maximal flexibility (and responsibility), BooleanQuery could interpret the maxint as a hint saying "maximal possible value" and then silently modify it to maxint-1, thereby avoiding the negative array size issue in PriorityQueue (and possibly fail later with out of memory).

        Is this what you have in mind?

        Show
        Doron Cohen added a comment - Yes this is correct - e.g. on a win32 machine with 2GB RAM, SUN 1.5 JRE would accept up to Xmx1470m and in that case you could set the limit on the queue size to 355,638,512 - 17% of maxint, before getting an out of mem error. For allowing the caller maximal flexibility (and responsibility), BooleanQuery could interpret the maxint as a hint saying "maximal possible value" and then silently modify it to maxint-1, thereby avoiding the negative array size issue in PriorityQueue (and possibly fail later with out of memory). Is this what you have in mind?
        Hide
        Nadav Har'El added a comment -

        Hi Doron and Otis,

        My view is that this bug is a problem in FuzzyQuery, not in PriorityQueue or BooleanQuery. It is the caller's duty to create a priority queue with a sensible size, and it's not BooleanQuery's fault that other classes are using its getMaxClauseCount() wrongly. Moreover, changing PriorityQueue or BooleanQuery in the way Doron did, might potentially have side-effects because they are used in many places in Lucene. How do we know that nowhere in Lucene will we ever need a priority queue with a million elements?

        Therefore I suggest a different patch, changing only FuzzyQuery.

        he idea in the patch I'm attaching is that while FuzzyQuery is allowed to create BooleanQuery.getMaxClauseCount() clauses, it doesn't have to do so, and doesn't need to create a priority queue of that size when this number is huge. I added to FuzzyQuery its own maxClauseCount (with per-instance getter/setter methods) and FuzzyQuery will never try to create more than this number of clauses, even if Boolean.getMaxClauseCount() is huge.

        I set FuzzyQuery's default maxClauseCount to 1024, for backward compatibility (this is what happens today when you use BooleanQuery's defaults). However, it is likely that a user will want to set this number much lower than that. In fact, the whole idea of the priority queue is that it may be enough to take only a small number the most similar terms, so setting a FuzzyQuery's maxClauseCount to 100 or even 10 makes sense to me. This new feature is an added benefit of my patch.

        My patch also includes Javadoc changes describing the new feature, and a new test case that failed before the fix, and succeeds after this patch.

        Show
        Nadav Har'El added a comment - Hi Doron and Otis, My view is that this bug is a problem in FuzzyQuery, not in PriorityQueue or BooleanQuery. It is the caller's duty to create a priority queue with a sensible size, and it's not BooleanQuery's fault that other classes are using its getMaxClauseCount() wrongly. Moreover, changing PriorityQueue or BooleanQuery in the way Doron did, might potentially have side-effects because they are used in many places in Lucene. How do we know that nowhere in Lucene will we ever need a priority queue with a million elements? Therefore I suggest a different patch, changing only FuzzyQuery. he idea in the patch I'm attaching is that while FuzzyQuery is allowed to create BooleanQuery.getMaxClauseCount() clauses, it doesn't have to do so, and doesn't need to create a priority queue of that size when this number is huge. I added to FuzzyQuery its own maxClauseCount (with per-instance getter/setter methods) and FuzzyQuery will never try to create more than this number of clauses, even if Boolean.getMaxClauseCount() is huge. I set FuzzyQuery's default maxClauseCount to 1024, for backward compatibility (this is what happens today when you use BooleanQuery's defaults). However, it is likely that a user will want to set this number much lower than that. In fact, the whole idea of the priority queue is that it may be enough to take only a small number the most similar terms, so setting a FuzzyQuery's maxClauseCount to 100 or even 10 makes sense to me. This new feature is an added benefit of my patch. My patch also includes Javadoc changes describing the new feature, and a new test case that failed before the fix, and succeeds after this patch.
        Hide
        Nadav Har'El added a comment -

        This is my proposed patch described above.

        Show
        Nadav Har'El added a comment - This is my proposed patch described above.
        Hide
        Nadav Har'El added a comment -

        Hi Otis, you did not comment on my patch (fuzzyquery.patch), which I think solves your objections to Doron's previous patch. Do you also see problems in this patchl, or would prefer a different approach to solving this issue?
        If not, perhaps you can commit this patch?
        Thanks in advance, Nadav.

        Show
        Nadav Har'El added a comment - Hi Otis, you did not comment on my patch (fuzzyquery.patch), which I think solves your objections to Doron's previous patch. Do you also see problems in this patchl, or would prefer a different approach to solving this issue? If not, perhaps you can commit this patch? Thanks in advance, Nadav.
        Hide
        Doron Cohen added a comment -

        I think it makes sense to separate here between efficiency and correctness.

        The proposed fix above deals with efficiency, and maybe it should become a new separate future issue, perhaps also considering a more lazy allocation of space by PriorityQueue.

        This issue is about correctness - setting max-clause "by-the-book" throws an exception - let's fix just that..

        Modifying PriorityQueue to be tolerant to maxint capacity is one possible solution, though perhaps an overkill.

        A simpler way is for PriorityQueue to silently decrement its maxsize by 1 in case maxsize is requested to be maxint.

        Show
        Doron Cohen added a comment - I think it makes sense to separate here between efficiency and correctness. The proposed fix above deals with efficiency, and maybe it should become a new separate future issue, perhaps also considering a more lazy allocation of space by PriorityQueue. This issue is about correctness - setting max-clause "by-the-book" throws an exception - let's fix just that.. Modifying PriorityQueue to be tolerant to maxint capacity is one possible solution, though perhaps an overkill. A simpler way is for PriorityQueue to silently decrement its maxsize by 1 in case maxsize is requested to be maxint.
        Hide
        Mark Miller added a comment -

        This really should be fixed. I agree with Nadav, the problem is with the Queries that use MaxClauses to build the priority queue - its just not good use of the queue.

        I also agree with Doron that its really two distinct issues though:

        1. FuzzyQuery et al should be using lazy allocation of some kind. It makes no sense to create an array that large for every multi term query, especially when we direct users to possibly use Integer.MAX_VALUE.

        2. The FAQ should not say to use Integer.MAX_VALUE if its going to kak with FuzzyQuery. At a minimum, FuzzyQuery should not init the priority queue larger than Integer.MAX_VALUE-1.

        I suppose, that almost combines the 2 issues though - we are telling users to set the max clauses to a number that will force ridiculously wasteful memory allocation. Almost seems like both issues really need to be addressed together. Or maybe just the FAQ changed

        Show
        Mark Miller added a comment - This really should be fixed. I agree with Nadav, the problem is with the Queries that use MaxClauses to build the priority queue - its just not good use of the queue. I also agree with Doron that its really two distinct issues though: 1. FuzzyQuery et al should be using lazy allocation of some kind. It makes no sense to create an array that large for every multi term query, especially when we direct users to possibly use Integer.MAX_VALUE. 2. The FAQ should not say to use Integer.MAX_VALUE if its going to kak with FuzzyQuery. At a minimum, FuzzyQuery should not init the priority queue larger than Integer.MAX_VALUE-1. I suppose, that almost combines the 2 issues though - we are telling users to set the max clauses to a number that will force ridiculously wasteful memory allocation. Almost seems like both issues really need to be addressed together. Or maybe just the FAQ changed
        Hide
        George Papas added a comment -

        Hi,

        This is still an issue in 2.4.0. I know this is low priority, but has there been any more thinking about how to address this?

        Thanks
        George.

        Show
        George Papas added a comment - Hi, This is still an issue in 2.4.0. I know this is low priority, but has there been any more thinking about how to address this? Thanks George.
        Hide
        Florian Waltersdorfer added a comment -

        Hi,

        something along the line of
        ScoreDoc[] results = searcher.search(query, MAX_NUMBER_OF_HITS).scoreDocs;
        causes a viable crash. I would consider this a "natural" start when trying to write an own searcher (MAX_NUMBER_OF_HITS beeing Integer.MAX_VALUE) and without the sources attached & the (eclipse) debugger I would have never found the problem I'd guess.
        So please do something about this pitfall, if only adding a warning in the javadoc of i.e. Searcher.search(..., int n, ...)

        Thanks
        Florian

        Show
        Florian Waltersdorfer added a comment - Hi, something along the line of ScoreDoc[] results = searcher.search(query, MAX_NUMBER_OF_HITS).scoreDocs; causes a viable crash. I would consider this a "natural" start when trying to write an own searcher (MAX_NUMBER_OF_HITS beeing Integer.MAX_VALUE) and without the sources attached & the (eclipse) debugger I would have never found the problem I'd guess. So please do something about this pitfall, if only adding a warning in the javadoc of i.e. Searcher.search(..., int n, ...) Thanks Florian
        Hide
        Ivan Rozhnov added a comment -

        Hi guys,

        It seems to me that problem is still opened. Can it be fixed with dynamic size of storage in PriorityQueue and couple of similar classes and using MaxClauseCount as top limit for size of such storage. It seems to be very weird to have preinitialized array of Max size in collection constructor.

        Thanks,
        Ivan

        Show
        Ivan Rozhnov added a comment - Hi guys, It seems to me that problem is still opened. Can it be fixed with dynamic size of storage in PriorityQueue and couple of similar classes and using MaxClauseCount as top limit for size of such storage. It seems to be very weird to have preinitialized array of Max size in collection constructor. Thanks, Ivan
        Hide
        Michael McCandless added a comment -

        Since 3.0 is now on Java 1.5, can't we switch to Java's PriorityQueue? Anyone want to cough up a patch?

        Show
        Michael McCandless added a comment - Since 3.0 is now on Java 1.5, can't we switch to Java's PriorityQueue? Anyone want to cough up a patch?
        Hide
        Uwe Schindler added a comment - - edited

        We had a discussion about that in another issue. In general PriorityQueue of Java 1.5 does not have the features we need for Lucene (it dynamically grows, but the grow process is not controllable, making it unuseable for collecting TopDocs and so on). But I think for this special case, we could use Java 5's PQ.

        Show
        Uwe Schindler added a comment - - edited We had a discussion about that in another issue. In general PriorityQueue of Java 1.5 does not have the features we need for Lucene (it dynamically grows, but the grow process is not controllable, making it unuseable for collecting TopDocs and so on). But I think for this special case, we could use Java 5's PQ.
        Hide
        Nadav Har'El added a comment -

        Hi Uwe, I think that even though PriorityQueue doesn't have a size limit, it is easy to implement a size limit: after an add(), if size() becomes greater than the bound, you simply poll() to remove the lowest element (this poll() returns the old object which insertWithOverflow() is to return).

        However, I think it's a good idea to compare the performance of Java's PriorityQueue (used as in the paragraph above) . I'm especially worried about the slowdown by the fact that adding a small element (below the current heap's head) in our code just does one comparison and returns, but in the usage I described above it actually modifies the heap twice (adds the element to the heap and then removes it).

        Show
        Nadav Har'El added a comment - Hi Uwe, I think that even though PriorityQueue doesn't have a size limit, it is easy to implement a size limit: after an add(), if size() becomes greater than the bound, you simply poll() to remove the lowest element (this poll() returns the old object which insertWithOverflow() is to return). However, I think it's a good idea to compare the performance of Java's PriorityQueue (used as in the paragraph above) . I'm especially worried about the slowdown by the fact that adding a small element (below the current heap's head) in our code just does one comparison and returns, but in the usage I described above it actually modifies the heap twice (adds the element to the heap and then removes it).
        Hide
        Uwe Schindler added a comment - - edited

        Nadav:
        I suggest to keep Lucene's PriorityQueue, because it is a very central and highly optimized part of Liucene. In Lucene 3.0 it is already generified, so it also fits perfectly into Java's Collection API. The only problem is that the name is now identical to one internal Java class, but we cannot change it without BW breaks.

        For this special issue, we should fix only FuzzyQuery to use Java5's PQ, which dynamically grows when new elements are added. And we do not need the upper limit here, like you propsed.

        I will prepare a patch tomorrow in the ApacheCon hacking session.

        EDIT

        We need the upper limit here, but we can implement it like you proposed.

        Show
        Uwe Schindler added a comment - - edited Nadav: I suggest to keep Lucene's PriorityQueue, because it is a very central and highly optimized part of Liucene. In Lucene 3.0 it is already generified, so it also fits perfectly into Java's Collection API. The only problem is that the name is now identical to one internal Java class, but we cannot change it without BW breaks. For this special issue, we should fix only FuzzyQuery to use Java5's PQ, which dynamically grows when new elements are added. And we do not need the upper limit here, like you propsed. I will prepare a patch tomorrow in the ApacheCon hacking session. EDIT We need the upper limit here, but we can implement it like you proposed.
        Hide
        Uwe Schindler added a comment -

        Here is a patch for this issue, using j.u.PriorityQueue. It currently does not limit the PQ's number of entries, it just only consumes maxClauseCount ones.

        All tests pass, but they are no real test of the PQ behaviour, as the current test cases do not test more terms than maxClauseCount. So the tests pass in all cases, independent how the compareTo method looks like, so the ordering is not important because the queue never gets full. I will add a test.

        I will also try to implement the max size, but for now, the patch shows, how the code could look like with j.u.PQ.

        Show
        Uwe Schindler added a comment - Here is a patch for this issue, using j.u.PriorityQueue. It currently does not limit the PQ's number of entries, it just only consumes maxClauseCount ones. All tests pass, but they are no real test of the PQ behaviour, as the current test cases do not test more terms than maxClauseCount. So the tests pass in all cases, independent how the compareTo method looks like, so the ordering is not important because the queue never gets full. I will add a test. I will also try to implement the max size, but for now, the patch shows, how the code could look like with j.u.PQ.
        Hide
        Uwe Schindler added a comment -

        Here a patch, that fixes FuzzyQuery. It uses j.u.PQ and has a small optimization to not add ScoreTerms, that would never be seen in the first maxClauseCount terms (tracking a bottom value).

        It also adds an testcase, where the maxClauseCount is lowerd downto 2 and the FuzzyQuery would hit 3 terms, so one too much. This tests the algorithm to not add this entry to the PQ.

        Nadav: The solution you propsed for limiting the number of entries in the PQ does not work, as poll() removes the head element of the queue, not the bottom that falls out. There is no way to remove the bottom value in Java's PQ easily. So we should j.u.PQ only for this issue.

        Show
        Uwe Schindler added a comment - Here a patch, that fixes FuzzyQuery. It uses j.u.PQ and has a small optimization to not add ScoreTerms, that would never be seen in the first maxClauseCount terms (tracking a bottom value). It also adds an testcase, where the maxClauseCount is lowerd downto 2 and the FuzzyQuery would hit 3 terms, so one too much. This tests the algorithm to not add this entry to the PQ. Nadav: The solution you propsed for limiting the number of entries in the PQ does not work, as poll() removes the head element of the queue, not the bottom that falls out. There is no way to remove the bottom value in Java's PQ easily. So we should j.u.PQ only for this issue.
        Hide
        Nadav Har'El added a comment -

        Uwe, you are right, I got confused... Sorry.

        Show
        Nadav Har'El added a comment - Uwe, you are right, I got confused... Sorry.
        Hide
        Uwe Schindler added a comment -

        I think I take this one and commit it, is it ok? Thesolution seems to work quite good.

        Show
        Uwe Schindler added a comment - I think I take this one and commit it, is it ok? Thesolution seems to work quite good.
        Hide
        Uwe Schindler added a comment -

        Committed revision: 833544

        Show
        Uwe Schindler added a comment - Committed revision: 833544

          People

          • Assignee:
            Uwe Schindler
            Reporter:
            Joerg Henss
          • Votes:
            4 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development