Lucene - Core / LUCENE-2257

relax the per-segment max unique term limit

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9.2, 3.0.1, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Lucene can't handle more than 2.1B (the limit of a signed 32-bit int) unique terms in a single segment.

      But I think we can improve this to termIndexInterval (default 128) * 2.1B. There is one place (internal API only) where Lucene uses an int but should use a long.
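
      To make the arithmetic concrete, here is a minimal sketch (hypothetical names, not the actual TermInfosReader code) of the computation that overflows: the term index only records every termIndexInterval-th term, so the index slot itself fits in an int, but the absolute term ordinal reconstructed from it can exceed Integer.MAX_VALUE and has to be a long.

      // Minimal sketch with hypothetical names; not the actual Lucene internals.
      public class TermOrdinalSketch {
        static final int TERM_INDEX_INTERVAL = 128; // Lucene's default

        // Broken: int arithmetic wraps once indexSlot * interval passes 2^31 - 1.
        static int ordinalAsInt(int indexSlot, int offsetInBlock) {
          return indexSlot * TERM_INDEX_INTERVAL + offsetInBlock;
        }

        // Fixed: widen to long before multiplying, the kind of change the patch makes
        // in the one internal API mentioned above.
        static long ordinalAsLong(int indexSlot, int offsetInBlock) {
          return (long) indexSlot * TERM_INDEX_INTERVAL + offsetInBlock;
        }

        public static void main(String[] args) {
          int slot = 20000000; // ~2.56B terms in, i.e. past the 2.1B int limit
          System.out.println(ordinalAsInt(slot, 0));  // negative: wrapped around
          System.out.println(ordinalAsLong(slot, 0)); // 2560000000
        }
      }

      With the index slot still an int, the per-segment ceiling becomes roughly termIndexInterval * 2.1B, which is the improvement described above.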

      Attachments

      1. LUCENE-2257.patch
        1 kB
        Michael McCandless
      2. LUCENE-2257.patch
        13 kB
        Michael McCandless

        Activity

        Michael McCandless added a comment -

        Possible patch fixing the issue. I'm not yet certain there is no other place where we use an int...

        Tom Burton-West added a comment -

        Thanks for the patch Michael,

        The patch worked fine with CheckIndex. CheckIndex ran successfully against an index with 2.49 billion terms.
        I added commas to the output below:
        test: terms, freq, prox...OK [2,487,224,745 terms; 23,573,976,855 terms/docs pairs; 97,223,318,067 tokens]

        We are working on determining how to test it with Solr. The ArrayIndexOutOfBounds exception appears in the logs for about 1 in every 100 queries. We haven't been able to determine which queries trigger the problem.

        We are using an older version of Solr with Lucene 2.9-dev 779312 (2009-05-27 17:19:55). I'm not sure if we can just drop in a later version of Lucene with the patch, or if I need to patch the older 2.9-dev Lucene version that came with our Solr. What do you suggest?

        What I'm thinking of is to run 10,000 queries against our dev server pointing at one of the large segment indexes with and without the patch.

        Tom
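
        (For reference, not from the issue itself: a minimal sketch of driving CheckIndex programmatically against an index directory, assuming the Lucene 2.9/3.0 API; the command-line form "java org.apache.lucene.index.CheckIndex <indexDir>" prints a similar per-segment report to the one quoted above.)

        // Sketch: run CheckIndex from Java against an existing index directory.
        // API names assume Lucene 2.9/3.0 (FSDirectory.open, CheckIndex.Status).
        import java.io.File;
        import org.apache.lucene.index.CheckIndex;
        import org.apache.lucene.store.Directory;
        import org.apache.lucene.store.FSDirectory;

        public class CheckIndexSketch {
          public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(new File(args[0]));
            CheckIndex checker = new CheckIndex(dir);
            checker.setInfoStream(System.out);           // print the per-segment report
            CheckIndex.Status status = checker.checkIndex();
            System.out.println(status.clean ? "index OK" : "index has problems");
            dir.close();
          }
        }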

        Michael McCandless added a comment -

        OK, I'm glad to hear that.

        The attached patch applies to 2.9, and I think it should apply fine to the revision of Lucene (779312) you're using within Solr. I'd recommend checking out that exact revision of Lucene (svn co -r779312 ...), applying this patch, building a JAR, and replacing Solr's Lucene JAR with this one.

        It's only queries that contain terms above the 2.1B mark (your last ~390 M terms) that will hit the exception. Once you find such a query it should always hit the exception on this large segment.

        Tom Burton-West added a comment -

        Hi Michael,

        Thanks for your help. We mounted one of the indexes with 2.4 billion terms on our dev server and tested with and without the patch. (I discovered that queries containing Korean characters would consistently trigger the bug). With the patch, we don't see any ArrayIndexOutOfBounds exceptions. We are going to do a bit more testing before we put this into production. (We rolled back our production indexes temporarily to indexes that split the terms over 2 segments and therefore didn't trigger the bug).

        Other than walking through the code in the debugger, is there some systematic way of looking for any other places where an int is used that might also have problems when we have over 2.1 billion terms?

        Tom

        Robert Muir added a comment -

        (I discovered that queries containing Korean characters would consistently trigger the bug).

        This makes sense because Hangul is sorted towards the end of the term dictionary.

        You can see this visually here: http://unicode.org/roadmaps/bmp/
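
        (As a quick illustration, not taken from the issue: in this version of Lucene terms are ordered by UTF-16 character value, and Hangul syllables start at U+AC00, high in the Basic Multilingual Plane, so any Hangul term compares greater than a Latin-script term and lands late in the term dictionary.)

        // Sketch: Hangul sorts after Latin in the character order used for terms.
        public class TermOrderSketch {
          public static void main(String[] args) {
            String latin = "zebra";
            String hangul = "\uD55C\uAE00"; // "한글"; Hangul syllables begin at U+AC00
            // Negative result: "zebra" sorts before the Hangul term, so Hangul
            // terms sit near the end of the term dictionary.
            System.out.println(latin.compareTo(hangul));
          }
        }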

        Michael McCandless added a comment -

        With the patch, we don't see any ArrayIndexOutOfBounds exceptions.

        Great! And the results look correct?

        Other than walking through the code in the debugger, is there some systematic way of looking for any other places where an int is used that might also have problems when we have over 2.1 billion terms?

        Not that I know of! The code that handles the term dict lookup is
        fairly contained, in TermInfosReader and SegmentTermEnum. I think
        scrutinizing the code and testing (as you're doing) is the only way.

        I just looked again – there are a few places where int is still being used.

        First is two methods (get(int position) and scanEnum) in
        TermInfosReader that are actually dead code (package private &
        unused). Second is the int in SegmentTermEnum.scanTo, but this is fine
        because it's never asked to scan more than termIndexInterval terms.

        I'll attach a patch that additionally just removes that dead code.
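
        (A sketch with made-up names, not the real TermInfosReader/SegmentTermEnum code, of why an int is still fine for that last scan: after seeking to the nearest indexed term, the enum only scans the remainder within one index block, which is at most termIndexInterval - 1 terms.)

        // Hypothetical names; only the absolute term ordinal needs to be a long.
        public class SeekSketch {
          static final int INTERVAL = 128; // termIndexInterval default

          public static void main(String[] args) {
            long termOrdinal = 2487224745L;                  // past the 2.1B int limit
            int indexSlot = (int) (termOrdinal / INTERVAL);  // fits an int up to ~274B terms
            int remaining = (int) (termOrdinal % INTERVAL);  // always < 128, so int is plenty
            System.out.println(indexSlot + " blocks in, then scan " + remaining + " terms");
          }
        }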

        Michael McCandless added a comment -

        Committed for 2.9.2, 3.0.1, 3.1.

        Koji Sekiguchi added a comment -

        Hello,
        I'd like to confirm what "term" means in "unique terms". Is it a Term (unique terms across all fields in a single segment) or a word (unique terms in each field in a single segment)? Thanks.

        Michael McCandless added a comment -

        Yes, the limit is the number of unique terms per segment.

        Flex actually increases the limit (under flex the limit is per-field, per-segment; in trunk the limit is across all fields in the segment).


          People

          • Assignee:
            Michael McCandless
          • Reporter:
            Michael McCandless
          • Votes:
            0
          • Watchers:
            0
