Lucene - Core
LUCENE-5843

IndexWriter should refuse to create an index with more than INT_MAX docs

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.9.1, 4.10, 6.0
    • Component/s: core/index
    • Labels: None
    • Lucene Fields: New

      Description

      It's more and more common for users these days to create very large indices, e.g. indexing lines from log files, or packets on a network, etc., and it's not hard to accidentally exceed the maximum number of documents in one index.

      I think the limit is actually Integer.MAX_VALUE-1 docs, because we use that value as a sentinel during searching.

      I'm not sure what IW does today if you create a too-big index but it's probably horrible; it may succeed and then at search time you hit nasty exceptions when we overflow int.

      I think it should throw an IndexFullException instead. It'd be nice if we could do this on the very doc that, when added, would go over the limit, but I would also settle for just throwing at flush as well ... i.e. I think what's really important is that the index does not become unusable.

      1. LUCENE-5843.patch
        30 kB
        Michael McCandless
      2. LUCENE-5843.patch
        31 kB
        Michael McCandless
      3. LUCENE-5843.patch
        29 kB
        Michael McCandless
      4. LUCENE-5843.patch
        23 kB
        Michael McCandless

          Activity

          Hoss Man added a comment -

          I'm not sure what IW does today if you create a too-big index but it's probably horrible; it may succeed and then at search time you hit nasty exceptions when we overflow int.

          that is exactly what happens – see SOLR-6065 for context

          Robert Muir added a comment -

          In my opinion the way to go is to have a package-private limit (for test purposes) that defaults to Integer.MAX_VALUE.

          this way we can actually test the thing with values like... 5

          It's more than just checking at addDocument (ideal) or at flush (not great but, as you say, better); we also have to handle cases like addIndexes(Dir) and addIndexes(Reader).

          Jack Krupansky added a comment -

          That Solr Jira has my comments as well, but I just want to reiterate that the actual limit should be more clearly documented. I filed a Jira for that quite a while ago - LUCENE-4104. And if this new issue will resolve the problem, please mark my old LUCENE-4105 issue as a duplicate.

          Uwe Schindler added a comment -

          I'm not sure what IW does today if you create a too-big index but it's probably horrible; it may succeed and then at search time you hit nasty exceptions when we overflow int.

          If a single segment exceeds the limit while merging, it's horrible. If you have an index that exceeds the limit, you get an exception when opening it: BaseCompositeReader throws in its constructor:

                maxDoc += r.maxDoc();      // compute maxDocs
                if (maxDoc < 0 /* overflow */) {
                  throw new IllegalArgumentException("Too many documents, composite IndexReaders cannot exceed " + Integer.MAX_VALUE);
                }
          

          The limit is MAX_VALUE; the -1 is just a stupid limitation of TopDocs, but the real limit is actually smaller, because arrays have a maximum size in Java. DocIdSetIterator's sentinel is not a problem, because it's simply the last document (MAX_VALUE), which is always the last possible one (the iterator is always exhausted once you reach the last doc).
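
          For reference, the search-time sentinel discussed here is DocIdSetIterator.NO_MORE_DOCS, which is defined as Integer.MAX_VALUE. A minimal check, assuming Lucene 4.x on the classpath:

            import org.apache.lucene.search.DocIdSetIterator;

            public class SentinelCheck {
              public static void main(String[] args) {
                // NO_MORE_DOCS is Integer.MAX_VALUE; per Uwe's point, that is always the
                // last possible doc id, so an iterator positioned there is exhausted anyway.
                System.out.println(DocIdSetIterator.NO_MORE_DOCS == Integer.MAX_VALUE); // true
              }
            }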

          Michael McCandless added a comment -

          The limit is MAX_VALUE; the -1 is just a stupid limitation of TopDocs, but the real limit is actually smaller, because arrays have a maximum size in Java.

          Yeah, I'll set the limit (public static int on IndexWriter) to ArrayUtil.MAX_ARRAY_LENGTH.

          Uwe Schindler added a comment -

          Should we then also fix the CompositeReader ctor check to be consistent?

          Michael McCandless added a comment -

          Should we then also fix the CompositeReader ctor check to be consistent?

          I think so ... I'll do that.

          Uwe Schindler added a comment -

          Yeah, I'll set the limit (public static int on IndexWriter) to ArrayUtil.MAX_ARRAY_LENGTH

          I am not sure if this is the best idea. This constant is dynamic, so it could happen that an IndexWriter on a 32-bit JVM creates an index that is not readable on another JVM (e.g., 64-bit and a different vendor), because the constant changes.

          I am fine with using the dynamic constant for stuff like overallocating arrays and so on, but we should hard-code the maximum document number in an index in a system-independent way.
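
          (For context: ArrayUtil.MAX_ARRAY_LENGTH is dynamic because it is derived from the JVM's array header size, which differs between 32-bit and 64-bit VMs. A minimal illustration, assuming Lucene 4.x, where the constant was defined in terms of RamUsageEstimator's per-JVM header size:)

            import org.apache.lucene.util.RamUsageEstimator;

            public class MaxArrayLengthCheck {
              public static void main(String[] args) {
                // The array header size is probed per JVM (32- vs. 64-bit, compressed oops),
                // so a limit derived from it would differ from VM to VM -- Uwe's objection.
                System.out.println(Integer.MAX_VALUE - RamUsageEstimator.NUM_BYTES_ARRAY_HEADER);
              }
            }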

          Michael McCandless added a comment -

          I am fine with using the dynamic constant for stuff like overallocating arrays and so on, but we should hard-code the maximum document number in an index in a system-independent way.

          Hmm, good point ... I'll make it fixed.

          Michael McCandless added a comment -

          Patch, just adding a counter to IndexWriter that tracks how many docs
          are in the index.

          I added public static final int IndexWriter.MAX_DOCS with the limit
          set to ArrayUtil.MAX_ARRAY_LENGTH, and added a new
          oal.index.IndexFullException, thrown by any method that adds docs to
          the index.

          Internally, the counter acts like a credit card: before any operation
          is allowed to actually modify the index / change the segmentInfos or
          DWPT state (via addIndexes, add/updateDocument/s), it first "charges"
          the credit card and if the charge fails it throws IndexFullException.

          When a merge commits, I deduct the deleted docs it had reclaimed, and
          I also deduct in places where we drop 100% deleted segments.

          I didn't add a setting to IWC for this ... I think that's sort of
          odd.
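
          A minimal sketch of the "credit card" pattern described above, using hypothetical names (DocCountGuard, reserve, release); the actual patch wires this logic into IndexWriter itself:

            import java.util.concurrent.atomic.AtomicLong;

            public class DocCountGuard {
              private final int maxDocs;                   // e.g. IndexWriter.MAX_DOCS
              private final AtomicLong pendingNumDocs = new AtomicLong();

              public DocCountGuard(int maxDocs) {
                this.maxDocs = maxDocs;
              }

              // "Charge the credit card" before any operation may modify the index;
              // if the charge would exceed the limit, roll it back and throw.
              public void reserve(int numDocs) {
                long count = pendingNumDocs.addAndGet(numDocs);
                if (count > maxDocs) {
                  pendingNumDocs.addAndGet(-numDocs);      // undo the charge
                  throw new IllegalStateException(
                      "number of documents in the index cannot exceed " + maxDocs);
                }
              }

              // "Refund" docs reclaimed by a committing merge, or when 100%-deleted
              // segments are dropped.
              public void release(int numDocs) {
                pendingNumDocs.addAndGet(-numDocs);
              }
            }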

          Shai Erera added a comment -

          I added public static final int IndexWriter.MAX_DOCS with the limit
          set to ArrayUtil.MAX_ARRAY_LENGTH.

          But MAX_ARRAY_LENGTH is dynamic, and depends on the JRE (32/64-bit). So that's not "fixed" across JVMs, right?

          Robert Muir added a comment -

          Can it be INT_MAX-8 for this reason?

          Shai Erera added a comment -

          Yes, I think we shouldn't try to be too smart here. It can even be MAX_INT-1024 for all practical purposes (and if we want to be on the safe side w/ int[] allocations), as I doubt anyone will complain he cannot put MAX_INT-1023 docs in an index...

          Michael McCandless added a comment -

          Sorry, my description was stale: in the patch I settled on MAX_INT - 128 as a "defensive" attempt to stay hopefully well below the minimum value of MAX_ARRAY_LENGTH across normal JVMs ...
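
          That is, the limit becomes a fixed, VM-independent constant (value as stated in this comment):

            // 128 below Integer.MAX_VALUE: comfortably under ArrayUtil.MAX_ARRAY_LENGTH
            // on normal JVMs, yet identical on every VM.
            public static final int MAX_DOCS = Integer.MAX_VALUE - 128;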

          Shai Erera added a comment -

          oh good, I didn't read the patch before, but I see you even explain why we don't use the constant from ArrayUtil! +1

          Michael McCandless added a comment -

          New patch, I think it's ready: I resolved the nocommits, and removed
          Test2BDocs (I think its tests are folded into
          TestIndexWriterMaxDocs).

          Hoss Man added a comment -

          Mike: couple quick suggestions...

          • private static int actualMaxDocs = MAX_DOCS;

            static void setMaxDocs(int docs) {
              if (MAX_DOCS < docs) {
                throw new IllegalArgumentException("max docs must not exceed MAX_DOCS");
              }
              actualMaxDocs = docs;
            }

            ...that way some poor bastard who sees it in the code and tries to be crafty and add a stub class to that package to set it to Integer.MAX_VALUE will get an immediate error instead of a timebomb.

          • add a public method to the test-framework that wraps this package-private setter, so that tests in other packages besides org.apache.lucene.index can mutate this (a sketch follows this list).
            • then we can add tests for clean behavior in Solr as well (not to mention anybody else who writes a Lucene app and wants to test how their app behaves when the index gets too big, w/o adding an org/apache/lucene/index dir to their test source)
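
          A sketch of such a wrapper, with hypothetical names (it has to live in the org.apache.lucene.index package to reach the package-private setter, but would ship in the test-framework jar):

            package org.apache.lucene.index;

            // Hypothetical helper: public, so tests in any package can lower the limit.
            public final class TestIndexWriterMaxDocsHook {
              private TestIndexWriterMaxDocsHook() {}

              public static void setMaxDocs(int limit) {
                IndexWriter.setMaxDocs(limit);           // rejects limit > IndexWriter.MAX_DOCS
              }

              public static void restoreMaxDocs() {
                IndexWriter.setMaxDocs(IndexWriter.MAX_DOCS);
              }
            }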
          Michael McCandless added a comment -

          Thanks Hoss, those are good ideas ... I'll fold them in.

          Erick Erickson added a comment -

          Heck, as far as I'm concerned, tripping an exception when we're 1M below the limit is fine.

          At those numbers, a million between friends is nothing...

          Michael McCandless added a comment -

          New patch w/ Hoss's feedback ... I think it's ready.

          Michael McCandless added a comment -

          New patch: I just removed the custom exception and now throw IllegalStateException. I'll commit soon...
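
          From an application's point of view the enforced limit then looks like this sketch (assuming Lucene 4.10; in practice you would only reach it after ~2.1 billion adds, or in a test that lowers the limit via the hook discussed above):

            import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
            import org.apache.lucene.document.Document;
            import org.apache.lucene.index.IndexWriter;
            import org.apache.lucene.index.IndexWriterConfig;
            import org.apache.lucene.store.RAMDirectory;
            import org.apache.lucene.util.Version;

            public class MaxDocsDemo {
              public static void main(String[] args) throws Exception {
                IndexWriterConfig iwc = new IndexWriterConfig(
                    Version.LUCENE_4_10_0, new WhitespaceAnalyzer(Version.LUCENE_4_10_0));
                try (IndexWriter w = new IndexWriter(new RAMDirectory(), iwc)) {
                  while (true) {
                    w.addDocument(new Document());       // eventually trips the limit
                  }
                } catch (IllegalStateException e) {
                  // message wording approximate: "number of documents in the index cannot exceed ..."
                  System.out.println("index is full: " + e.getMessage());
                }
              }
            }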

          ASF subversion and git services added a comment -

          Commit 1614402 from Michael McCandless in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1614402 ]

          LUCENE-5843: IndexWriter now enforces max docs in one index

          ASF subversion and git services added a comment -

          Commit 1614422 from Michael McCandless in branch 'dev/trunk'
          [ https://svn.apache.org/r1614422 ]

          LUCENE-5843: this is a bug not a feature!

          ASF subversion and git services added a comment -

          Commit 1614423 from Michael McCandless in branch 'dev/trunk'
          [ https://svn.apache.org/r1614423 ]

          LUCENE-5843: woops, I committed too much

          ASF subversion and git services added a comment -

          Commit 1614424 from Michael McCandless in branch 'dev/trunk'
          [ https://svn.apache.org/r1614424 ]

          LUCENE-5843: this is a bug not a feature!

          ASF subversion and git services added a comment -

          Commit 1614425 from Michael McCandless in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1614425 ]

          LUCENE-5843: this is a bug not a feature!

          Show
          ASF subversion and git services added a comment - Commit 1614425 from Michael McCandless in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1614425 ] LUCENE-5843 : this is a bug not a feature!
          Hide
          Michael McCandless added a comment -

          Reopen for backport to 4.9.1...

          ASF subversion and git services added a comment -

          Commit 1625415 from Michael McCandless in branch 'dev/branches/lucene_solr_4_9'
          [ https://svn.apache.org/r1625415 ]

          LUCENE-5843: backport to 4.9.x

          Michael McCandless added a comment -

          Bulk close for Lucene/Solr 4.9.1 release


            People

            • Assignee: Michael McCandless
            • Reporter: Michael McCandless
            • Votes: 0
            • Watchers: 4
