Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.1
    • Component/s: core/index
    • Labels:
      None
    • Environment:

      Operating System: Linux
      Platform: PC

      Description

      Hello,
      I cannot figure out why (or what) is happening all the time. I get an
      exception:
      java.lang.IllegalStateException: docs out of order
      at
      org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:219)
      at
      org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:191)
      at
      org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:172)
      at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:135)
      at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:88)
      at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:341)
      at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:250)
      at Optimize.main(Optimize.java:29)

      It happens in both 1.2 and 1.3rc1 (incidentally, what happened to 1.3rc1? I can find
      it neither in the downloads nor in the version list in this form). Everything else
      seems OK: I can search the index, but I cannot optimize it. Even worse, after this
      exception, every time I add new documents and close the IndexWriter a new segment is
      created! I think it contains all of the previously added documents, judging by its size.

      My index is quite big: 500,000 docs, about 5 GB in the index directory.

      It is repeatable: I drop the index and reindex everything. Afterwards I add a few
      docs, try to optimize, and receive the above exception.

      My documents' structure is:
      static Document indexIt(String id_strony, Reader reader, String data_wydania,
                              String id_wydania, String id_gazety, String data_wstawienia)
      {
          Document doc = new Document();

          doc.add(Field.Keyword("id", id_strony));
          doc.add(Field.Keyword("data_wydania", data_wydania));
          doc.add(Field.Keyword("id_wydania", id_wydania));
          doc.add(Field.Text("id_gazety", id_gazety));
          doc.add(Field.Keyword("data_wstawienia", data_wstawienia));
          doc.add(Field.Text("tresc", reader));

          return doc;
      }
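
      For context, here is a minimal sketch of the reported flow (this is an editor's
      illustration, not the reporter's actual code: the index path and field values are
      placeholders, the class is named Optimize only to echo the stack trace above, and the
      indexIt method from this report is assumed to live in the same class; Lucene 1.x API):

      import java.io.IOException;
      import java.io.StringReader;
      import org.apache.lucene.analysis.standard.StandardAnalyzer;
      import org.apache.lucene.index.IndexWriter;

      public class Optimize {
          public static void main(String[] args) throws IOException {
              // add a few documents to the existing index (create == false)
              IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
              writer.addDocument(indexIt("1", new StringReader("page text"),
                                         "2003-11-01", "7", "3", "2003-11-02"));
              writer.close();

              // optimizing the same index then fails
              IndexWriter optimizer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
              optimizer.optimize();   // throws java.lang.IllegalStateException: docs out of order
              optimizer.close();
          }
      }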

      Sincerely,
      legez

      1. ASF.LICENSE.NOT.GRANTED--bug23650.txt
        2 kB
        Arvind Srinivasan
      2. corrupted.part1.rar
        9.00 MB
        Jarrod Cuzens
      3. corrupted.part2.rar
        3.35 MB
        Jarrod Cuzens
      4. LUCENE-140-2007-01-09-instrumentation.patch
        6 kB
        Michael McCandless
      5. indexing-failure.log
        2.75 MB
        Jed Wesley-Smith

        Activity

        Otis Gospodnetic added a comment -

        Since the bug submitter hasn't followed up on this 'repeatable' issue in almost
        two months, I shall assume that this was not a bug in Lucene, but a misuse of
        Lucene.

        There is also no self-contained code that demonstrates the bug.

        If this is indeed a Lucene bug, please re-open this bug entry and provide a
        self-contained unit test that demonstrates the problem.

        Arvind Srinivasan added a comment -

        More Data Integrity Issues: Docs out of Order

        Hi,
        I'm seeing an issue similar to the one reported in:
        http://nagoya.apache.org/bugzilla/show_bug.cgi?id=23650
        On examining the segments, the following inconsistencies were found:
        (a) The merging segments had doc numbers greater than maxDoc. I don't know
        how the index got into this state, but it occurs using standard Lucene
        code.
        (b) Strangely, some documents had terms with zero frequency, and when that
        occurred, the zero-frequency term had several postings as (docid 0).
        Example (docid freq) – MaxDoc = 7749 - NO DELETION.
        Merging msgBody; text=it; sz=2 — the field name is msgBody, the term is "it",
        and two segments have the term.
        (0 0)(0 0)(0 0)..........(0 0)(4 6)(5 3)(6 1)(9 1)(10 2)(12 1)......
        ...(6791 2)(6794 3)(6796 2)(6798 16)(6801 1)(6805 1)(6806 5)
        (6808 1)(6810 1)(6815 2)(6816 3)(6817 1)(6818 1)(6821 4)(6822 1)
        (6824 3)(6826 4)(6828 1)(6829 3)(12549 2)doc exceeds count
        749(13570 1)doc exceeds count 7749(14896 1)doc exceeds count 7749
        (15028 1)doc exceeds count 7749(15357 1)doc exceeds count 7749
        (15427 1)doc exceeds count 7749(15534 1)doc exceeds count 7749
        (15535 1)doc exceeds count 7749(15653 1)doc exceeds count 7749
        (16530 1)doc exceeds count 7749(17108 1).......
        (c) The zero frequency was also not limited to document 0; there was
        another instance.

        One workaround that seemed to resolve the issue was to keep maxDoc as a
        member variable in SegmentMergeInfo and ignore (or throw an exception) when
        an inconsistent state is detected.

        ****ADD to SegmentMerger just before the "docs out of order" check.
        if (postings.freq() == 0) {
            continue;
        }
        if (doc >= smi.maxDoc) {
            //sbLog.append("doc exceeds count \r\n " + smi.maxDoc);
            continue;
        }
        ****

        At least putting in a check would not corrupt the segments and would get us
        closer to the real problem of why freq=0 and the doc number exceeds maxDoc.
        Note, the code has had the fix for the other segment corruption issue that I
        previously reported (namely, using a zero-length file).
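
        To make the intent of those guards concrete, here is a small self-contained toy
        (an editor's illustration, not the actual SegmentMerger code) that replays an
        abbreviated slice of the posting dump above through the two suggested guards and
        the existing ordering check:

        public class PostingsGuardDemo {
            public static void main(String[] args) {
                int maxDoc = 7749;                                   // from the dump above
                int[][] postings = {                                 // abbreviated (doc, freq) pairs
                    {0, 0}, {4, 6}, {5, 3}, {6829, 3}, {12549, 2}, {13570, 1}
                };
                int lastDoc = -1;
                for (int[] p : postings) {
                    int doc = p[0], freq = p[1];
                    if (freq == 0)                                   // suggested guard: skip zero-frequency postings
                        continue;
                    if (doc >= maxDoc) {                             // suggested guard: skip ids beyond maxDoc
                        System.out.println("doc exceeds count " + maxDoc + ": " + doc);
                        continue;
                    }
                    if (doc <= lastDoc)                              // the existing "docs out of order" check
                        throw new IllegalStateException("docs out of order (" + doc + " <= " + lastDoc + ")");
                    lastDoc = doc;
                }
            }
        }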

        Arvind Srinivasan added a comment -

        Created an attachment (id=15405)
        Analysis of corrupted segments and suggestions.

        Otis Gospodnetic added a comment -

        Arvind, thanks for following up on this issue. From your report I can see that
        the index really ends up containing invalid data, but I've never even seen this
        happen myself. Could you please put together sample code that gets the index in
        this state?

        Yonik Seeley added a comment -

        I've never seen this... can anyone reproduce with Lucene 1.9?
        CCing this to Arvind's email...

        Otis Gospodnetic added a comment -

        2 years later, I still haven't seen this error.

        Jarrod Cuzens added a comment -

        First, I think that Lucene is great and it does an incredible job. However, this issue is also causing us significant problems. We originally had an AOP interceptor that would update documents in our HTTP threads; however, when we started to see this issue we were concerned that it was caused by multiple threads accessing the index at the same time. We put extra concurrency controls on the updates using the LuceneIndexAccess API that was posted to Bugzilla by another user. The issue still remained after we added the extra concurrency control.

        Since then we have abandoned the AOP approach completely and moved the index rebuild to a scheduled thread, which collects things that were modified from the database (using datestamp versioning) and then updates their Lucene indexes. We hoped this would solve the problem because only one thread in one process would be modifying the index at any given time. Alas, we are still getting docs out of order exceptions. It is difficult for us to reproduce as it mainly happens in production, and we cannot provide a test case for it (I wish we could!).

        I know that this must be a difficult issue because it is difficult to reproduce. I can't think of anything strange that we are doing with the indexes (one process, one thread modifying). This probably isn't much help, but I just wanted to let you know that we are also experiencing the problem.

        Doug Cutting added a comment -

        File corruption could cause this. Please look in your system logs to see if there are any reports of problems accessing the drive that stores this index.

        Jarrod Cuzens added a comment -

        Hi Doug,

        Thanks for your suggestion (and great work on Lucene!). I looked in the logs and could not find any indication of corruption. In addition, we actually have the system running in a cluster where each node has its own Lucene search index. We had the issue on a different machine before we moved to the cluster, and we now see it on both machines in our cluster. Next time it occurs I will get the indexes and try to post them here if you think that would be helpful.

        Just some additional info:
        1) We are running SUSE 10 Linux.
        2) We run two Quartz jobs: one runs every 2 minutes and updates Lucene with changes from the db if necessary; the other runs at 3:30 AM and does full index rebuilds. We use the LuceneIndexAccess API when working with the IndexReader/IndexWriter. Only one thread should modify the index at any time; although it is possible that the full rebuild job and the update job execute concurrently, they shouldn't both modify the index due to the LuceneIndexAccess API's synchronization.
        3) Currently, when doing searches, we create a new IndexSearcher every time a search is to be performed; we do not use the LuceneIndexAccess API.

        When I refer to the LuceneIndexAccess API I am referring to the contribution by Mark Schreiber:
        http://www.blizzy.de/lucene/lucene-indexaccess-0.1.0.zip

        Thanks for any help!

        Rafal Majewski added a comment -

        Hi,

        I have recently experienced the same problem on release 1.4.3. It happened in production (more than once). A filesystem index directory is used. One application is responsible for indexing, another for searching the index (it also uses an IndexReader for query rewriting). Access to the index directory (which is a singleton) is synchronized within each application's scope. These two applications create separate directory instances and access it independently. Unfortunately, the code doesn't lead to repeatable occurrences of this exception. I assume that the two applications having unsynchronized access to the index might cause the problem, but I have received information that it also occurred with the searching application being down.

        Thanks in advance for any help.

        Jarrod Cuzens added a comment -

        I am posting our corrupted index (I have to do it in two parts because it is 14.5M). I looked at it in Luke, but Luke doesn't really have any tools to help diagnose corruption issues. At this point we are considering changing our system to do builds on one machine, test them, and then distribute them to the other machines in our cluster.

        If anybody could look at this it would be greatly appreciated!
        Thanks!
        Jarrod

        Jarrod Cuzens added a comment -

        Second part.

        Ales Justin added a comment -

        We have used Lucene 1.4.3 for half a year now,
        but all of a sudden this issue appeared.

        We have synchronized access to the index - a synchronized singleton handling all write/read operations.

        Is there any progress on finding the cause?

        Jason Lambert added a comment -

        I was having this problem intermittently while indexing over multiple threads and I have found that the following steps can cause this error (with Lucene 1.3 and 1.4.x):

        • Open an IndexReader (#1) over an existing index (this reader is used for searching while updating the index)
        • Using this reader (#1) do a search for the document(s) that you would like to update; obtain their document ID numbers
        • Create an IndexWriter and add several new documents to the index (for me, this writing is done in other threads)
        • Close the IndexWriter
        • Open another IndexReader (#2) over the index
        • Delete the previously found documents by their document ID numbers using reader #2
        • Close the #2 reader
        • Create another IndexWriter (#2) and re-add the updated documents
        • Close the IndexWriter #2
        • Close the original IndexReader (#1) and open a new reader for general searching

        If I ensure that the steps marked with an asterisk do not happen (i.e., by using thread synchronization), I never get this error. Otherwise, it will happen intermittently while closing the second IndexWriter (#2) (sometimes I get an ArrayIndexOutOfBoundsException during the deletion). These 'extra' writes cause the initial 'segments' file to be updated, after which it is re-read while opening the second IndexReader (#2).

        This can end up deleting documents using possibly non-existent IDs, most likely causing the index corruption that this error signals.
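
        A condensed sketch of that interleaving (an editor's illustration using the Lucene 1.9-era API; the directory, analyzer, query, and documents are placeholders, and the writer steps would normally run in other threads):

        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.search.Hits;
        import org.apache.lucene.search.IndexSearcher;
        import org.apache.lucene.search.Query;
        import org.apache.lucene.store.Directory;

        class StaleIdUpdateSketch {
            static void updateSequence(Directory dir, Analyzer analyzer, Query query,
                                       Document newDoc, Document updatedDoc) throws Exception {
                IndexReader reader1 = IndexReader.open(dir);          // reader #1, kept open for searching
                IndexSearcher searcher = new IndexSearcher(reader1);
                Hits hits = searcher.search(query);
                int staleId = hits.id(0);                             // doc number valid only for reader #1's view

                IndexWriter writer1 = new IndexWriter(dir, analyzer, false);
                writer1.addDocument(newDoc);                          // new documents added concurrently
                writer1.close();                                      // merges may renumber documents here

                IndexReader reader2 = IndexReader.open(dir);          // reader #2 sees the new segments
                reader2.deleteDocument(staleId);                      // stale id -- may hit the wrong doc or be out of range
                reader2.close();

                IndexWriter writer2 = new IndexWriter(dir, analyzer, false);
                writer2.addDocument(updatedDoc);                      // re-add the updated document
                writer2.close();

                reader1.close();                                      // only now release the original reader
            }
        }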

        Jed Wesley-Smith added a comment -

        We have seen this one as well. We don't have the same usage as above; we only ever delete documents with IndexReader.deleteDocuments(Term).

        We are using Lucene 1.9.1

        It occurs in two places, inside IndexWriter.addDocument():

        java.lang.IllegalStateException: docs out of order
        at org.apache.lucene.index.SegmentMerger.appendPostings([Lorg/apache/lucene/index/SegmentMergeInfo;I)I(Optimized Method)
        at org.apache.lucene.index.SegmentMerger.mergeTermInfo([Lorg/apache/lucene/index/SegmentMergeInfo;I)V(Optimized Method)
        at org.apache.lucene.index.SegmentMerger.mergeTermInfos()V(Optimized Method)
        at org.apache.lucene.index.SegmentMerger.mergeTerms()V(Optimized Method)
        at org.apache.lucene.index.SegmentMerger.merge()I(Optimized Method)
        at org.apache.lucene.index.IndexWriter.mergeSegments(II)V(IndexWriter.java:681)
        at org.apache.lucene.index.IndexWriter.mergeSegments(I)V(IndexWriter.java:658)
        at org.apache.lucene.index.IndexWriter.maybeMergeSegments()V(IndexWriter.java:646)
        at org.apache.lucene.index.IndexWriter.addDocument(Lorg/apache/lucene/document/Document;Lorg/apache/lucene/analysis/Analyzer;)V(IndexWriter.java:453)
        at org.apache.lucene.index.IndexWriter.addDocument(Lorg/apache/lucene/document/Document;)V(IndexWriter.java:436)

        and inside IndexWriter.close():

        java.lang.IllegalStateException: docs out of order
        at org.apache.lucene.index.SegmentMerger.appendPostings([Lorg/apache/lucene/index/SegmentMergeInfo;I)I(Optimized Method)
        at org.apache.lucene.index.SegmentMerger.mergeTermInfo([Lorg/apache/lucene/index/SegmentMergeInfo;I)V(Optimized Method)
        at org.apache.lucene.index.SegmentMerger.mergeTermInfos()V(Optimized Method)
        at org.apache.lucene.index.SegmentMerger.mergeTerms()V(Optimized Method)
        at org.apache.lucene.index.SegmentMerger.merge()I(Optimized Method)
        at org.apache.lucene.index.IndexWriter.mergeSegments(II)V(IndexWriter.java:681)
        at org.apache.lucene.index.IndexWriter.mergeSegments(I)V(IndexWriter.java:658)
        at org.apache.lucene.index.IndexWriter.flushRamSegments()V(IndexWriter.java:628)
        at org.apache.lucene.index.IndexWriter.close()V(IndexWriter.java:375)

        The second one exposes a problem in the close() method: the index write.lock is not released when exceptions are thrown in close(), causing subsequent attempts to open an IndexWriter to fail.
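
        As an aside, one possible mitigation for the stuck lock (an editor's sketch under the assumption that clearing the stale lock is acceptable for the application, not an official fix) is to force-release the write.lock if close() throws, so that a later IndexWriter can still be opened:

        import java.io.IOException;
        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.store.Directory;

        class SafeClose {
            // Close the writer; if close() fails mid-merge, clear the leftover write.lock
            // on the same directory so that a subsequent IndexWriter can be opened.
            static void close(IndexWriter writer, Directory directory) throws IOException {
                try {
                    writer.close();
                } catch (IllegalStateException e) {
                    if (IndexReader.isLocked(directory)) {
                        IndexReader.unlock(directory);
                    }
                    throw e;
                }
            }
        }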

        Michael McCandless added a comment -

        I just resolved LUCENE-603 as a dup of this issue.

        It would be awesome if we could get a test case that shows this happening. Enough people seem to hit it that it seems likely something is lurking out there so I'd like to get it fixed!!

        Jed Wesley-Smith added a comment -

        We have now seen this in a number of customer sites since upgrading JIRA to use Lucene 1.9.1. The JIRA report is here: http://jira.atlassian.com/browse/JRA-11861

        We only seem to have seen it since the upgrade from 1.4.3 to 1.9.1; we hadn't seen it before then.

        This is now a major issue for us; it is hitting a number of our customers. I am trying to generate a repeatable test for it as a matter of urgency.

        As a follow-up, we sometimes see the old ArrayIndexOutOfBoundsException in BitVector.get() (BitVector.java:63).

        I will post more if I find something worth sharing.

        Jed Wesley-Smith added a comment -

        We also see an ArrayIndexOutOfBoundsException in the SegmentReader.isDeleted() method:

        java.lang.ArrayIndexOutOfBoundsException
        at org.apache.lucene.index.SegmentReader.isDeleted(I)Z(Optimized Method)
        at org.apache.lucene.index.SegmentMerger.mergeFields()I(Optimized Method)
        at org.apache.lucene.index.SegmentMerger.merge()I(Optimized Method)
        at org.apache.lucene.index.IndexWriter.mergeSegments(II)V(IndexWriter.java:681)

        Michael McCandless added a comment -

        OK: I finally found one way that this corruption can occur! I will
        create a unit test & commit a fix.

        If you delete by document number, and that document number is larger
        than maxDoc, but only by a little, then the call to
        deletedDocs.set(num) may in fact succeed (i.e., no exception), but will
        have set bits that are "out of bounds" in the BitVector's bits array.

        This is because the bits array is an array of bytes and so you can
        have up to 7 of these unused bits at the end. Once this has happened,
        any attempt to merge this segment will hit the "docs out of order"
        exception because the BitVector's count() method will count these
        "illegally set" bits and thus make the SegmentMerger think too many
        docs are deleted.

        Unfortunately, this case only occurs if you use deleteDocument(int),
        so I can't yet explain how this happens when using only
        deleteDocument(Term).
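
        To illustrate the mechanics (an editor's toy example, not Lucene's actual BitVector source): with a byte-backed bit set and no bounds check in set(), a doc number a few past maxDoc lands in the spare bits of the last byte, so no exception is thrown, yet count() later reports an extra deleted doc.

        public class ByteBackedBitsDemo {
            private final byte[] bits;

            ByteBackedBitsDemo(int size) {
                bits = new byte[(size >> 3) + 1];    // up to 7 unused bits in the last byte
            }

            void set(int bit) {                      // no bounds check, like the pre-fix behavior
                bits[bit >> 3] |= 1 << (bit & 7);
            }

            int count() {                            // counts every set bit, spare bits included
                int c = 0;
                for (byte b : bits)
                    c += Integer.bitCount(b & 0xFF);
                return c;
            }

            public static void main(String[] args) {
                ByteBackedBitsDemo deleted = new ByteBackedBitsDemo(10);  // pretend maxDoc == 10
                deleted.set(12);                     // "out of bounds" but inside the last byte: no exception
                System.out.println(deleted.count()); // prints 1 -- a merger would think a doc was deleted
            }
        }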

        Michael McCandless added a comment -

        I've committed a fix for this one case to the trunk.

        I'm leaving the issue open so folks above can try the fix and confirm
        whether or not this fixes their cases.

        Jed (or any other folks who have hit this above and are still
        listening!), the fix is really trivial and would be easy to back
        port to prior releases: just run "svn diff -r 494135:494136" from
        a Lucene checkout to see them.

        If you are willing/able to try this in one of the environments where
        you keep hitting this issue, that would be awesome: if this is in fact
        your root cause, then you would see an ArrayIndexOutOfBoundsException
        at the point that the delete of a too-large docNum occurred (rather
        than silent corruption and the above exception much later that you now
        see); and if it's not your root cause after testing the fix, then we
        would know for sure to look for another cause here.

        Are you sure that you only ever do IndexReader.deleteDocuments(Term)
        and not deleteDocuments(int docNum)? I still can't explain how this
        error could happen without using that second method.

        Doron Cohen added a comment -

        Amazed by this long-lasting bug report, I was going down similar routes to Mike, and I noticed 3 things -

        (1) the sequence of ops brought by Jason is wrong:
        a Open an IndexReader (#1) over an existing index (this reader is used for searching while updating the index)
        b Using this reader (#1) do a search for the document(s) that you would like to update; obtain their document ID numbers
        c Create an IndexWriter and add several new documents to the index (for me, this writing is done in other threads)
        d Close the IndexWriter
        e Open another IndexReader (#2) over the index
        f Delete the previously found documents by their document ID numbers using reader #2
        g Close the #2 reader
        h Create another IndexWriter (#2) and re-add the updated documents
        i Close the IndexWriter #2
        j Close the original IndexReader (#1) and open a new reader for general searching

        The problem here is that the docIDs found in (b) may be altered in step (d), and so step (f) would delete the wrong docs. In particular, it might attempt to delete ids that are out of range. This might expose exactly the BitVector problem, and would explain the whole thing, but I too cannot see how it explains the delete-by-term case.

        (2) BitVector silently ignoring attempts to delete slightly-out-of-bounds docs that fall in the last byte - this is the problem that Mike fixed. I think the fix is okay - though some applications might now get exceptions they did not get in the past - but I believe this is for their own good.
        However, when I first ran into this I didn't notice that BitVector.size() would become wrong as a result of this - nice catch Mike!

        I think, however, that the test Mike added does not expose the docs out of order bug - I tried the test without the fix and it only fails on the "gotException" assert - if you comment out this assert, the test passes.

        The following test would expose the out-of-order bug - it fails with "docs out of order" before the fix and passes with it.

        public void testOutOfOrder() throws IOException {
            String tempDir = System.getProperty("java.io.tmpdir");
            if (tempDir == null) {
                throw new IOException("java.io.tmpdir undefined, cannot run test: " + getName());
            }

            File indexDir = new File(tempDir, "lucenetestindexTemp");
            Directory dir = FSDirectory.getDirectory(indexDir, true);

            boolean create = true;
            int numDocs = 0;
            int maxDoc = 0;
            while (numDocs < 100) {
                IndexWriter iw = new IndexWriter(dir, anlzr, create);   // anlzr: the test class's Analyzer
                create = false;
                iw.setUseCompoundFile(false);
                for (int i = 0; i < 2; i++) {
                    Document d = new Document();
                    d.add(new Field("body", "body" + i, Store.NO, Index.UN_TOKENIZED));
                    iw.addDocument(d);
                }
                iw.optimize();
                iw.close();
                IndexReader ir = IndexReader.open(dir);
                numDocs = ir.numDocs();
                maxDoc = ir.maxDoc();
                assertEquals(numDocs, maxDoc);
                // attempt deletes slightly past maxDoc; before the fix these silently set
                // out-of-bounds bits instead of throwing
                for (int i = 7; i >= 1; i--) {
                    try {
                        ir.deleteDocument(maxDoc + i);
                    } catch (ArrayIndexOutOfBoundsException e) {
                    }
                }
                ir.close();
            }
        }

        Mike, do you agree?

        (3) The maxDoc() computation in SegmentReader is based (on some paths) on RandomAccessFile.length(). IIRC, I saw cases (in a previous project) where File.length() or RAF.length() (not sure which of the two) did not always reflect the real length if the system was very busy IO-wise, unless FD.sync() was called (with a performance hit).

        This post seems relevant - RAF.length over 2GB in NFS - http://forum.java.sun.com/thread.jspa?threadID=708670&messageID=4103657

        Not sure if this can be the case here but at least we can discuss whether it is better to always store the length.

        Jed Wesley-Smith added a comment -

        Hi Michael,

        This is awesome, I have prepared a patched 1.9.1: http://jira.atlassian.com/secure/attachment/19390/lucene-core-1.9.1-atlassian-patched-2007-01-09.jar

        Unfortunately we don't have a repeatable test for this so we will have to distribute to afflicted customers and - well, pray I guess. We have been seeing this sporadically in our main JIRA instance http://jira.atlassian.com so we will hopefully not observe it now.

        We do only use the deleteDocuments(Term) method, so we are not sure whether this will truly fix our problem, but we note that that method calls deleteDocument(int) based on the TermDocs returned for the Term - and maybe they can be incorrect???

        Out of interest, apart from changing from 1.4.3 to 1.9.1, in the JIRA 3.7 release we changed our default merge factor to 4 from 10. We hadn't seen this problem before, and suddenly we have had a reasonable number of occurrences.

        Jed Wesley-Smith added a comment -

        Alas, this doesn't appear to be the problem. We are still getting it, but we do at least have a little more info. We added the doc and lastDoc to the IllegalArgEx and we are getting very strange numbers:

        java.lang.IllegalStateException: docs out of order (-1764 < 0)
        at org.apache.lucene.index.SegmentMerger.appendPostings([Lorg/apache/lucene/index/SegmentMergeInfo;I)I(SegmentMerger.java:335)
        at org.apache.lucene.index.SegmentMerger.mergeTermInfo([Lorg/apache/lucene/index/SegmentMergeInfo;I)V(SegmentMerger.java:298)
        at org.apache.lucene.index.SegmentMerger.mergeTermInfos()V(SegmentMerger.java:272)
        at org.apache.lucene.index.SegmentMerger.mergeTerms()V(SegmentMerger.java:236)
        at org.apache.lucene.index.SegmentMerger.merge()I(SegmentMerger.java:89)
        at org.apache.lucene.index.IndexWriter.mergeSegments(II)V(IndexWriter.java:681)
        at org.apache.lucene.index.IndexWriter.mergeSegments(I)V(IndexWriter.java:658)
        at org.apache.lucene.index.IndexWriter.maybeMergeSegments()V(IndexWriter.java:646)
        at org.apache.lucene.index.IndexWriter.addDocument(Lorg/apache/lucene/document/Document;Lorg/apache/lucene/analysis/Analyzer;)V(IndexWriter.java:453)
        at org.apache.lucene.index.IndexWriter.addDocument(Lorg/apache/lucene/document/Document;)V(IndexWriter.java:436)

        where doc = -1764 and lastDoc is zero

        Michael McCandless added a comment -

        Jed, thanks for testing the fix!

        > Alas, this doesn't appear to be the problem. We are still getting
        > it, but we do at least have a little more info. We added the doc and
        > lastDoc to the IllegalArgEx and we are getting very strange numbers:
        >
        > java.lang.IllegalStateException: docs out of order (-1764 < 0)
        > ...
        >
        > where doc = -1764 and lastDoc is zero

        OK, so we've definitely got something else at play here (bummer!). That
        (negative doc number) is very strange. I will keep looking at this. I
        will prepare a patch on 1.9.1 that adds some more instrumentation so
        we can get more details when you hit this ...

        > We do only use the deleteDocuments(Term) method, so we are not sure
        > whether this will truly fix our problem, but we note that that
        > method calls deleteDocument(int) based on the TermDocs returned for
        > the Term - and maybe they can be incorrect???

        Right, but I had thought the docNum's coming in from this path would
        be correct. It sounds like we have another issue at play here that
        can somehow get even these doc numbers messed up.

        > Out of interest, apart from changing from 1.4.3 to 1.9.1, in the
        > JIRA 3.7 release we changed our default merge factor to 4 from
        > 10. We hadn't seen this problem before, and suddenly we have had a
        > reasonable number of occurrences.

        Interesting. Maybe try changing back to 10 and see if it suppresses
        the bug? Might give us [desperately needed] more data to cling to
        here! On the 1.4.3 -> 1.9.1 change, some of the cases above were even
        pre-1.4.x (though they could have been from yet another root cause or
        maybe filesystem) so it's hard to draw hard conclusions on this
        front.

        Michael McCandless added a comment -

        Doron,

        > (1) the sequence of ops brought by Jason is wrong:
        > ...
        >
        > Problem here is that the docIDs found in (b) may be altered in step
        > (d) and so step (f) would delete the wrong docs. In particular, it
        > might attempt to delete ids that are out of the range. This might
        > expose exactly the BitVector problem, and would explain the whole
        > thing, but I too cannot see how it explains the delete-by-term case.

        Right, the case I fixed only happens when Lucene's
        deleteDocument(int docNum) is [slightly] misused, i.e. if you are
        "playing by the rules" you would never have hit this bug. And this
        particular use case is indeed incorrect: doc numbers are only valid for
        the one reader that you got them from.

        > I think however that the test Mike added does not expose the docs
        > out of order bug - I tried this test without the fix and it only
        > fail on the "gotException assert" - if you comment this assert the
        > test pass.

        Huh, I see my test case (in IndexReader) indeed hitting the original
        "docs out of order" exception. If I take the current trunk and
        comment out the (one line) bounds check in BitVector.set and run that
        test, it hits the "docs out of order" exception.

        Are you sure you updated the change (to tighten the check to a <= from
        a <) to index/SegmentMerger.java? Because, I did indeed find that the
        test failed to fail when I first wrote it (but should have). So in
        digging why it didn't fail as expected, I found that the check for
        "docs out of order" missed the boundary case of the same doc number
        twice in a row. Once I fixed that, the test failed as expected.

        > (3) maxDoc() computation in SegmentReader is based (on some paths)
        > in RandomAccessFile.length(). IIRC I saw cases (in previous project)
        > where File.length() or RAF.length() (not sure which of the two) did
        > not always reflect real length, if the system was very busy IO wise,
        > unless FD.sync() was called (with performance hit).

        Yes I saw this too. From the follow-on discussion it sounds like we
        haven't found a specific known JVM bug here. Still, it does make me
        nervous that we rely on file length to derive maxDoc.

        In general I think we should rely on as little as possible from the
        file system (there are so many cross platform issues/differences) and
        instead explicitly store things like maxDoc into the index. I will
        open a separate Jira issue to track this. Also I will record this
        path in the instrumentation patch for 1.9.1 just to see if we are
        actually hitting something here (I think unlikely but possible).

        Michael McCandless added a comment -

        OK, I created LUCENE-767 for the "maxDoc should be explicitly stored in the index" issue.

        Michael McCandless added a comment -

        Jed, one question: when you tested the fix, you fully rebuilt your
        index from scratch, right? Just want to verify that. You have to
        re-index because once the index is corrupted it will eventually hit
        the "docs out of order" exception even if you fix the original cause.

        OK I've prepared a patch off 1.9.1 (just attached it). The patch
        passes all unit tests on 1.9.1.

        It has the changes I committed to the trunk yesterday, plus
        instrumentation (messages printed to a PrintStream) to catch places
        where doc numbers are not correct.

        All messages I added print to a newly added infoStream static member
        of SegmentMerger. You can do SegmentMerger.setInfoStream(...) to
        change it (it defaults to System.err).

        Jed if you could get the error to re-occur with this patch and then
        post the resulting messages, that would be great. Hopefully it gives
        us enough information to find the source here or at least to have
        another iteration with yet more instrumentation. Thanks!
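
        For reference, here is a minimal sketch of redirecting that output to a
        file (this assumes the patched SegmentMerger from the attached
        instrumentation patch; the class name and log-file name are just
        placeholders):

            import java.io.FileOutputStream;
            import java.io.PrintStream;
            import org.apache.lucene.index.SegmentMerger;

            public class EnableMergeLogging {
              public static void main(String[] args) throws Exception {
                // With the instrumentation patch applied, SegmentMerger prints its
                // diagnostics to an infoStream (System.err by default); send it to a file.
                PrintStream log = new PrintStream(new FileOutputStream("merge-debug.log"));
                SegmentMerger.setInfoStream(log);
                // ... now run the usual indexing / optimize code ...
              }
            }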

        Jed Wesley-Smith added a comment -

        Hi Michael,

        Thanks for the patch, applied and recreated. Attached is the log.

        To be explicit, we are recreating the index via the IndexWriter ctor with the create flag set and then completely rebuilding the index. We are not completely deleting the entire directory. There ARE old index files (_.cfs & _.del) in the directory with updated timestamps that are months old. If I completely recreate the directory the problem does go away. This is a fairly trivial "fix", but we are still investigating as we want to know if this is indeed the problem, how we have come to make it prevalent, and what the root cause is.

        Thanks for all the help everyone.

        Jed Wesley-Smith added a comment -

        BTW. We have looked at all the open files referenced by the VM when the indexing errors occur, and there does not seem to be any reference to the old index segment files, so I am not sure how those files are influencing this problem.

        Doron Cohen added a comment -

        Jed, is it possible that when re-creating the index, the IndexWriter is constructed with create=true but the FSDirectory is opened with create=false?
        I suspect so, because otherwise the old .del files would have been deleted.
        If so, newly created segments, which have the same names as segments from previous (bad) runs, would read the (bad) old .del files when opened.
        This would possibly expose the bug fixed by Michael.
        I may be over-speculating here, but if this is the case, it could also explain why changing the merge factor from 4 to 10 exposed the problem.

        In fact, let me speculate even further: if, when creating the index from scratch, the FSDirectory is (mistakenly) opened with create=false, then as long as you always repeated the same sequence of adding and deleting docs you were unlikely to suffer much from this mistake, because segments created with the same names as (old) .del files simply see docs as deleted before the program actually deletes them. Search behaves wrongly, not finding these docs before they are actually deleted, but no exception is thrown when adding docs. However, once the merge factor was changed from 4 to 10, the matching between old .del files and new segments (with the same names) was broken, and the out-of-order exception appeared.

        ...and if this is not the case, we would need to look for something else...
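
        To make the suspected sequence concrete, it would look roughly like this
        (index path and analyzer are placeholders; this only illustrates the
        scenario described above, not your actual code):

            import org.apache.lucene.analysis.standard.StandardAnalyzer;
            import org.apache.lucene.index.IndexWriter;
            import org.apache.lucene.store.Directory;
            import org.apache.lucene.store.FSDirectory;

            // Suspected mis-use: only the writer "creates"; the Directory is opened
            // with create=false, so old _*.del (and _*.cfs) files survive on disk.
            Directory dir = FSDirectory.getDirectory("/path/to/index", false);
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);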

        Michael McCandless added a comment -

        OK from that indexing-failure.log (thanks Jed!) I can see that indeed
        there are segments whose maxDoc() is much smaller than
        deletedDocs.count(). This then leads to negative doc numbers when
        merging these segments.

        Jed, when you say "there are old files (_.cfs & _.del) in this
        directory with updated timestamps that are months old" what do you
        mean by "with updated timestamps"? Which timestamp is months old and
        which one is updated?

        OK, assuming, Jed, that you are indeed using "create=false" when creating
        the Directory and then passing that directory to IndexWriter with
        create=true, I think we now have this case fully explained (thanks
        Doron): your old _*.del files are being incorrectly opened & re-used
        by Lucene, when they should not be.

        Lucene (all released versions but not the trunk version, see below)
        does a simple fileExists("_XXX.del") call to determine if a segment
        XXX has deletes.
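
        (Roughly, the pre-lockless check amounts to something like the following
        in SegmentReader; this is a paraphrase, not the exact source:)

            static boolean hasDeletions(SegmentInfo si) throws IOException {
              // A leftover _XXX.del from an older index also satisfies this check.
              return si.dir.fileExists(si.name + ".del");
            }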

        But when that _XXX.del is a leftover from a previous index, it very
        likely doesn't "match" the newly created _XXX segment. (Especially if
        merge factor has changed but also if order of operations has changed,
        which I would expect in this use case).

        If that file exists, Lucene assumes it's for this segment and so opens
        it and uses it. If it happens that this _XXX.del file has more
        documents in it than the newly created _XXX.cfs segment, then negative
        doc numbers will result (and then later cause the "docs out of order"
        exception). If it happens that the _XXX.del file has fewer documents
        than the newly created _XXX.cfs segment then you'll hit
        ArrayIndexOutOfBounds exceptions in calls to isDeleted(...). If they
        are exactly equal, you'd see some of your docs randomly show up as
        deleted.
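
        To put illustrative numbers on it: suppose a newly written segment has
        maxDoc() = 100 but a leftover _XXX.del from the old index marks 150
        documents as deleted. The segment then reports numDocs() = 100 - 150 =
        -50, the merger uses that negative count when computing the doc-number
        base for the following segments, and the merged postings come out with
        decreasing (or negative) doc numbers - which is exactly what the "docs
        out of order" check trips on.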

        Note that the trunk version of Lucene has already fixed this bug (as
        part of lockless commits):

        • Whether a segment has deletions or not is now explicitly stored in
          the segments file rather than relying on a "fileExists(...)" call.
          So, if an old _XXX.del existed in the filesystem, the newly
          created _XXX segment would not open it.
        • Furthermore, the trunk version of Lucene uses a new
          IndexFileDeleter class to remove any unreferenced index files.
          This means it would have removed these old _*.cfs and _*.del files
          even in the case where a directory was created with "create=false"
          and the IndexWriter was created with "create=true".

        To summarize:

        • There was one case where, if you gave slightly illegal doc numbers
          (within 7 of the actual maxDoc), Lucene could silently accept the
          call but corrupt your index, only to be seen later as a
          "docs out of order" IllegalStateException when the segment is
          merged. This was just a missing boundary case check. This case
          is now fixed in the trunk (you get an
          ArrayIndexOutOfBoundsException if doc number is too large).
        • There is also another case that only happens if you have old
          _*.del files leftover from a previous index while re-creating a
          new index.

        The workaround is simple here: always open the Directory with
        create=true (or remove the directory contents yourself beforehand).
        (IndexWriter does this if you give it a String or File
        with create=true).
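
        In code, the safe patterns look like this (index path and analyzer are
        placeholders; imports as in the earlier sketch):

            // Option 1: let IndexWriter itself wipe and re-create the index directory.
            IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);

            // Option 2: if you manage the Directory yourself, open it with create=true too,
            // so old _*.del and _*.cfs files are removed before indexing starts.
            Directory dir = FSDirectory.getDirectory("/path/to/index", true);
            IndexWriter writer2 = new IndexWriter(dir, new StandardAnalyzer(), true);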

        This is really a bug in Lucene, but given that it's already fixed
        in the trunk, and the workaround is simple, I'm inclined to not
        fix it in prior releases and instead publicize the issue (I will
        do so on java-user).

        But, I will commit two additional IllegalStateException checks to
        the trunk when a segment is first initialized: 1) check that the
        two different sources of "maxDoc" (fieldsReader.size() and
        si.docCount) are the same, and 2) check that the number of pending
        deletions does not exceed maxDoc(). When an index has an
        inconsistency, I think the earlier it's detected, the better.
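
        (A sketch of what those two checks could look like in SegmentReader's
        initialization; field names are as described above, and this is not the
        exact code that will be committed:)

            // 1) The two sources of maxDoc must agree.
            if (fieldsReader.size() != si.docCount)
              throw new IllegalStateException("doc counts differ for segment " + si.name
                  + ": fieldsReader shows " + fieldsReader.size()
                  + " but segmentInfo shows " + si.docCount);

            // 2) The deletions bit vector may not mark more docs than the segment holds.
            if (deletedDocs != null && deletedDocs.count() > maxDoc())
              throw new IllegalStateException("number of pending deletes ("
                  + deletedDocs.count() + ") exceeds max doc (" + maxDoc()
                  + ") for segment " + si.name);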

        Michael McCandless added a comment -

        > BTW. We have looked at all the open files referenced by the VM when
        > the indexing errors occur, and there does not seem to be any reference
        > to the old index segment files, so I am not sure how those files are
        > influencing this problem.

        Jed, just to answer this question: the _XXX.del files are opened very
        briefly because their contents are loaded / cached in
        memory, and then the file handle is closed. I don't think the _XXX.cfs
        files are affecting this issue (they are not opened).

        Jed Wesley-Smith added a comment -

        Michael, Doron, you guys are legends!

        Indeed, the problem is using only the IndexWriter with create=true to re-create the index. Creating a new Directory with create=true does fix the problem. The javadoc for that IndexWriter constructor is fairly explicit that it should re-create the index for you (no caveat), so I would consider this a bug, but - given that the trunk fixes it - not one that requires any action.

        Thanks guys for the prompt attention, excellent and thorough analysis.

        Michael McCandless added a comment -

        Phew! I'm glad we finally got to the bottom of this one.

        Thank you for your persistent and fast testing iterations, Jed; this
        issue has been open for far too long!

        I will send a summary email to java-user and resolve this issue,
        finally.

        Michael McCandless added a comment -

        Actually, this reminds me that, as of lockless commits, there is one
        important tradeoff on which "create=true" to use (the case on Windows
        where you want to re-create the index but readers are currently using
        it). I will call out this difference in the javadocs.

        Although, why do we even have a "create" parameter on the Directory?
        I think it's confusing (and dangerous, pre-trunk, due to this issue)
        to have two ways of doing the same thing.

        Logically, I don't think a Directory should take the responsibility of
        deleting old files (including old lock files). It should provide a clean
        interface for doing so, but I think IndexWriter alone should be
        the class that deletes files from the directory.

        With lockless commits this has become an important difference, ie, the
        new IndexFileDeleter class (used by IndexWriter) handles re-trying
        files that are in-use (on Windows) whereas FSDirectory will throw an
        exception if create=true and index files are in use.

        I think we should deprecate the "create" argument to
        FSDirectory.getDirectory and leave only the create argument in
        IndexWriter's constructors. Am I missing something? Is there a
        reason not to do this?

        Michael McCandless added a comment -

        Resolving this now, finally (I'll move the two "create" arguments
        discussion to a separate issue if we decide to go forward with that):

        • Fixed (in the trunk) to catch boundary cases of incorrect docNums
          passed to deleteDocuments.
        • Fixed (in the trunk) to do earlier "IllegalState" checks to catch
          index corruption sooner. Also fixed the existing IllegalState
          check to catch missing boundary cases.
        • The re-using of old _XXX.del files is already fixed with lockless
          commits (in trunk). This remains open for past releases, but the
          workaround is simple and I've now publicized this on java-user.
        Michael McCandless added a comment -

        Closing all issues that were resolved for 2.1.


          People

          • Assignee: Michael McCandless
          • Reporter: legez
          • Votes: 5
          • Watchers: 5
