Issue Details (XML | Word | Printable)

Key: LUCENE-140
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Michael McCandless
Reporter: legez
Votes: 5
Watchers: 5
Operations

If you were logged in you would be able to see more operations.
Lucene - Java

docs out of order

Created: 07/Oct/03 08:05 PM   Updated: 27/Feb/07 06:10 PM
Return to search
Component/s: Index
Affects Version/s: unspecified
Fix Version/s: 2.1

Time Tracking:
Not Specified

File Attachments:
  Size
Text File bug23650.txt 2005-06-14 11:46 PM Arvind Srinivasan 2 kB
File Licensed for inclusion in ASF works corrupted.part1.rar 2006-01-21 07:41 AM Jarrod Cuzens 9.00 MB
File Licensed for inclusion in ASF works corrupted.part2.rar 2006-01-21 07:44 AM Jarrod Cuzens 3.35 MB
Text File indexing-failure.log 2007-01-10 12:50 AM Jed Wesley-Smith 2.75 MB
Text File Licensed for inclusion in ASF works LUCENE-140-2007-01-09-instrumentation.patch 2007-01-09 02:49 PM Michael McCandless 6 kB
Environment:
Operating System: Linux
Platform: PC

Bugzilla Id: 23650
Resolution Date: 11/Jan/07 12:14 PM


 Description  « Hide
Hello,
I can not find out, why (and what) it is happening all the time. I got an
exception:
java.lang.IllegalStateException: docs out of order
at
org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:219)
at
org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:191)
at
org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:172)
at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:135)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:88)
at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:341)
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:250)
at Optimize.main(Optimize.java:29)

It happens either in 1.2 and 1.3rc1 (anyway what happened to it? I can not find
it neither in download nor in version list in this form). Everything seems OK. I
can search through index, but I can not optimize it. Even worse after this
exception every time I add new documents and close IndexWriter new segments is
created! I think it has all documents added before, because of its size.

My index is quite big: 500.000 docs, about 5gb of index directory.

It is repeatable. I drop index, reindex everything. Afterwards I add a few
docs, try to optimize and receive above exception.

My documents' structure is:
static Document indexIt(String id_strony, Reader reader, String data_wydania,
String id_wydania, String id_gazety, String data_wstawienia)
{
Document doc = new Document();

doc.add(Field.Keyword("id", id_strony ));
doc.add(Field.Keyword("data_wydania", data_wydania));
doc.add(Field.Keyword("id_wydania", id_wydania));
doc.add(Field.Text("id_gazety", id_gazety));
doc.add(Field.Keyword("data_wstawienia", data_wstawienia));
doc.add(Field.Text("tresc", reader));

return doc;
}

Sincerely,
legez



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Otis Gospodnetic added a comment - 25/Nov/03 07:35 PM
Since the bug submitter hasn't followed up to this 'repeatable' issue in almost
two months, I shall assume that this was not a bug in Lucene, but a misuse of
Lucene.

There is also no self-contained code that demonstrates the bug.

If this is indeed a Lucene bug, please re-open this bug entry, and provise
self-sufficient unit test that demonstrates this problem.


Arvind Srinivasan added a comment - 14/Jun/05 11:42 PM
More Data Integrity Issue: Docs out of Order

Hi,
Seeing similar issue to the one reported in:
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=23650
On examining the segments, following inconsistencies were found
(a) The merging segments had doc number that is greater than maxDoc.
Don't know how it go in this state, but this occurs using standard lucene
code.
(b) Strangely, some documents had terms with zero frequency. And when it
occurred,
the zero frequency term has several posting as (docid 0)
Example.. (docid freq) – MaxDoc = 7749 - NO DELETION.
Merging msgBody; text=it; sz=2 — The field name is msgBody and term is "it"
and two segments have the term.
(0 0)(0 0)(0 0)..........(0 0)(4 6)(5 3)(6 1)(9 1)(10 2)(12 1)......
...(6791 2)(6794 3)(6796 2)(6798 16)(6801 1)(6805 1)(6806 5)
(6808 1)(6810 1)(6815 2)(6816 3)(6817 1)(6818 1)(6821 4)(6822 1)
(6824 3)(6826 4)(6828 1)(6829 3)(12549 2)doc exceeds count
749(13570 1)doc exceeds count 7749(14896 1)doc exceeds count 7749
(15028 1)doc exceeds count 7749(15357 1)doc exceeds count 7749
(15427 1)doc exceeds count 7749(15534 1)doc exceeds count 7749
(15535 1)doc exceeds count 7749(15653 1)doc exceeds count 7749
(16530 1)doc exceeds count 7749(17108 1).......
(c) Also the zero frequency was not limited to the 0 document, there was
another instance.

One work around that seemed to resolve the issue was to:
(a) keep the maxDoc as a member variable in SegmentMergeInfo
and ignore/throw exception if an inconsistent state is detected.

****ADD To SegmentMerger just before "docs out of order" check.
if (postings.freq() == 0) { continue; }
if (doc >= smi.maxDoc) { //sbLog.append("doc exceeds count \r\n " + smi.maxDoc); continue; }
****

Atleast putting a check would not corrupt the segments and would
get us closer to the real problem as to why freq=0 and doc number exceeds
maxDoc. Note, the code has had the fix to the other Segment corruption issue
that I previously reported (namely, Using a zero length file).


Arvind Srinivasan added a comment - 14/Jun/05 11:46 PM
Created an attachment (id=15405)
Analysis of corrupted segments and suggestions.

Otis Gospodnetic added a comment - 22/Jun/05 12:04 PM
Arvind, thanks for following up on this issue. From your report I can see that
the index really ends up containing invalid data, but I've never even seen this
happen myself. Could you please put together sample code that gets the index in
this state?

Yonik Seeley added a comment - 10/Nov/05 11:59 AM
I've never seen this... can anyone reproduce with Lucene 1.9?
CCing this to Arvind's email...

Otis Gospodnetic added a comment - 10/Nov/05 12:05 PM
2 years later, I still haven't seen this error.

Jarrod Cuzens added a comment - 10/Jan/06 04:30 PM
First I think that Lucene is great and it manages to do an incredible job. However, this issue is also causing us significant problems. We originally had an AOP interceptor that would update documents in our HTTP threads however when we started to see this issue we were concerned that it was caused by multiple threads accessing the index at the same time. We put extra concurrency controls on the updates using the LuceneIndexAccess API that was posted into bugzilla by another user. This issue still remained after we added the extra concurrency control.

Since then we have abandoned the AOP approach completely and moved the index rebuild to a scheduled thread which collects things that were modified from the database (using datestamp versioning) and then proceeds to update their lucene indexes. We hoped this would solve the problem because only 1 thread in 1 process would be modifying the index at any given time. Alas, we are still getting docs out of order exceptions. It is difficult for us to reproduce as it mainly happens in production and we cannot provide a testcase for it (I wish we could!).

I know that this must be a difficult issue because it is difficult to reproduce. I can't think of anything strange that we are doing with the indexes (one process, one thread modifying). This probably isn't much help but I just wanted to let you know that we are also experiencing the problem.


Doug Cutting added a comment - 11/Jan/06 02:28 AM
File corruption could cause this. Please look in your system logs to see if there are any reports of problems accessing the drive that stores this index.

Jarrod Cuzens added a comment - 17/Jan/06 02:45 AM
Hi Doug,

Thanks for your suggestion (and great work on Lucene!). I looked in the logs and could not find any indication of corruption. In addition we actually have the system running in a cluster where each node in the cluster has its own lucene search index. We had the issue on a different machine before we moved to the cluster and we now see it on both machines in our cluster . Next time it occurs I will get the indexes and try to post them here if you think that would be helpful.

Just some additional info:
1) We are running SUSE 10 Linux.
2) We run two quartz jobs: One job runs every 2 minutes and updates lucene with changes from the db if necessary. The other job runs at 3:30AM in the morning and does full index rebuilds. We use the LuceneIndexAccess api when working with the IndexReader/IndexWriter. Only one thread should modify the index at any time although it is possible that the full rebuild job and the update job execute concurrently they shouldn't both modify the index due to the LuceneIndexAccess API's synchronization.
3) Currently when doing searches we create a new IndexSearcher everytime a search is to be performed; We do not use the LuceneIndexAccess API.

When I refer to the LuceneIndexAccess API I am referring to the contribution by Mark Schreiber:
http://www.blizzy.de/lucene/lucene-indexaccess-0.1.0.zip

Thanks for any help!


Rafal Majewski added a comment - 19/Jan/06 05:57 AM
Hi,

I have recently experienced the same problem on relase 1.4.3. It happended in production(more than once). Filesystem index directory is used. One application is accountable for indexing, another for searching the index(it also uses IndexReader for query rewriting). Access to index directory(which is singleton) is synchronized within each application's scope. These two applications create separate directory instances and access it independently. Unfortunately, the code doesn't lead to repeatable occurances of this exception. I assume that these two applications having not synchronized access to the index might couse the problem, but I have received information that it had also occured with the searching application being down.

Thanks in advance for any help.


Jarrod Cuzens added a comment - 21/Jan/06 07:41 AM
I am posting our corrupted index (I have to do it in two parts because it is 14.5M). I looked at it in Luke but Luke doesn't really have any tools to help really diagnose corruption issues. At this point we are considering changing our system to do builds on one machine, test them, and then have them distributed to the other machines in our cluster.

If anybody could look at this it would be greatly appreciated!
Thanks!
Jarrod


Jarrod Cuzens added a comment - 21/Jan/06 07:44 AM
Second part.

Ales Justin added a comment - 13/Apr/06 04:13 PM
We used Lucene 1.4.3. for a half year now.
But just out of the sudden this issue appeared.

We have synchronized access to index - synchronized singleton handling all write / read operations.

Is there any progress on finding the cause?


Jason Lambert added a comment - 28/Apr/06 01:49 AM
I was having this problem intermittently while indexing over multiple threads and I have found that the following steps can cause this error (with Lucene 1.3 and 1.4.x):
  • Open an IndexReader (#1) over an existing index (this reader is used for searching while updating the index)
  • Using this reader (#1) do a search for the document(s) that you would like to update; obtain their document ID numbers
  • Create an IndexWriter and add several new documents to the index (for me, this writing is done in other threads)
  • Close the IndexWriter
  • Open another IndexReader (#2) over the index
  • Delete the previously found documents by their document ID numbers using reader #2
  • Close the #2 reader
  • Create another IndexWriter (#2) and re-add the updated documents
  • Close the IndexWriter #2
  • Close the original IndexReader (#1) and open a new reader for general searching

If I ensure that the steps marked with an asterisk do not happen (i.e. using thread synchronization), I never get this error. Otherwise, it will happen intermittently while closing the second IndexWriter (#2) (sometimes I get an ArrayIndexOutOfBoundsException during the deletion). These 'extra' writes cause the initial 'segments' file to be updated after which it is re-read while opening the second IndexReader (#2).

This can end up deleting documents using possibly non-existent IDs, most likely causing the index corruption that this error signals.


Jed Wesley-Smith added a comment - 15/Dec/06 04:11 AM
We have seen this one as well. We don't have the same usage as above, we only ever delete documents with IndexReader.deleteDocuments(Term)

We are using Lucene 1.9.1

It occurs in two places, inside IndexWriter.addDocument():

java.lang.IllegalStateException: docs out of order
at org.apache.lucene.index.SegmentMerger.appendPostings([Lorg/apache/lucene/index/SegmentMergeInfo;I)I(Optimized Method)
at org.apache.lucene.index.SegmentMerger.mergeTermInfo([Lorg/apache/lucene/index/SegmentMergeInfo;I)V(Optimized Method)
at org.apache.lucene.index.SegmentMerger.mergeTermInfos()V(Optimized Method)
at org.apache.lucene.index.SegmentMerger.mergeTerms()V(Optimized Method)
at org.apache.lucene.index.SegmentMerger.merge()I(Optimized Method)
at org.apache.lucene.index.IndexWriter.mergeSegments(II)V(IndexWriter.java:681)
at org.apache.lucene.index.IndexWriter.mergeSegments(I)V(IndexWriter.java:658)
at org.apache.lucene.index.IndexWriter.maybeMergeSegments()V(IndexWriter.java:646)
at org.apache.lucene.index.IndexWriter.addDocument(Lorg/apache/lucene/document/Document;Lorg/apache/lucene/analysis/Analyzer;)V(IndexWriter.java:453)
at org.apache.lucene.index.IndexWriter.addDocument(Lorg/apache/lucene/document/Document;)V(IndexWriter.java:436)

and inside IndexWriter.close():

java.lang.IllegalStateException: docs out of order
at org.apache.lucene.index.SegmentMerger.appendPostings([Lorg/apache/lucene/index/SegmentMergeInfo;I)I(Optimized Method)
at org.apache.lucene.index.SegmentMerger.mergeTermInfo([Lorg/apache/lucene/index/SegmentMergeInfo;I)V(Optimized Method)
at org.apache.lucene.index.SegmentMerger.mergeTermInfos()V(Optimized Method)
at org.apache.lucene.index.SegmentMerger.mergeTerms()V(Optimized Method)
at org.apache.lucene.index.SegmentMerger.merge()I(Optimized Method)
at org.apache.lucene.index.IndexWriter.mergeSegments(II)V(IndexWriter.java:681)
at org.apache.lucene.index.IndexWriter.mergeSegments(I)V(IndexWriter.java:658)
at org.apache.lucene.index.IndexWriter.flushRamSegments()V(IndexWriter.java:628)
at org.apache.lucene.index.IndexWriter.close()V(IndexWriter.java:375)

The second one exposes a problem in the close() method which is that the index write.lock is not released when exceptions are thrown in close() causing subsequent attempts to open an IndexWriter to fail.


Michael McCandless added a comment - 18/Dec/06 08:57 PM
I just resolved LUCENE-603 as a dup of this issue.

It would be awesome if we could get a test case that shows this happening. Enough people seem to hit it that it seems likely something is lurking out there so I'd like to get it fixed!!


Jed Wesley-Smith added a comment - 08/Jan/07 05:42 AM
We have now seen this in a number of customer sites since upgrading JIRA to use Lucene 1.9.1. The JIRA report is here: http://jira.atlassian.com/browse/JRA-11861

We only seem to have seen it since the upgrade from 1.4.3 to 1.9.1, we hadn't seen it before then.

This is now a major issue for us, it is hitting a number of our customers. I am trying to generate a repeatable test for it as a matter of urgency.

As a follow-up we sometimes see the old ArrayIndexOutOfBoundsEx in BitVector.get() (BitVector.java:63)

will post more if I find something worth sharing.


Jed Wesley-Smith added a comment - 08/Jan/07 05:44 AM
and we also see ArrayIndexOutOfBoundsEx in the SegmentReader.isDeleted() method:

java.lang.ArrayIndexOutOfBoundsException
at org.apache.lucene.index.SegmentReader.isDeleted(I)Z(Optimized Method)
at org.apache.lucene.index.SegmentMerger.mergeFields()I(Optimized Method)
at org.apache.lucene.index.SegmentMerger.merge()I(Optimized Method)
at org.apache.lucene.index.IndexWriter.mergeSegments(II)V(IndexWriter.java:681)


Michael McCandless added a comment - 08/Jan/07 01:09 PM
OK: I finally found one way that this corruption can occur! I will
create a unit test & commit a fix.

If you delete by document number, and, that document number is larger
than maxDoc, but only by a little, then the call to
deletedDocs.set(num) may in fact succeed (ie, no exception), but will
have set bits that are "out of bounds" in the BitVector's bits array.

This is because the bits array is an array of bytes and so you can
have up to 7 of these unused bits at the end. Once this has happened,
any attempt to merge this segment will hit the "docs out of order"
exception because the BitVector's count() method will count these
"illegally set" bits and thus make the SegmentMerger think too many
docs are deleted.

Unfortunately, this case only occurs if you use deleteDocument(int),
so I can't yet explain how this happens when using only
deleteDocument(Term).


Michael McCandless added a comment - 08/Jan/07 06:34 PM

I've committed a fix for this one case to the trunk.

I'm leaving the issue open so folks above can try the fix and confirm
whether or not this fixes their cases.

Jed (or any other folks who have hit this above and are still
listening!), the fix is really trivial and would be easy to back
port to prior releases: just run "svn diff -r 494135:494136" from
a Lucene checkout to see them.

If you are willing/able to try this in one of the environments where
you keep hitting this issue, that would be awesome: if this is in fact
your root cause, then you would see an ArrayIndexOutOfBoundsException
at the point that the delete of a too-large docNum occurred (rather
than silent corruption and the above exception much later that you now
see); and if it's not your root cause after testing the fix, then we
would know for sure to look for another cause here.

Are you sure that you only ever do IndexReader.deleteDocuments(Term)
and not deleteDocuments(int docNum)? I still can't explain how this
error could happen without using that second method.


Doron Cohen added a comment - 09/Jan/07 01:27 AM
Amazed by this long lasting bug report I was going similar routes to Mike, and I noticed 3 things -

(1) the sequence of ops brought by Jason is wrong:
a Open an IndexReader (#1) over an existing index (this reader is used for searching while updating the index)
b Using this reader (#1) do a search for the document(s) that you would like to update; obtain their document ID numbers
c Create an IndexWriter and add several new documents to the index (for me, this writing is done in other threads)
d Close the IndexWriter
e Open another IndexReader (#2) over the index
f Delete the previously found documents by their document ID numbers using reader #2
g Close the #2 reader
h Create another IndexWriter (#2) and re-add the updated documents
i Close the IndexWriter #2
j Close the original IndexReader (#1) and open a new reader for general searching

Problem here is that the docIDs found in (b) may be altered in step (d) and so step (f) would delete the wrong docs. In particular, it might attempt to delete ids that are out of the range. This might expose exactly the BitVector problem, and would explain the whole thing, but I too cannot see how it explains the delete-by-term case.

(2) BitVectort silent ignoring of attempts to delete slightly-out-of-bound docs that fall in the higher byte - this the problem that Mike fixed. I think the fix is okay - though some applications might now get exceptions they did not get in the past - but I believe this is for their own good.
However when I first ran into this I didn't notice that BitVector.size() would become wrong as result of this - nice catch Mike!

I think however that the test Mike added does not expose the docs out of order bug - I tried this test without the fix and it only fail on the "gotException assert" - if you comment this assert the test pass.

The following test would expose the out-of-order bug - it would fail with out-of-order before the fix, and would succeed without it.

public void testOutOfOrder () throws IOException {
String tempDir = System.getProperty("java.io.tmpdir");
if (tempDir == null) { throw new IOException("java.io.tmpdir undefined, cannot run test: "+getName()); }

File indexDir = new File(tempDir, "lucenetestindexTemp");
Directory dir = FSDirectory.getDirectory(indexDir, true);

boolean create = true;
int numDocs = 0;
int maxDoc = 0;
while (numDocs < 100) {
IndexWriter iw = new IndexWriter(dir,anlzr,create);
create = false;
iw.setUseCompoundFile(false);
for (int i=0; i<2; i++) { Document d = new Document(); d.add(new Field("body","body"+i,Store.NO,Index.UN_TOKENIZED)); iw.addDocument(d); }
iw.optimize();
iw.close();
IndexReader ir = IndexReader.open(dir);
numDocs = ir.numDocs();
maxDoc = ir.maxDoc();
assertEquals(numDocs,maxDoc);
for (int i=7; i >=1; i-) {
try { ir.deleteDocument(maxDoc+i); } catch (ArrayIndexOutOfBoundsException e) {
}
}
ir.close();
}
}

Mike, do you agree?

(3) maxDoc() computation in SegmentReader is based (on some paths) in RandomAccessFile.length(). IIRC I saw cases (in previous project) where File.length() or RAF.length() (not sure which of the two) did not always reflect real length, if the system was very busy IO wise, unless FD.sync() was called (with performance hit).

This post seems relevant - RAF.length over 2GB in NFS - http://forum.java.sun.com/thread.jspa?threadID=708670&messageID=4103657

Not sure if this can be the case here but at least we can discuss whether it is better to always store the length.


Jed Wesley-Smith added a comment - 09/Jan/07 06:38 AM
Hi Michael,

This is awesome, I have prepared a patched 1.9.1: http://jira.atlassian.com/secure/attachment/19390/lucene-core-1.9.1-atlassian-patched-2007-01-09.jar

Unfortunately we don't have a repeatable test for this so we will have to distribute to afflicted customers and - well, pray I guess. We have been seeing this sporadically in our main JIRA instance http://jira.atlassian.com so we will hopefully not observe it now.

We do only use the deleteDocuments(Term) method, so we are not sure whether this will truly fix our problem, but we note that that method calls deleteDocument(int) based on the TermDocs returned for the Term - and maybe they can be incorrect???

Out of interest, apart from changing from 1.4.3 to 1.9.1, in the JIRA 3.7 release we changed our default merge factor to 4 from 10. We hadn't seen this problem before, and suddenly we have had a reasonable number of occurrences.


Jed Wesley-Smith added a comment - 09/Jan/07 06:41 AM
Alas, this doesn't appear to be the problem. We are still getting it, but we do at least have a little more info. We added the doc and lastDoc to the IllegalArgEx and we are getting very strange numbers:

java.lang.IllegalStateException: docs out of order (-1764 < 0)
at org.apache.lucene.index.SegmentMerger.appendPostings([Lorg/apache/lucene/index/SegmentMergeInfo;I)I(SegmentMerger.java:335)
at org.apache.lucene.index.SegmentMerger.mergeTermInfo([Lorg/apache/lucene/index/SegmentMergeInfo;I)V(SegmentMerger.java:298)
at org.apache.lucene.index.SegmentMerger.mergeTermInfos()V(SegmentMerger.java:272)
at org.apache.lucene.index.SegmentMerger.mergeTerms()V(SegmentMerger.java:236)
at org.apache.lucene.index.SegmentMerger.merge()I(SegmentMerger.java:89)
at org.apache.lucene.index.IndexWriter.mergeSegments(II)V(IndexWriter.java:681)
at org.apache.lucene.index.IndexWriter.mergeSegments(I)V(IndexWriter.java:658)
at org.apache.lucene.index.IndexWriter.maybeMergeSegments()V(IndexWriter.java:646)
at org.apache.lucene.index.IndexWriter.addDocument(Lorg/apache/lucene/document/Document;Lorg/apache/lucene/analysis/Analyzer;)V(IndexWriter.java:453)
at org.apache.lucene.index.IndexWriter.addDocument(Lorg/apache/lucene/document/Document;)V(IndexWriter.java:436)

where doc = -1764 and lastDoc is zero


Michael McCandless added a comment - 09/Jan/07 11:18 AM
Jed, thanks for testing the fix!

> Alas, this doesn't appear to be the problem. We are still getting
> it, but we do at least have a little more info. We added the doc and
> lastDoc to the IllegalArgEx and we are getting very strange numbers:
>
> java.lang.IllegalStateException: docs out of order (-1764 < 0)
> ...
>
> where doc = -1764 and lastDoc is zero

OK, so we've definitely got something else at play here (bummer!). That
(negative doc number) is very strange. I will keep looking a this. I
will prepare a patch on 1.9.1 that adds some more instrumentation so
we can get more details when you hit this ...

> We do only use the deleteDocuments(Term) method, so we are not sure
> whether this will truly fix our problem, but we note that that
> method calls deleteDocument(int) based on the TermDocs returned for
> the Term - and maybe they can be incorrect???

Right, but I had thought the docNum's coming in from this path would
be correct. It sounds like we have another issue at play here that
can somehow get even these doc numbers messed up.

> Out of interest, apart from changing from 1.4.3 to 1.9.1, in the
> JIRA 3.7 release we changed our default merge factor to 4 from
> 10. We hadn't seen this problem before, and suddenly we have had a
> reasonable number of occurrences.

Interesting. Maybe try changing back to 4 and see if it suppresses
the bug? Might give us [desperately needed] more data to cling to
here! On the 1.4.3 -> 1.9.1 change, some of the cases above were even
pre-1.4.x (though they could have been from yet another root cause or
maybe filesystem) so it's hard to draw hard conclusions on this
front.


Michael McCandless added a comment - 09/Jan/07 11:32 AM

Doron,

> (1) the sequence of ops brought by Jason is wrong:
> ...
>
> Problem here is that the docIDs found in (b) may be altered in step
> (d) and so step (f) would delete the wrong docs. In particular, it
> might attempt to delete ids that are out of the range. This might
> expose exactly the BitVector problem, and would explain the whole
> thing, but I too cannot see how it explains the delete-by-term case.

Right, the case I fixed only happens when the Lucene
deleteDocument(int docNum) is [slightly] mis-used. Ie if you are
"playing by the rules" you would never have hit this bug. And this
particular use case is indeed incorrect: doc numbers are only valid to
the one reader that you got them from.

> I think however that the test Mike added does not expose the docs
> out of order bug - I tried this test without the fix and it only
> fail on the "gotException assert" - if you comment this assert the
> test pass.

Huh, I see my test case (in IndexReader) indeed hitting the original
"docs out of order" exception. If I take the current trunk and
comment out the (one line) bounds check in BitVector.set and run that
test, it hits the "docs out of order" exception.

Are you sure you updated the change (to tighten the check to a <= from
a <) to index/SegmentMerger.java? Because, I did indeed find that the
test failed to fail when I first wrote it (but should have). So in
digging why it didn't fail as expected, I found that the check for
"docs out of order" missed the boundary case of the same doc number
twice in a row. Once I fixed that, the test failed as expected.

> (3) maxDoc() computation in SegmentReader is based (on some paths)
> in RandomAccessFile.length(). IIRC I saw cases (in previous project)
> where File.length() or RAF.length() (not sure which of the two) did
> not always reflect real length, if the system was very busy IO wise,
> unless FD.sync() was called (with performance hit).

Yes I saw this too. From the follow-on discussion it sounds like we
haven't found a specific known JVM bug here. Still, it does make me
nervous that we rely on file length to derive maxDoc.

In general I think we should rely on as little as possible from the
file system (there are so many cross platform issues/differences) and
instead explicitly store things like maxDoc into the index. I will
open a separate Jira issue to track this. Also I will record this
path in the instrumentation patch for 1.9.1 just to see if we are
actually hitting something here (I think unlikely but possible).


Michael McCandless added a comment - 09/Jan/07 11:40 AM
OK, I created LUCENE-767 for the "maxDoc should be explicitly stored in the index" issue.

Michael McCandless added a comment - 09/Jan/07 02:50 PM

Jed, one question: when you tested the fix, you fully rebuilt your
index from scratch, right? Just want to verify that. You have to
re-index because once the index is corrupted it will eventually hit
the "docs out of order" exception even if you fix the original cause.

OK I've prepared a patch off 1.9.1 (just attached it). The patch
passes all unit tests on 1.9.1.

It has the changes I committed to the trunk yesterday, plus
instrumentation (messages printed to a PrintStream) to catch places
where doc numbers are not correct.

All messages I added print to a newly added infoStream static member
of SegmentMerger. You can do SegmentMerger.setInfoStream(...) to
change it (it defaults to System.err).

Jed if you could get the error to re-occur with this patch and then
post the resulting messages, that would be great. Hopefully it gives
us enough information to find the source here or at least to have
another iteration with yet more instrumentation. Thanks!


Jed Wesley-Smith added a comment - 10/Jan/07 12:57 AM
Hi Michael,

Thanks for the patch, applied and recreated. Attached is the log.

To be explicit, we are recreating the index via the IndexWriter ctor with the create flag set and then completely rebuilding the index. We are not completely deleting the entire directory. There ARE old index files (_.cfs & _.del) in the directory with updated timestamps that are months old. If I completely recreate the directory the problem does go away. This is a fairly trivial "fix", but we are still investigating as we want to know if this is indeed the problem, how we have come to make it prevalent, and what the root cause is.

Thanks for all the help everyone.


Jed Wesley-Smith added a comment - 10/Jan/07 06:07 AM
BTW. We have looked at all the open files referenced by the VM when the indexing errors occur, and there does not seem to be any reference to the old index segment files, so I am not sure how those files are influencing this problem.

Doron Cohen added a comment - 10/Jan/07 07:36 AM
Jed, is it possible that when re-creating the index, while IndexWriter is constructed with create=true, FSDirectory is opened with create=false?
I suspect so, because otherwise, old .del files would have been deleted.
If indeed so, newly created segments, which have same names as segments in previous (bad) runs, when opened, would read the (bad) old .del file.
This would possibly expose the bug fixed by Michael.
I may be over speculating here, but if this is the case, it can also explain why changing the merge factor from 4 to 10 exposed the problem.

In fact, let me speculate even further - if indeed when creating the index from scratch, the FSDirectory is (mistakenly) opened with create=false, as long as you always repeated the same sequencing of adding and deleting docs, you were likely to almost not suffer from this mistake, because segments created with same names as (old) .del files simply see docs as deleted before the docs are actually deleted by the program. The search behaves wrongly, not finding these docs before they are actually deleted, but no exception is thrown when adding docs. However once the merge factor was changed from 4 to 10, the matching between old .del files and new segments (with same names) was broken, and the out-of-order exception appeared.

...and if this is not the case, we would need to look for something else...


Michael McCandless added a comment - 10/Jan/07 10:48 AM
OK from that indexing-failure.log (thanks Jed!) I can see that indeed
there are segments whose maxDoc() is much smaller than
deleteDocs.count(). This then leads to negative doc numbers on
merging these segments.

Jed when you say "there are old files (_.cfs & _.del) in this
directory with updated timestamps that are months old" what do you
mean by "with updated timestamps"? Which timestamp is months old and
which one is updated?

OK, assuming Jed you are indeed sending "create=false" when creating
the Directory and then passing that directory to IndexWriter with
create=true, I think we now have this case fully explained (thanks
Doron): your old _*.del files are being incorrectly opened & re-used
by Lucene, when they should not be.

Lucene (all released versions but not the trunk version, see below)
does a simple fileExists("_XXX.del") call to determine if a segment
XXX has deletes.

But when that _XXX.del is a leftover from a previous index, it very
likely doesn't "match" the newly created _XXX segment. (Especially if
merge factor has changed but also if order of operations has changed,
which I would expect in this use case).

If that file exists, Lucene assumes it's for this segment and so opens
it and uses it. If it happens that this _XXX.del file has more
documents in it than the newly created _XXX.cfs segment, then negative
doc numbers will result (and then later cause the "docs out of order"
exception). If it happens that the _XXX.del file has fewer documents
than the newly created _XXX.cfs segment then you'll hit
ArrayIndexOutOfBounds exceptions in calls to isDeleted(...). If they
are exactly equal then you'd randomly see some of your docs got
deleted.

Note that the trunk version of Lucene has already fixed this bug (as
part of lockless commits):

  • Whether a segment has deletions or not is now explictly stored in
    the segments file rather than relying on a "fileExists(...)" call.
    So, if an old _XXX.del existed in the filesystem, the newly
    created _XXX segment would not open it.
  • Furthermore, the trunk version of Lucene uses a new
    IndexFileDelter class to remove any unreferenced index files.
    This means it would have removed these old _.cfs and _.del files
    even in the case where a directory was created with "create=false"
    and the IndexWriter was created with "create=true".

To summarize:

  • There was one case where if you gave slightly illegal doc numbers
    (within 7 of the actual maxDoc) Lucene may silently accept the
    call but would corrupt your index only to be seen later as an
    "docs out of order" IllegalStateException when the segment is
    merged. This was just a missing boundary case check. This case
    is now fixed in the trunk (you get an
    ArrayIndexOutOfBoundsException if doc number is too large).
  • There is also another case, that only happens if you have old
    _*.del files leftover from a previous index while re-creating a
    new index.

The workaround is simple here: always open the Directory with
create=true (or, remove the directory contents yourself before
hand). (IndexWriter does this if you give it a String or File
with create=true).

This is really a bug in Lucene, but given that it's already fixed
in the trunk, and the workaround is simple, I'm inclined to not
fix it in prior releases and instead publicize the issue (I will
do so on java-user).

But, I will commit two additional IllegalStateException checks to
the trunk when a segment is first initialized: 1) check that the
two different sources of "maxDoc" (fieldsReader.size() and
si.docCount) are the same, and 2) check that the number of pending
deletions does not exceed maxDoc(). When an index has
inconsistency I think the earlier it's detected the better.


Michael McCandless added a comment - 10/Jan/07 07:18 PM
> BTW. We have looked at all the open files referenced by the VM when
> the indexing errors occur, and there does not seem to be any reference
> to the old index segment files, so I am not sure how those files are
> influencing this problem.

Jed just to answer this question: the _XXX.del files are opened very
briefly because the contents of this file are loaded / cached in
memory, and the the file handle is closed. I don't think the _XXX.cfs
files are affecting this issue (are not opened).


Jed Wesley-Smith added a comment - 11/Jan/07 01:18 AM
Michael, Doron, you guys are legends!

Indeed the problem is using only the IndexWriter with create true to recreate the directory. Creating a new Directory with create true does fix the problem. The javadoc for this constructor is fairly explicit that it should recreate the index for you (no caveat), so I would consider that a bug, but - given that head fixes it - not one that requires any action.

Thanks guys for the prompt attention, excellent and thorough analysis.


Michael McCandless added a comment - 11/Jan/07 11:46 AM
Phew! I'm glad we finally got to the bottom of this one.

Thank you for your persistent and fast testing iterations, Jed; this
issue has been open for far too long!

I will send a summary email to java-user and resolve this issue,
finally.


Michael McCandless added a comment - 11/Jan/07 12:09 PM

Actually, this reminds me that, as of lockless commits, there is one
important tradeoff on which "create=true" to use (the case on windows
where you want to re-create the index but readers are currently using
it). I will call out this difference in the javadocs.

Although, why do we even have a "create" parameter in the directory?
I think it's confusing (and dangerous, pre-trunk, due to this issue)
to have two ways of doing the same thing?

Logically, I don't think a Directory should take the responsibility of
deleting old files (including old lock files). It should be a clean
interface for doing so, but I think the IndexWriter alone should be
the class that deletes files from the directory.

With lockless commits this has become an important difference, ie, the
new IndexFileDeleter class (used by IndexWriter) handles re-trying
files that are in-use (on Windows) whereas FSDirectory will throw an
exception if create=true and there are index files are in use.

I think we should deprecate the "create" argument to
FSDirectory.getDirectory and leave only the create argument in
IndexWriter's constructors. Am I missing something? Is there are a
reason not to do this?


Michael McCandless added a comment - 11/Jan/07 12:14 PM
Resolving this now, finally (I'll move the two "create" arguments
discussion to a separate issue if we decide to go forward with that):
  • Fixed (in the trunk) to catch boundary cases of incorrect docNum's
    to deleteDocuments.
  • Fixed (in the trunk) to do earlier "IllegalState" checks to catch
    index corruption sooner. Also fixed the existing IllegalState
    check to catch missing boundary cases.
  • The re-using of old _XXX.del files is already fixed with lockless
    commits (in trunk). This remains open for past releases, but the
    workaround is simple and I've now publicized this on java-user.

Michael McCandless added a comment - 27/Feb/07 06:10 PM
Closing all issues that were resolved for 2.1.