|
More Data Integrity Issue: Docs out of Order
Hi, One work around that seemed to resolve the issue was to: ****ADD To SegmentMerger just before "docs out of order" check. Atleast putting a check would not corrupt the segments and would Created an attachment (id=15405)
Analysis of corrupted segments and suggestions. Arvind, thanks for following up on this issue. From your report I can see that
the index really ends up containing invalid data, but I've never even seen this happen myself. Could you please put together sample code that gets the index in this state? I've never seen this... can anyone reproduce with Lucene 1.9?
CCing this to Arvind's email... 2 years later, I still haven't seen this error.
First I think that Lucene is great and it manages to do an incredible job. However, this issue is also causing us significant problems. We originally had an AOP interceptor that would update documents in our HTTP threads however when we started to see this issue we were concerned that it was caused by multiple threads accessing the index at the same time. We put extra concurrency controls on the updates using the LuceneIndexAccess API that was posted into bugzilla by another user. This issue still remained after we added the extra concurrency control.
Since then we have abandoned the AOP approach completely and moved the index rebuild to a scheduled thread which collects things that were modified from the database (using datestamp versioning) and then proceeds to update their lucene indexes. We hoped this would solve the problem because only 1 thread in 1 process would be modifying the index at any given time. Alas, we are still getting docs out of order exceptions. It is difficult for us to reproduce as it mainly happens in production and we cannot provide a testcase for it (I wish we could!). I know that this must be a difficult issue because it is difficult to reproduce. I can't think of anything strange that we are doing with the indexes (one process, one thread modifying). This probably isn't much help but I just wanted to let you know that we are also experiencing the problem. File corruption could cause this. Please look in your system logs to see if there are any reports of problems accessing the drive that stores this index.
Hi Doug,
Thanks for your suggestion (and great work on Lucene!). I looked in the logs and could not find any indication of corruption. In addition we actually have the system running in a cluster where each node in the cluster has its own lucene search index. We had the issue on a different machine before we moved to the cluster and we now see it on both machines in our cluster Just some additional info: When I refer to the LuceneIndexAccess API I am referring to the contribution by Mark Schreiber: Thanks for any help! Hi,
I have recently experienced the same problem on relase 1.4.3. It happended in production(more than once). Filesystem index directory is used. One application is accountable for indexing, another for searching the index(it also uses IndexReader for query rewriting). Access to index directory(which is singleton) is synchronized within each application's scope. These two applications create separate directory instances and access it independently. Unfortunately, the code doesn't lead to repeatable occurances of this exception. I assume that these two applications having not synchronized access to the index might couse the problem, but I have received information that it had also occured with the searching application being down. Thanks in advance for any help. I am posting our corrupted index (I have to do it in two parts because it is 14.5M). I looked at it in Luke but Luke doesn't really have any tools to help really diagnose corruption issues. At this point we are considering changing our system to do builds on one machine, test them, and then have them distributed to the other machines in our cluster.
If anybody could look at this it would be greatly appreciated! We used Lucene 1.4.3. for a half year now.
But just out of the sudden this issue appeared. We have synchronized access to index - synchronized singleton handling all write / read operations. Is there any progress on finding the cause? I was having this problem intermittently while indexing over multiple threads and I have found that the following steps can cause this error (with Lucene 1.3 and 1.4.x):
If I ensure that the steps marked with an asterisk This can end up deleting documents using possibly non-existent IDs, most likely causing the index corruption that this error signals. We have seen this one as well. We don't have the same usage as above, we only ever delete documents with IndexReader.deleteDocuments(Term)
We are using Lucene 1.9.1 It occurs in two places, inside IndexWriter.addDocument(): java.lang.IllegalStateException: docs out of order and inside IndexWriter.close(): java.lang.IllegalStateException: docs out of order The second one exposes a problem in the close() method which is that the index write.lock is not released when exceptions are thrown in close() causing subsequent attempts to open an IndexWriter to fail. I just resolved
It would be awesome if we could get a test case that shows this happening. Enough people seem to hit it that it seems likely something is lurking out there so I'd like to get it fixed!! We have now seen this in a number of customer sites since upgrading JIRA to use Lucene 1.9.1. The JIRA report is here: http://jira.atlassian.com/browse/JRA-11861
We only seem to have seen it since the upgrade from 1.4.3 to 1.9.1, we hadn't seen it before then. This is now a major issue for us, it is hitting a number of our customers. I am trying to generate a repeatable test for it as a matter of urgency. As a follow-up we sometimes see the old ArrayIndexOutOfBoundsEx in BitVector.get() (BitVector.java:63) will post more if I find something worth sharing. and we also see ArrayIndexOutOfBoundsEx in the SegmentReader.isDeleted() method:
java.lang.ArrayIndexOutOfBoundsException OK: I finally found one way that this corruption can occur! I will
create a unit test & commit a fix. If you delete by document number, and, that document number is larger This is because the bits array is an array of bytes and so you can Unfortunately, this case only occurs if you use deleteDocument(int), I've committed a fix for this one case to the trunk. I'm leaving the issue open so folks above can try the fix and confirm Jed (or any other folks who have hit this above and are still If you are willing/able to try this in one of the environments where Are you sure that you only ever do IndexReader.deleteDocuments(Term) Amazed by this long lasting bug report I was going similar routes to Mike, and I noticed 3 things -
(1) the sequence of ops brought by Jason is wrong: Problem here is that the docIDs found in (b) may be altered in step (d) and so step (f) would delete the wrong docs. In particular, it might attempt to delete ids that are out of the range. This might expose exactly the BitVector problem, and would explain the whole thing, but I too cannot see how it explains the delete-by-term case. (2) BitVectort silent ignoring of attempts to delete slightly-out-of-bound docs that fall in the higher byte - this the problem that Mike fixed. I think the fix is okay - though some applications might now get exceptions they did not get in the past - but I believe this is for their own good. I think however that the test Mike added does not expose the docs out of order bug - I tried this test without the fix and it only fail on the "gotException assert" - if you comment this assert the test pass. The following test would expose the out-of-order bug - it would fail with out-of-order before the fix, and would succeed without it. public void testOutOfOrder () throws IOException { File indexDir = new File(tempDir, "lucenetestindexTemp"); boolean create = true; Mike, do you agree? (3) maxDoc() computation in SegmentReader is based (on some paths) in RandomAccessFile.length(). IIRC I saw cases (in previous project) where File.length() or RAF.length() (not sure which of the two) did not always reflect real length, if the system was very busy IO wise, unless FD.sync() was called (with performance hit). This post seems relevant - RAF.length over 2GB in NFS - http://forum.java.sun.com/thread.jspa?threadID=708670&messageID=4103657 Not sure if this can be the case here but at least we can discuss whether it is better to always store the length. Hi Michael,
This is awesome, I have prepared a patched 1.9.1: http://jira.atlassian.com/secure/attachment/19390/lucene-core-1.9.1-atlassian-patched-2007-01-09.jar Unfortunately we don't have a repeatable test for this so we will have to distribute to afflicted customers and - well, pray I guess. We have been seeing this sporadically in our main JIRA instance http://jira.atlassian.com We do only use the deleteDocuments(Term) method, so we are not sure whether this will truly fix our problem, but we note that that method calls deleteDocument(int) based on the TermDocs returned for the Term - and maybe they can be incorrect??? Out of interest, apart from changing from 1.4.3 to 1.9.1, in the JIRA 3.7 release we changed our default merge factor to 4 from 10. We hadn't seen this problem before, and suddenly we have had a reasonable number of occurrences. Alas, this doesn't appear to be the problem. We are still getting it, but we do at least have a little more info. We added the doc and lastDoc to the IllegalArgEx and we are getting very strange numbers:
java.lang.IllegalStateException: docs out of order (-1764 < 0) where doc = -1764 and lastDoc is zero Jed, thanks for testing the fix!
> Alas, this doesn't appear to be the problem. We are still getting OK, so we've definitely got something else at play here (bummer!). That > We do only use the deleteDocuments(Term) method, so we are not sure Right, but I had thought the docNum's coming in from this path would > Out of interest, apart from changing from 1.4.3 to 1.9.1, in the Interesting. Maybe try changing back to 4 and see if it suppresses Doron, > (1) the sequence of ops brought by Jason is wrong: Right, the case I fixed only happens when the Lucene > I think however that the test Mike added does not expose the docs Huh, I see my test case (in IndexReader) indeed hitting the original Are you sure you updated the change (to tighten the check to a <= from > (3) maxDoc() computation in SegmentReader is based (on some paths) Yes I saw this too. From the follow-on discussion it sounds like we In general I think we should rely on as little as possible from the OK, I created
Jed, one question: when you tested the fix, you fully rebuilt your OK I've prepared a patch off 1.9.1 (just attached it). The patch It has the changes I committed to the trunk yesterday, plus All messages I added print to a newly added infoStream static member Jed if you could get the error to re-occur with this patch and then Hi Michael,
Thanks for the patch, applied and recreated. Attached is the log. To be explicit, we are recreating the index via the IndexWriter ctor with the create flag set and then completely rebuilding the index. We are not completely deleting the entire directory. There ARE old index files (_.cfs & _.del) in the directory with updated timestamps that are months old. If I completely recreate the directory the problem does go away. This is a fairly trivial "fix", but we are still investigating as we want to know if this is indeed the problem, how we have come to make it prevalent, and what the root cause is. Thanks for all the help everyone. BTW. We have looked at all the open files referenced by the VM when the indexing errors occur, and there does not seem to be any reference to the old index segment files, so I am not sure how those files are influencing this problem.
Jed, is it possible that when re-creating the index, while IndexWriter is constructed with create=true, FSDirectory is opened with create=false?
I suspect so, because otherwise, old .del files would have been deleted. If indeed so, newly created segments, which have same names as segments in previous (bad) runs, when opened, would read the (bad) old .del file. This would possibly expose the bug fixed by Michael. I may be over speculating here, but if this is the case, it can also explain why changing the merge factor from 4 to 10 exposed the problem. In fact, let me speculate even further - if indeed when creating the index from scratch, the FSDirectory is (mistakenly) opened with create=false, as long as you always repeated the same sequencing of adding and deleting docs, you were likely to almost not suffer from this mistake, because segments created with same names as (old) .del files simply see docs as deleted before the docs are actually deleted by the program. The search behaves wrongly, not finding these docs before they are actually deleted, but no exception is thrown when adding docs. However once the merge factor was changed from 4 to 10, the matching between old .del files and new segments (with same names) was broken, and the out-of-order exception appeared. ...and if this is not the case, we would need to look for something else... OK from that indexing-failure.log (thanks Jed!) I can see that indeed
there are segments whose maxDoc() is much smaller than deleteDocs.count(). This then leads to negative doc numbers on merging these segments. Jed when you say "there are old files (_.cfs & _.del) in this OK, assuming Jed you are indeed sending "create=false" when creating Lucene (all released versions but not the trunk version, see below) But when that _XXX.del is a leftover from a previous index, it very If that file exists, Lucene assumes it's for this segment and so opens Note that the trunk version of Lucene has already fixed this bug (as
To summarize:
The workaround is simple here: always open the Directory with This is really a bug in Lucene, but given that it's already fixed But, I will commit two additional IllegalStateException checks to > BTW. We have looked at all the open files referenced by the VM when
> the indexing errors occur, and there does not seem to be any reference > to the old index segment files, so I am not sure how those files are > influencing this problem. Jed just to answer this question: the _XXX.del files are opened very Michael, Doron, you guys are legends!
Indeed the problem is using only the IndexWriter with create true to recreate the directory. Creating a new Directory with create true does fix the problem. The javadoc for this constructor is fairly explicit that it should recreate the index for you (no caveat), so I would consider that a bug, but - given that head fixes it - not one that requires any action. Thanks guys for the prompt attention, excellent and thorough analysis. Phew! I'm glad we finally got to the bottom of this one.
Thank you for your persistent and fast testing iterations, Jed; this I will send a summary email to java-user and resolve this issue, Actually, this reminds me that, as of lockless commits, there is one Although, why do we even have a "create" parameter in the directory? Logically, I don't think a Directory should take the responsibility of With lockless commits this has become an important difference, ie, the I think we should deprecate the "create" argument to Resolving this now, finally (I'll move the two "create" arguments
discussion to a separate issue if we decide to go forward with that):
Closing all issues that were resolved for 2.1.
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
two months, I shall assume that this was not a bug in Lucene, but a misuse of
Lucene.
There is also no self-contained code that demonstrates the bug.
If this is indeed a Lucene bug, please re-open this bug entry, and provise
self-sufficient unit test that demonstrates this problem.