|
It may make more sense to trap "Access Denied" in the lock.obtain, but then translate this into "the lock was not acquired" (ie, just return 0). Because, above this code is the retry logic for the lock (which pauses by default for 1.0 sec).
I'm having trouble reproducing this issue. I copied the TestInterleavedAddAndRemoves.java into src/test/org/apache/lucene/index, then ran the test directly using "java org.junit.runner.JUnitCore org.apache.lucene.index.TestInterleavedAddAndRemoves", using a clean checkout of the current Lucene HEAD. The test is still running and is quite far along and I haven't hit any of the above errors.
I'm running on Windows XP SP2, Sun JDK 1.5.0_07. I wonder if SP1 vs SP2 makes the difference? Could you also try [temporarily] turning off any virus / malware scanning tools? I wonder if you have one that's doing "live" checking and hold files open? (Though, I have a virus scanner running and it's not causing problems...). I would like to reproduce this so I could test it against my fixes for lock-less commits! > just to confirm, is it the COMMIT lock that's throwing these
> unhandled exceptions (not the WRITE lock)? > If so, lockless commits would fix this. In my tests so far, these errors appeared only for commit locks. However I consider this a coincidence - there is nothing as far as I can understand special with commit locks comparing to write locks - in particular they both use createNewFile. So, I agree that lockless commits would prevent this, which is good, but we cannot count on that it would not happen for write locks as well. Also, the more I think about it the more I like lock-less commits, still, they would take a while to get into Lucene, while this simple fix can help easily now. Last, with lock-less commits, still, there would be calls to createNewFile for write lock, and there would be calls to renameFile() and other IO file operations, intensively. By having a safety code like the retry logic that is invoked only in rare cases of these unexpected, some nasty errors would be reduced, more users would be happy. > Can you provide more details on the exceptions you're seeing? Here is one from my run log, that occurs at the call to optimize, after at the end of all the add-remove iterations - [junit] java.io.IOException: Cannot rename C:\Documents and Settings\tpowner\Local Settings\Temp\test.perf\index_24\deleteable.new to C:\Documents and Settings\tpowner\Local Settings\Temp\test.perf\index_24\deletable This exception btw is from the performance test for interleaved-adds-and-removes - issue 565 - so IndexWriter line numbers here relate to applying recent patch from issue 565 (though the same errors are obtained with the svn head of IndexWriter). > It may make more sense to trap "Access Denied" in the lock.obtain, It is true that when the lock cannot be obtained the existing retry logic in Lock.java could handle it. But when you come to think of it, this is not the purpose of that Lock retry logic - that was for the case that the lock is really acquired by someone else, and we want to stay around for a while to try again. This is not the case here, although the symptoms are similar. Masking this error would not be a good idea. I think it is better for the code in FSDirectory to throw the exception if the retry fails as well (as currently in this patch), and let Lock.java apply its retry logic also for an IOException. If again, the retry of Lock class fails, it would be again problematic to hide the exception. > I'm having trouble reproducing this issue. I copied the I'm not sure here. I am also running with svn head. I am trying again now, after I turned off anti-virus, and disabled Windows indexing (though the service was already off), and disabled an afs client service that was running. I will report here if the errors happen again. But I am not sure how this should affect decision on applying this fix - there would always be user machines out there running Lucene and also running other services. We could tell users - hey, make sure that none of the other services / software running on your machine is holding / touching / examining Lucene index files, otherwise, don't blame Lucene - but this is not easily done. Not all developers out there have control or understanding of what's running on their machines - some programs are installed by a system support, you know how it is. So, while it is understandable that Lucene would fail if there is a malicious software that actually grabs and holds Lucene files and interfere with them (for "long" periods of times), it would be nice to keep these failures at minimum. > I would like to reproduce this so I could test it against my fixes for lock-less commits! The performance test case for 565 is a more aggressive test in this regard - it produced more of these errors for me, including rename() errors. To run it, apply the most recent patch from http://issues.apache.org/jira/browse/LUCENE-565 Stopping the anti-virus and its friends did not matter - still getting the errors.
However saw a case that the 30ms did not suffice for obtaining the lock in the retry. Although 30ms was arbitrary in the first place, this is discouraging. This was before fixing to let Lock.obtain() apply its retry logic in case of such an exception. So I fixed that (Lock.obtain()) and re-running, now using 100ms instead of 30ms for the one retry in FSDirectory. Ain't life fun. A single retry in Lock.obtain() makes the error less likely, but certainly not impossible... the second attempt could fail for the same reason.
obtain() is supposed to return success or failure immediately. I'd be tempted to override obtain(timout) for FS locks and keep the retry logic there. I agree we don't want to mask all IOExceptions and treat them as failure to aquire locks... they should bubble up sooner or later to help diagnose real IOExceptions. > obtain() is supposed to return success or failure immediately.
> I'd be tempted to override obtain(timout) for FS locks and keep the retry logic there. Right, this is the right place for the retry. This way changes are limited to FSDirectory, and obtain() remains unchanged. I am tesing this now and would subit an updated patch, where:
I am attaching an updated patch - FSDirs_Retry_Logic_3.patch.
In this update:
"ant test" tests all pass. > But I am not sure how this should affect decision on applying this fix > We could tell users - hey, make sure that none of the other services / > So, while it is understandable that Lucene would fail if there is a Alas I still cannot reproduce this. I think there must be some I agree, Lucene should strive to be robust to the various For example, if it turns out this happens only under Windows XP SP1, Given that we have two environments, one very reliably showing these OS: Windows XP Pro, SP2 I think I know which software is causing/exposing this behavior in my environment.
This is the SVN client I am using - TortoiseSVN. I tried the following sequence: I am using most recent stable TotoiseSVN version - 1.3.5 build 6804 - 32 bit, for svn-1.3.2, downloaded from http://tortoisesvn.tigris.org/ There is an interesting discussion thread of these type of errors on Windows platforms in svn forums - http://svn.haxx.se/dev/archive-2003-10/0136.shtml It says "...Windows allows applications to "tag-along" to see when a file has been written - they will wait for it to close and then do whatever they do, usually opening a file descriptor or handle. This would prevent that file from being renamed for a brief period..." TortoiseSVN is a shell extension integrated into Windows explorer. As such, it probably demonstrates the "tag-along" behavior described above. (BTW, it is a great svn client to my opinion) Here is another excerpt from that discussion thread - We don't know that this is a bug in TortoiseSVN. I guess the question is - is it worth for Lucene to attempt to at least reduce chances of failures in this case (I say yes Wow! Fantastic sleuthing. I never would have guessed that.
I just sent this summary of this to java-user: There is an issue opened on Lucene: http://issues.apache.org/jira/browse/LUCENE-665 that I'd like to draw your attention to and summarize here because The gist of the issue is: on Windows, you sometimes see intermittant I know there was at least one recent thread where someone was hitting Anyway, at the end of the issue it was discovered that there was an Unfortunately, there are apparently many software packages that use But the bottom line is: if you hit these "Access Denied" errors, one I do think we should make Lucene robust to "windows change log"
software. We could take the position that you have to uninstall such software I would put this under the "Lucene should assume the least common I will try to reproduce this bug with my [upcoming] changes for My summary - and "what's next" proposal - for the discussion so far (in comments for issue-665 and in thread http://www.nabble.com/-jira--Created%3A-%28LUCENE-665%29-temporary-file-access-denied-on-Windows-tf2167540.html):
[1] Reported problem can be regenerated in Windows in presence of programs monitoring files. [2] The proposed fix adds retry after 100ms delay in rare cases where the problem occurs. [3] That fix reduces much the chances of the problem but does not really solve it. [4] Proposed fix for FSDirectry not accepted because: [5] A Windows-specific implementation of FSDir that would not be the default, but would be available for application to select, was proposed as a better place to host this retry logic, to be available for applications at least until the "lock-less" commits is available for use and proves to solve the same problem. So, I intend to write this solution as outlined in [5] above. It would be optional, definitely not the default. Applications would be able to use it for Windows environments. The retry behavior would be controlled. In addition, would be controlled if to apply retry logic for lock-delete or not - the default would be 'no' - because in NFS, a delete may return 'failed' due to time-out although it actually succeeded, and a retry logic in this case might "kill" voluntary file locking schemes like the default one used by Lucene (though I assume that with the NFS native locks proposed by Michael this is not the case). Hope this reflects the discussion so far... Attached patch - FSWinDirectory - implements retry logic of FS operations in a separate non default directory class as discussed above.
By default this new class is not used. Applications can start using it by replacing the IMPL class in FSDirectory to be the new class FSWinDirectory. There are two ways to do this - by setting a system property (this is the original mechanism), or by calling FSDirectory static (new) method - setFSDirImplClass(name). There are 3 new classes in this patch:
Few simple modifications were required in FSDirectory, SimpleFSLockFactory and TestLockfactory in order to allow inheritance Tests:
Doron, which version of TortoiseSVN did you have installed when you got the exceptions?
I've installed version 1.4.0 on my Windows XP SP2 box, and then ran your stress test just fine, ie, I can't reproduce the errors (to verify that lock-less commits fixes this). Updated the patch according to review comments by Hoss, plus:
Tested as previous patch:
Doron, I finally managed to see an exception like yours above, but I had to have the Windows Explorer open to the index directory and then right click on files, while the indexing was happening. Once I could get this to happen I then found that the lock-less patch (
Michael, I am not able to generate this with native locks. (did not try with lockless commits).
Which brings me to think that native locks should be made default? There is another thing that bothers me with locks, in NFS or other shared fs situations: The documentation for File.getCanonicalPath() says that it is system dependent. So I am not even sure how it can be guaranteed that Lucene used on Linux and Lucene used on Windows (say) that accesss the same index would be able to lock on the same index. And for two Windows machines, admin would have to verify that the index fs (samba/afs/nfs) mounts with the same drive letter. This seems like a limitation on one hand, and also as a source for possible problems, when users mis configure their mount names. I may be missing someting trivial here, because it seems too wrong to be true... I'll let the list comment on that... Odd that just by using native locking, it stopped your issues. Lucene (without lock-less commits) does quite a bit of file renaming (eg the deletable renaming in your exception above). I don't get why switching to native locking by itself would fix the renaming errors.
Yes the lock prefix is likely to not match when the machines mount to a different point, and almost certainly not if the machines are different OSs. To deal with this I just use LockFactory.setLockPrefix() after the LockFactory has been assigned to a directory. I added to the Javadoc for that method in the native locking implementation for exactly this use case. > Odd that just by using native locking, it stopped your issues.
Agree. I did not expect that to happen, since indeed I saw in the past exceptions on renameFile, though most exceptions were in locks activity. So I ran it many times, with an antivirus scan, etc. But it always passes. Therefore I would not object to closing this issue - If I cannot test it I cannot fix it. But for the same reason, I would like to see native locks becoming the default. > setLockPrefix() I'll take this one to a seprate thread in dev list. Doron can we close this issue now? I think native locking and/or less IO operations with lockless commits has resolved it?
Hi Michael,
Funny that I got this email with reply-to to you rather than the list. Yes, I would like to close the issue - I already said that in my Oct 30 I would like to do this myself - should I "close" or "resolve" the issue? Thanks, atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464174] OK sounds good.
Weird that the reply-to was my email. Normally it's java-dev? I guess I would "resolve" it as "fixed", but don't "close" it yet, see here: http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200605.mbox/%3c4469FDC7.70600@apache.org%3e I don't think it's a "duplicate" since it really is its own bug even if it shared a root cause (& fix) with another bug. And I don't think it's a "won't fix" since it is now fixed (in the trunk). With lockless commits this is no longer reproducable, and although theoretically it seems that in some cases it should be able to reproduce this, practice suggests otherwise, and there seems to be no sufficient justification to introduce retry logic (which is not a 100% solution anyhow).
In case anyone else is looking for this - Jira "life cycle" under discussed in http://www.nabble.com/jira-workflow-tf2459130.html#a6853917
(would have been nice if my "life cycle" query was expanded with "workflow".../ For the workflow this is also useful: http://wiki.apache.org/lucene-hadoop-data/attachments/JiraWorkflow/attachments/workflow.png |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Also, once we switch to native locking (first "decoupling locking implementation from directory implementation":
LUCENE-635, and then I'm working on a LockFactory that uses native locks within that) I think likely this would be fixed as well (assuming that createNewFile is failing because two separate processes are trying to do so at [nearly] the same time).Can you provide more details on the exceptions you're seeing? Especially on the "cannot rename file" exception?