
Key: LUCENE-665
Type: Bug
Status: Closed
Resolution: Won't Fix
Priority: Major
Assignee: Unassigned
Reporter: Doron Cohen
Votes: 1
Watchers: 4
Lucene - Java

temporary file access denied on Windows

Created: 25/Aug/06 11:21 PM   Updated: 12/Jan/07 05:34 PM
Component/s: Store
Affects Version/s: 2.0.0
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
FSDirectory_Retry_Logic.patch (text file), 2006-08-25 11:21 PM, Doron Cohen, 4 kB
FSDirs_Retry_Logic_3.patch (text file, licensed for inclusion in ASF works), 2006-08-30 06:24 AM, Doron Cohen, 8 kB
FSWinDirectory.patch (text file, licensed for inclusion in ASF works), 2006-09-20 08:37 AM, Doron Cohen, 26 kB
FSWinDirectory_26_Sep_06.patch (text file, licensed for inclusion in ASF works), 2006-09-27 06:26 AM, Doron Cohen, 25 kB
Test_Output.txt (text file), 2006-08-25 11:21 PM, Doron Cohen, 3 kB
TestInterleavedAddAndRemoves.java (Java source file), 2006-08-25 11:21 PM, Doron Cohen, 4 kB
Environment: Windows

Resolution Date: 12/Jan/07 05:29 PM


Description
When interleaving adds and removes there is frequent opening/closing of readers and writers.

I tried to measure performance in such a scenario (for issue 565), but the performance test failed - the indexing process crashed consistently with file "access denied" errors - "cannot create a lock file" in "lockFile.createNewFile()" and "cannot rename file".

This is related to:

My test setup is: XP (SP1), JAVA 1.5 - both SUN and IBM SDKs.

I noticed that the problem is more frequent when locks are created on one disk and the index on another. Both are NTFS with Windows indexing service enabled. I suspect this indexing service might be related - keeping files busy for a while, but don't know for sure.

After experimenting with it, I conclude that these problems, at least in my scenario, are due to a temporary situation: the FS, or the OS, is temporarily holding references to files or folders, which prevents renaming them, deleting them, or creating new files in certain directories.

So I added retry logic to FSDirectory for cases where the error was related to "Access Denied". This is the same approach suggested in http://www.nabble.com/running-a-lucene-indexing-app-as-a-windows-service-on-xp%2C-crashing-tf2053536.html - there, in addition to the retry, gc() is invoked (I did not call gc()). This is based on the hope that an access-denied situation would vanish after a small delay, and the retry would succeed.

I modified FSDirectory this way for "Access Denied" errors that occur while creating a new file or renaming a file.

This worked fine for me. The performance test that failed before now completes. There should be no performance implications due to this modification, because only the cases that would otherwise wrongly fail now wait a few extra milliseconds and retry.
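
The single-retry idea described above can be sketched roughly as follows. This is not the code from the attached patch: the class RetryOnce, the interface IoAction and the method names are hypothetical, the sketch uses modern Java syntax, and for simplicity it retries on any IOException rather than only on "Access Denied" errors.

```java
import java.io.File;
import java.io.IOException;

/** Hypothetical helper for illustration; not the code from the attached patch. */
final class RetryOnce {
    /** Delay before the single retry; 30 ms is the initial guess mentioned in this issue. */
    static final long RETRY_DELAY_MS = 30;

    /** An I/O action that may fail transiently. */
    interface IoAction<T> {
        T run() throws IOException;
    }

    /** Runs the action; on IOException, sleeps once and tries a second time. */
    static <T> T withOneRetry(IoAction<T> action) throws IOException {
        try {
            return action.run();
        } catch (IOException first) {
            try {
                Thread.sleep(RETRY_DELAY_MS);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw first; // interrupted while waiting: report the original failure
            }
            return action.run(); // a second failure propagates to the caller
        }
    }

    /** Example use: create a lock file, tolerating one transient failure. */
    static boolean createLockFile(File lockFile) throws IOException {
        return withOneRetry(() -> lockFile.createNewFile());
    }
}
```

Only the failing path pays the delay; a call that succeeds on the first attempt returns immediately.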

I am attaching here a patch - FSDirectory_Retry_Logic.patch - that has these changes to FSDirectory.
All "ant test" tests pass with this patch.

Also attaching a test case that demonstrates the problem - at least on my machine. There are two test cases in that test file - one that works in system temp (like most Lucene tests) and one that creates the index on a different disk. The latter case can only run if the path ("D:" , "tmp") is valid.

It would be great if people that experienced these problems could try out this patch and comment whether it made any difference for them.

If it turns out useful for others as well, including this patch in the code might help to relieve some of those "frustration" user cases.

A comment on state of proposed patch:

  • It is not "ready to deploy" code - it has some debug printing, showing the cases where the "retry logic" actually took place.
  • I am not sure if current 30ms is the right delay... why not 50ms? 10ms? This is currently defined by a constant.
  • Should a call to gc() be added? (I think not.)
  • Should the retry be attempted also on "non access-denied" exceptions? (I think not).
  • I feel it is somewhat "voodoo programming", and though I don't like it, it seems to work...

Attached files:
1. TestInterleavedAddAndRemoves.java - the LONG test that fails on XP without the patch and passes with the patch.
2. FSDirectory_Retry_Logic.patch
3. Test_Output.txt - output of the test with the patch, on my XP. Only the createNewFile() case had to be bypassed in this test, but for another program I also saw the renameFile() being bypassed.

- Doron


Michael McCandless added a comment - 26/Aug/06 11:09 AM
Doron, just to confirm, is it the COMMIT lock that's throwing these unhandled exceptions (not the WRITE lock)? If so, lockless commits would fix this.

Also, once we switch to native locking (first "decoupling locking implementation from directory implementation": LUCENE-635, and then I'm working on a LockFactory that uses native locks within that) I think likely this would be fixed as well (assuming that createNewFile is failing because two separate processes are trying to do so at [nearly] the same time).

Can you provide more details on the exceptions you're seeing? Especially on the "cannot rename file" exception?


Michael McCandless added a comment - 26/Aug/06 11:48 AM
It may make more sense to trap "Access Denied" in the lock.obtain, but then translate this into "the lock was not acquired" (ie, just return 0). Because, above this code is the retry logic for the lock (which pauses by default for 1.0 sec).

Michael McCandless added a comment - 26/Aug/06 12:18 PM
I'm having trouble reproducing this issue. I copied the TestInterleavedAddAndRemoves.java into src/test/org/apache/lucene/index, then ran the test directly using "java org.junit.runner.JUnitCore org.apache.lucene.index.TestInterleavedAddAndRemoves", using a clean checkout of the current Lucene HEAD. The test is still running and is quite far along and I haven't hit any of the above errors.

I'm running on Windows XP SP2, Sun JDK 1.5.0_07. I wonder if SP1 vs SP2 makes the difference?

Could you also try [temporarily] turning off any virus / malware scanning tools? I wonder if you have one that's doing "live" checking and hold files open? (Though, I have a virus scanner running and it's not causing problems...).

I would like to reproduce this so I could test it against my fixes for lock-less commits!


Doron Cohen added a comment - 28/Aug/06 04:25 AM
> just to confirm, is it the COMMIT lock that's throwing these
> unhandled exceptions (not the WRITE lock)?
> If so, lockless commits would fix this.

In my tests so far, these errors appeared only for commit locks. However, I consider this a coincidence: as far as I can tell there is nothing special about commit locks compared to write locks; in particular, they both use createNewFile. So, I agree that lockless commits would prevent this, which is good, but we cannot count on it not happening for write locks as well.

Also, the more I think about it the more I like lock-less commits. Still, they would take a while to get into Lucene, while this simple fix can help easily now.

Last, even with lock-less commits, there would still be calls to createNewFile for the write lock, and there would be intensive calls to renameFile() and other file IO operations. With safety code like the retry logic, which is invoked only in the rare cases where these unexpected errors occur, some nasty errors would be reduced and more users would be happy.

> Can you provide more details on the exceptions you're seeing?
> Especially on the "cannot rename file" exception?

Here is one from my run log, which occurs at the call to optimize, at the end of all the add-remove iterations -

[junit] java.io.IOException: Cannot rename C:\Documents and Settings\tpowner\Local Settings\Temp\test.perf\index_24\deleteable.new to C:\Documents and Settings\tpowner\Local Settings\Temp\test.perf\index_24\deletable
[junit] at org.apache.lucene.store.FSDirectory.doRenameFile(FSDirectory.java:328)
[junit] at org.apache.lucene.store.FSDirectory.renameFile(FSDirectory.java:280)
[junit] at org.apache.lucene.index.IndexWriter.writeDeleteableFiles(IndexWriter.java:967)
[junit] at org.apache.lucene.index.IndexWriter.deleteSegments(IndexWriter.java:911)
[junit] at org.apache.lucene.index.IndexWriter.commitChanges(IndexWriter.java:872)
[junit] at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:823)
[junit] at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:798)
[junit] at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:614)
[junit] at org.apache.lucene.index.IndexModifier.optimize(IndexModifier.java:304)
[junit] at org.apache.lucene.index.TestBufferedDeletesPerf.doOptimize(TestBufferedDeletesPerf.java:266)
[junit] at org.apache.lucene.index.TestBufferedDeletesPerf.measureInterleavedAddRemove(TestBufferedDeletesPerf.java:218)
[junit] at org.apache.lucene.index.TestBufferedDeletesPerf.doTestBufferedDeletesPerf(TestBufferedDeletesPerf.java:144)
[junit] at org.apache.lucene.index.TestBufferedDeletesPerf.testBufferedDeletesPerfCase7(TestBufferedDeletesPerf.java:134)
[junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[junit] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
[junit] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
[junit] at java.lang.reflect.Method.invoke(Method.java:585)
[junit] at junit.framework.TestCase.runTest(TestCase.java:154)
[junit] at junit.framework.TestCase.runBare(TestCase.java:127)
[junit] at junit.framework.TestResult$1.protect(TestResult.java:106)
[junit] at junit.framework.TestResult.runProtected(TestResult.java:124)
[junit] at junit.framework.TestResult.run(TestResult.java:109)
[junit] at junit.framework.TestCase.run(TestCase.java:118)
[junit] at junit.framework.TestSuite.runTest(TestSuite.java:208)
[junit] at junit.framework.TestSuite.run(TestSuite.java:203)
[junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:297)
[junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:672)
[junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:567)
[junit] Caused by: java.io.FileNotFoundException: C:\Documents and Settings\tpowner\Local Settings\Temp\test.perf\index_24\deletable (Access is denied)
[junit] at java.io.FileOutputStream.open(Native Method)
[junit] at java.io.FileOutputStream.<init>(FileOutputStream.java:179)
[junit] at java.io.FileOutputStream.<init>(FileOutputStream.java:131)
[junit] at org.apache.lucene.store.FSDirectory.doRenameFile(FSDirectory.java:312)
[junit] ... 27 more

This exception btw is from the performance test for interleaved-adds-and-removes - issue 565 - so IndexWriter line numbers here relate to applying recent patch from issue 565 (though the same errors are obtained with the svn head of IndexWriter).

> It may make more sense to trap "Access Denied" in the lock.obtain,
> but then translate this into "the lock was not acquired" (ie, just return 0).
> Because, above this code is the retry logic for the lock
> (which pauses by default for 1.0 sec).

It is true that when the lock cannot be obtained the existing retry logic in Lock.java could handle it. But when you come to think of it, this is not the purpose of that Lock retry logic - that was for the case that the lock is really acquired by someone else, and we want to stay around for a while to try again. This is not the case here, although the symptoms are similar. Masking this error would not be a good idea. I think it is better for the code in FSDirectory to throw the exception if the retry fails as well (as currently in this patch), and let Lock.java apply its retry logic also for an IOException. If again, the retry of Lock class fails, it would be again problematic to hide the exception.

> I'm having trouble reproducing this issue. I copied the
> TestInterleavedAddAndRemoves.java into src/test/org/apache/lucene/index,
> then ran the test directly using "java org.junit.runner.JUnitCore
> org.apache.lucene.index.TestInterleavedAddAndRemoves",
> using a clean checkout of the current Lucene HEAD.
> The test is still running and is quite far along and I haven't hit any of the above errors.
>
> I'm running on Windows XP SP2, Sun JDK 1.5.0_07. I wonder if SP1 vs SP2 makes the difference?
>
> Could you also try [temporarily] turning off any virus / malware scanning tools?
> I wonder if you have one that's doing "live" checking and hold files open?
> (Though, I have a virus scanner running and it's not causing problems...).

I'm not sure here. I am also running with svn head. I am trying again now, after I turned off anti-virus, and disabled Windows indexing (though the service was already off), and disabled an afs client service that was running. I will report here if the errors happen again.

But I am not sure how this should affect decision on applying this fix - there would always be user machines out there running Lucene and also running other services.

We could tell users - hey, make sure that none of the other services / software running on your machine is holding / touching / examining Lucene index files, otherwise, don't blame Lucene - but this is not easily done. Not all developers out there have control or understanding of what's running on their machines - some programs are installed by a system support, you know how it is.

So, while it is understandable that Lucene would fail if there is a malicious software that actually grabs and holds Lucene files and interfere with them (for "long" periods of times), it would be nice to keep these failures at minimum.

> I would like to reproduce this so I could test it against my fixes for lock-less commits!

The performance test case for 565 is a more aggressive test in this regard - it produced more of these errors for me, including rename() errors. To run it, apply the most recent patch from http://issues.apache.org/jira/browse/LUCENE-565 - that would be NewIndexWriter.Aug23.patch. Notice that the run time (at least on my machine) is over 6 hours... I ran it btw with ant test, after modifying junit.includes in build.xml to run my test.


Doron Cohen added a comment - 28/Aug/06 07:23 AM
Stopping the anti-virus and its friends did not matter - still getting the errors.
However, I saw a case where the 30ms did not suffice for obtaining the lock in the retry.
Although 30ms was arbitrary in the first place, this is discouraging.
This was before the fix that lets Lock.obtain() apply its retry logic in case of such an exception.
So I fixed that (Lock.obtain()) and am re-running, now using 100ms instead of 30ms for the one retry in FSDirectory.
Ain't life fun.

Yonik Seeley added a comment - 28/Aug/06 03:24 PM
A single retry in Lock.obtain() makes the error less likely, but certainly not impossible... the second attempt could fail for the same reason.

obtain() is supposed to return success or failure immediately. I'd be tempted to override obtain(timeout) for FS locks and keep the retry logic there.

I agree we don't want to mask all IOExceptions and treat them as failure to acquire locks... they should bubble up sooner or later to help diagnose real IOExceptions.


Doron Cohen added a comment - 29/Aug/06 12:10 AM
> obtain() is supposed to return success or failure immediately.
> I'd be tempted to override obtain(timeout) for FS locks and keep the retry logic there.

Right, this is the right place for the retry. This way changes are limited to FSDirectory, and obtain() remains unchanged.

I am testing this now and will submit an updated patch, where:

  • UNEXPECTED_ERROR_RETRY_DELAY is set to 100ms.
  • timeout in obtain(timeout) is always respected (even in the presence of those unexpected io errors).
  • IOExceptions "bubble up" as discussed.
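
One way to read the three points above is the sketch below. It is not Lucene's Lock class and not the patch itself: TimedObtain and Attempt are hypothetical names, and the sleep-counting loop is just one simple way to respect the timeout. An IOException during the wait is remembered and rethrown once the timeout is exhausted, instead of being reported as an ordinary "lock not acquired".

```java
import java.io.IOException;

/** Hypothetical sketch of a timed lock acquisition; not Lucene's Lock class. */
final class TimedObtain {
    /** Pause between attempts, mirroring the 100 ms constant mentioned above. */
    static final long UNEXPECTED_ERROR_RETRY_DELAY_MS = 100;

    /** One immediate attempt: true if acquired, false if the lock is held elsewhere. */
    interface Attempt {
        boolean obtain() throws IOException;
    }

    /** Polls until the lock is acquired or roughly timeoutMs has been spent sleeping. */
    static boolean obtain(Attempt attempt, long timeoutMs)
            throws IOException, InterruptedException {
        long maxSleeps = timeoutMs / UNEXPECTED_ERROR_RETRY_DELAY_MS;
        IOException lastError = null;
        for (long sleeps = 0; ; sleeps++) {
            try {
                if (attempt.obtain()) {
                    return true;
                }
                lastError = null; // the lock is genuinely held by someone else
            } catch (IOException e) {
                lastError = e; // possibly transient, e.g. "Access is denied"
            }
            if (sleeps >= maxSleeps) {
                if (lastError != null) {
                    throw lastError; // let the real error bubble up
                }
                return false; // plain timeout
            }
            Thread.sleep(UNEXPECTED_ERROR_RETRY_DELAY_MS);
        }
    }
}
```

The immediate, no-argument obtain() stays untouched in this arrangement; all waiting lives in the timed variant.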

Doron Cohen added a comment - 30/Aug/06 06:24 AM
I am attaching an updated patch - FSDirs_Retry_Logic_3.patch.

In this update:

  • merge with code changes by issue 635 ("decouple locking from directory")
  • modified by recommendations in above comments:
  • do not rely on specific exception message text.
  • override lock.obtain(timeout) and handle unexpected exceptions there.
  • do not modify logic of obtain() (no changes to this method).
  • UNEXPECTED_ERROR_RETRY_DELAY set to 100ms.
  • debug prints commented out.

"ant test" tests all pass.
My stress IO test passes as well.


Michael McCandless added a comment - 30/Aug/06 10:56 AM

> But I am not sure how this should affect decision on applying this fix
> - there would always be user machines out there running Lucene and
> also running other services.

> We could tell users - hey, make sure that none of the other services /
> software running on your machine is holding / touching / examining
> Lucene index files, otherwise, don't blame Lucene - but this is not
> easily done. Not all developers out there have control or
> understanding of what's running on their machines - some programs are
> installed by a system support, you know how it is.

> So, while it is understandable that Lucene would fail if there is a
> malicious software that actually grabs and holds Lucene files and
> interfere with them (for "long" periods of times), it would be nice to
> keep these failures at minimum.

Alas I still cannot reproduce this. I think there must be some
environmental difference.

I agree, Lucene should strive to be robust to the various
"environmental differences" (OS, filesystem, permissions, virus
checkers installed, etc.) up to a degree, however, I still think it's
best to get to the root cause of these errors so users have the most
information possible: the more information the better. Plus this may
help us build a more accurate fix to the issue than sleeping /
retrying.

For example, if it turns out this happens only under Windows XP SP1,
yes we can try to make Lucene robust to these errors, but in addition,
we should document this so that those users that have the freedom to
do so could upgrade to SP2. (NOTE: I'm just using this as an example:
we still have no idea if it's SP1/SP2 difference that "fixes" the
errors in my testing of this issue).

Given that we have two environments, one very reliably showing these
IO problems (yours) and one very reliably not (mine), this is really a
great chance to get to the root cause. Here are the details of my
env:

OS: Windows XP Pro, SP2
Java: Sun JDK 1.5.0_07
Command line: java org.junit.runner.JUnitCore org.apache.lucene.index.TestInterleavedAddAndRemoves
Services running: Google desktop, Symantec AV


Doron Cohen added a comment - 31/Aug/06 06:38 AM
I think I know which software is causing/exposing this behavior in my environment.
This is the SVN client I am using - TortoiseSVN.

I tried the following sequence:
1) Run with TortoiseSVN installed - the test generates these "access denied" errors (and bypasses them).
2) Uninstalled TortoiseSVN (+reboot), ran the test - passes with no "access denied" errors.
3) Installed TortoiseSVN again (+reboot), ran the test - same "access denied" errors again.

I am using the most recent stable TortoiseSVN version - 1.3.5 build 6804 - 32 bit, for svn-1.3.2, downloaded from http://tortoisesvn.tigris.org/.

There is an interesting discussion thread about this type of error on Windows platforms in the svn forums - http://svn.haxx.se/dev/archive-2003-10/0136.shtml. In that case it was svn that suffered from these errors.

It says "...Windows allows applications to "tag-along" to see when a file has been written - they will wait for it to close and then do whatever they do, usually opening a file descriptor or handle. This would prevent that file from being renamed for a brief period..."

TortoiseSVN is a shell extension integrated into Windows explorer. As such, it probably demonstrates the "tag-along" behavior described above.

(BTW, it is a great svn client in my opinion)

Here is another excerpt from that discussion thread -
>>
>> sleep(1) would work, I suppose. ;~)
>>
> Most of the time, but not all the time. The only way I've made it work
> well on all the machines I've tried it on is to put it into a sleep(1)
> and retry loop of at least 20 or so attempts. Anything less and it
> still fails on some machines. That implies it is very dependent on
> machine speed or something, which means sleep times/retry times are just
> guessing games at best.
>
> If I could just get it recreated outside of Subversion and prove it's a
> Microsoft problem...although it probably still wouldn't get fixed for
> months at least.

We don't know that this is a bug in TortoiseSVN.
We cannot be sure that there are no other such tag-along applications on users' machines.
One cannot seriously expect this Win32 behavior to be fixed.

I guess the question is: is it worth it for Lucene to attempt to at least reduce the chances of failure in this case? (I say yes.)


Michael McCandless added a comment - 31/Aug/06 11:25 PM
Wow! Fantastic sleuthing. I never would have guessed that.

Michael McCandless added a comment - 13/Sep/06 06:39 PM

I just sent this summary of this to java-user:

There is an issue opened on Lucene:

http://issues.apache.org/jira/browse/LUCENE-665

that I'd like to draw your attention to and summarize here because
recently users have hit it.

The gist of the issue is: on Windows, you sometimes see intermittent
"Access Denied" errors in renaming segments.new to segments or
deletable.new to deletable, etc. Lucene typically writes files first
to X.new and then renames them to X.

I know there was at least one recent thread where someone was hitting
this and there have been others in the past (including other Jira
issues).

Anyway, at the end of the issue it was discovered that there was an
unrelated piece of software (TortoiseSVN client) installed which was
using a filesystem "change log" capability in Windows that was
"causing" the problem: uninstalling it made the errors go away.

Unfortunately, there are apparently many software packages that use
this "change log" capability in Windows (virus checkers, Microsoft's
indexing service, etc.) and so the above issue remains open to figure
out whether / how to make Lucene robust to these cases.

But the bottom line is: if you hit these "Access Denied" errors, one
workaround is to try to turn off or uninstall the software that might
be doing this. I realize in many cases that's not an option (it's a
production box; you can't turn off virus checkers; etc.), but at least
it's something to try if you can, until there's some resolution on
that issue.


Michael McCandless added a comment - 13/Sep/06 07:27 PM
I do think we should make Lucene robust to "windows change log"
software.

We could take the position that you have to uninstall such software
because they "conflict" with Lucene, but I don't think that's
realistic. Apparently many packages use this convenient API and that
will only get worse with time.

I would put this under the "Lucene should assume the least common
denominator of filesystem's capabilities" umbrella. Meaning, Lucene
now assumes it can rename files right after closing them, but on
Windows this isn't a safe assumption so if possible we should change
the index format to not require this.

I will try to reproduce this bug with my [upcoming] changes for
lockless commits (numbered segments files) – the lockless commits
changes do much less file renaming, so the issue should be rarer (but
could still occur).


Hoss Man added a comment - 13/Sep/06 09:48 PM
FYI: on the surface FSDirs_Retry_Logic_3.patch scares me because in many cases wait/retry logic is imposed inside of catch(Throwable) blocks ... that seems a little too broad to me.

Doron Cohen added a comment - 18/Sep/06 07:15 AM
My summary - and "what's next" proposal - for the discussion so far (in comments for issue-665 and in thread http://www.nabble.com/-jira--Created%3A-%28LUCENE-665%29-temporary-file-access-denied-on-Windows-tf2167540.html):

[1] The reported problem can be reproduced on Windows in the presence of programs that monitor files.

[2] The proposed fix adds a retry after a 100ms delay in the rare cases where the problem occurs.

[3] That fix greatly reduces the chances of the problem but does not really solve it.

[4] The proposed fix for FSDirectory was not accepted because:
[4.1] 100ms may be too long for highly interactive programs.
[4.2] 100ms can be insufficient in some cases.
[4.3] non windows environments might be affected with no justification.
[4.4] work in progress "lock-less" commits may reduce chances for this problem.

[5] A Windows-specific implementation of FSDir that would not be the default, but would be available for application to select, was proposed as a better place to host this retry logic, to be available for applications at least until the "lock-less" commits is available for use and proves to solve the same problem.

So, I intend to write this solution as outlined in [5] above. It would be optional, definitely not the default. Applications would be able to use it for Windows environments. The retry behavior would be configurable. In addition, whether to apply retry logic to lock deletion would be configurable, with a default of 'no', because in NFS a delete may return 'failed' due to a time-out although it actually succeeded, and retry logic in this case might "kill" voluntary file locking schemes like the default one used by Lucene (though I assume that with the NFS native locks proposed by Michael this is not the case).

Hope this reflects the discussion so far...


Doron Cohen added a comment - 20/Sep/06 08:37 AM
Attached patch - FSWinDirectory - implements retry logic of FS operations in a separate non default directory class as discussed above.

By default this new class is not used. Applications can start using it by replacing the IMPL class in FSDirectory to be the new class FSWinDirectory.

There are two ways to do this - by setting a system property (this is the original mechanism), or by calling FSDirectory static (new) method - setFSDirImplClass(name).

There are 3 new classes in this patch:

  • FSWinDirectory (extends FSDirectory)
  • SimpleFSWinLockFactory (extends SimpleFSLockFactory)
  • TestWinLockFactory (extends TestLockFactory).

A few simple modifications were required in FSDirectory, SimpleFSLockFactory and TestLockFactory in order to allow inheritance.

Tests:

  • "ant test" passes with new code.
  • For test, I modified my copy of build-common.xml to set a system property so that the new WinFS class was always in effect and ran the tests - all passed.
  • my stress test TestInterleavedAddAndRemoves fails in my env by default and passes when FSWinDirectory is in effect.

Michael McCandless added a comment - 20/Sep/06 11:39 AM
Doron, which version of TortoiseSVN did you have installed when you got the exceptions?

I've installed version 1.4.0 on my Windows XP SP2 box, and then ran your stress test just fine, ie, I can't reproduce the errors (to verify that lock-less commits fixes this).


Doron Cohen added a comment - 27/Sep/06 06:26 AM
Updated the patch according to review comments by Hoss, plus:
  • protect currMillis usage from system clock modifications.
  • all Win specific code in a single Java file with two inner classes, for "cleaner" javadocs (now waitForRetry() is private).
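
How the patch protects its timing from system clock changes is not spelled out here. One common approach, shown purely as an assumption, is to measure elapsed time with System.nanoTime(), which is not tied to wall-clock time, instead of System.currentTimeMillis(). The class RetryDeadline and its methods are hypothetical names.

```java
/** Hypothetical deadline helper; the names are not taken from the patch. */
final class RetryDeadline {
    private final long startNanos;
    private final long budgetNanos;

    /** Starts a retry budget of budgetMs milliseconds from now. */
    RetryDeadline(long budgetMs) {
        this.startNanos = System.nanoTime();
        this.budgetNanos = budgetMs * 1_000_000L;
    }

    /** True once the budget is used up; unaffected by changes to the system clock. */
    boolean expired() {
        return System.nanoTime() - startNanos >= budgetNanos;
    }

    /** Milliseconds left in the budget, never negative. */
    long remainingMs() {
        long left = budgetNanos - (System.nanoTime() - startNanos);
        return Math.max(0L, left / 1_000_000L);
    }
}
```

A retry loop would check expired() before each sleep, so setting the system clock forward or backward can neither cut the wait short nor stretch it.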

Tested as previous patch:

  • "ant test" passes with new code.
  • For test, modified build-common.xml to set a system property so that the new WinFS class was always in effect and ran the tests - all passed.
  • my stress test TestInterleavedAddAndRemoves fails in my env by default and passes when FSWinDirectory is in effect.

Michael McCandless added a comment - 27/Oct/06 11:56 PM
Doron, I finally managed to see an exception like yours above, but I had to have the Windows Explorer open to the index directory and then right click on files, while the indexing was happening. Once I could get this to happen I then found that the lock-less patch ( LUCENE-701 ) plus native locking seemed to prevent the issue (I think it should because no file renaming is done). But given how hard it is for me to reproduce this, can you try the combination of lock-less and native locking in your environment to see if that prevents this issue? Thanks.

Doron Cohen added a comment - 30/Oct/06 07:20 AM
Michael, I am not able to generate this with native locks. (did not try with lockless commits).
Which brings me to think that native locks should be made default?

There is another thing that bothers me with locks, in NFS or other shared fs situations:
Locks are maintained in a specified folder, but a lock file name is derived from the full path of the index dir, actually the canonical name of this dir. So, if the same index is accessed by two machines, the <drive> / <mount> / <fs> root of that index dir must be named the same on all the machines on which Lucene is invoked to access/maintain that index.

The documentation for File.getCanonicalPath() says that it is system dependent. So I am not even sure how it can be guaranteed that Lucene used on Linux and Lucene used on Windows (say) that access the same index would be able to lock on the same index. And for two Windows machines, the admin would have to verify that the index fs (samba/afs/nfs) mounts with the same drive letter.

This seems like a limitation on one hand, and also a source of possible problems when users misconfigure their mount names.

I may be missing something trivial here, because it seems too wrong to be true... I'll let the list comment on that...
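
The concern can be made concrete with a small sketch. The naming scheme below (an MD5 hash of the path string, with a "lucene-" prefix and a "-write.lock" suffix) is an assumption chosen for illustration and may differ from what Lucene actually does; LockNameDemo and lockName are hypothetical names. The point is only that any name derived from the canonical path differs as soon as two machines see the same index under different paths.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical illustration; the naming scheme is an assumption, not Lucene's exact code. */
final class LockNameDemo {
    /** Derives a lock-file name from a canonical index path by hashing the path string. */
    static String lockName(String canonicalIndexPath) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(canonicalIndexPath.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder("lucene-");
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff)); // two hex digits per byte
            }
            return sb.append("-write.lock").toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide
            throw new IllegalStateException(e);
        }
    }
}
```

With such a scheme, lockName("Z:\\index") on one machine and lockName("/mnt/share/index") on another yield different file names, so the two processes would never contend for the same lock file even though they share one index.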


Michael McCandless added a comment - 30/Oct/06 10:21 AM
Odd that just by using native locking, it stopped your issues. Lucene (without lock-less commits) does quite a bit of file renaming (eg the deletable renaming in your exception above). I don't get why switching to native locking by itself would fix the renaming errors.

Yes the lock prefix is likely to not match when the machines mount to a different point, and almost certainly not if the machines are different OSs. To deal with this I just use LockFactory.setLockPrefix() after the LockFactory has been assigned to a directory. I added to the Javadoc for that method in the native locking implementation for exactly this use case.


Doron Cohen added a comment - 30/Oct/06 09:56 PM
> Odd that just by using native locking, it stopped your issues.

Agree. I did not expect that to happen, since indeed I saw in the past exceptions on renameFile, though most exceptions were in locks activity. So I ran it many times, with an antivirus scan, etc. But it always passes. Therefore I would not object to closing this issue - If I cannot test it I cannot fix it. But for the same reason, I would like to see native locks becoming the default.

> setLockPrefix()

I'll take this one to a separate thread on the dev list.


Michael McCandless added a comment - 12/Jan/07 11:15 AM
Doron, can we close this issue now? I think native locking and/or fewer IO operations with lockless commits has resolved it?

Doron Cohen added a comment - 12/Jan/07 04:59 PM
Hi Michael,

Funny that I got this email with reply-to to you rather than the list.
Funnier part is that I really wanted to reply you directly rather than the
list. Is JIRA a mind reader?

Yes, I would like to close the issue - I already said that in my Oct 30
post.

I would like to do this myself - should I "close" or "resolve" the issue?
or perhaps first resolve and then close? I think I read somewhere the life
cycle of an issue but I cannot find it. I am also wondering if it should be
with "won't fix" or "duplicate"?

Thanks,
Doron



Michael McCandless added a comment - 12/Jan/07 05:27 PM
OK sounds good.

Weird that the reply-to was my email. Normally it's java-dev?

I guess I would "resolve" it as "fixed", but don't "close" it yet, see here:

http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200605.mbox/%3c4469FDC7.70600@apache.org%3e

I don't think it's a "duplicate" since it really is its own bug even if it shared a root cause (& fix) with another bug. And I don't think it's a "won't fix" since it is now fixed (in the trunk).


Doron Cohen added a comment - 12/Jan/07 05:29 PM
With lockless commits this is no longer reproducible. Although in theory it should still be possible to reproduce it in some cases, practice suggests otherwise, and there seems to be insufficient justification for introducing retry logic (which is not a 100% solution anyhow).

Doron Cohen added a comment - 12/Jan/07 05:34 PM
In case anyone else is looking for this - the Jira "life cycle" is discussed in http://www.nabble.com/jira-workflow-tf2459130.html#a6853917

(would have been nice if my "life cycle" query had been expanded with "workflow"...)

For the workflow this is also useful: http://wiki.apache.org/lucene-hadoop-data/attachments/JiraWorkflow/attachments/workflow.png