Lucene - Core
  1. Lucene - Core
  2. LUCENE-5738

NativeLock is release if Lock is closed after obtain failed

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.8.1
    • Fix Version/s: 4.9, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      if you obtain the NativeFSLock and try to obtain it again in the same JVM and close if if it fails another process will be able to obtain it. This is pretty trappy though. If you execute the main class twice the problem becomes pretty obvious.

      import org.apache.lucene.store.Lock;
      import org.apache.lucene.store.NativeFSLockFactory;
      
      import java.io.File;
      import java.io.IOException;
      
      public class TestLock {
       public static void main(String[] foo) throws IOException, InterruptedException {
              NativeFSLockFactory lockFactory = new NativeFSLockFactory(new File("/tmp"));
              Lock lock = lockFactory.makeLock("LOCK");
              if (lock.obtain()) {
                  System.out.println("OBTAINED");
              } else {
                  lock.close();
                  System.out.println("FAILED");
              }
              // try it again and close it if it fails
              lock = lockFactory.makeLock("LOCK"); // <<<<==== this is a new lock
              if (lock.obtain()) {
                  System.out.println("OBTAINED AGAIN");
              } else {
                  lock.close(); // <<<<==== this releases the lock we obtained
                  System.out.println("FAILED on Second");
              }
              Thread.sleep(Integer.MAX_VALUE);
          }
      }
      
      
      1. LUCENE-5738.patch
        9 kB
        Simon Willnauer
      2. LUCENE-5738.patch
        8 kB
        Simon Willnauer

        Activity

        Hide
        Simon Willnauer added a comment -

        Ok this gets more funky... I modified the test since it is not exactly what I was seeing....

         public static void main(String[] foo) throws IOException, InterruptedException {
                NativeFSLockFactory lockFactory = new NativeFSLockFactory(new File("/tmp"));
                Lock lock = lockFactory.makeLock("LOCK");
                if (lock.obtain()) {
                    System.out.println("OBTAINED");
                } else {
                    System.out.println("FAILED");
                }
        
                lock = lockFactory.makeLock("LOCK");   // <<<<==== this is a new lock
                // here when we call obtain we run into an OverlappingFileLockException
                // this exception closes the file channel and that causes the
                // internal file lock table to be invalidated. This essentially releases the W lock of the first descriptor
                if (lock.obtain()) {
                    System.out.println("OBTAINED AGAIN");
                } else {
                    System.out.println("FAILED on Second");
                }
                Thread.sleep(Integer.MAX_VALUE);
            }
        

        it seems that the FileChannel release all locks if it is closed.. There is some funky code in FileChannelImpl.java I would't be suprised if it has bugs

        Show
        Simon Willnauer added a comment - Ok this gets more funky... I modified the test since it is not exactly what I was seeing.... public static void main( String [] foo) throws IOException, InterruptedException { NativeFSLockFactory lockFactory = new NativeFSLockFactory( new File( "/tmp" )); Lock lock = lockFactory.makeLock( "LOCK" ); if (lock.obtain()) { System .out.println( "OBTAINED" ); } else { System .out.println( "FAILED" ); } lock = lockFactory.makeLock( "LOCK" ); // <<<<==== this is a new lock // here when we call obtain we run into an OverlappingFileLockException // this exception closes the file channel and that causes the // internal file lock table to be invalidated. This essentially releases the W lock of the first descriptor if (lock.obtain()) { System .out.println( "OBTAINED AGAIN" ); } else { System .out.println( "FAILED on Second" ); } Thread .sleep( Integer .MAX_VALUE); } it seems that the FileChannel release all locks if it is closed.. There is some funky code in FileChannelImpl.java I would't be suprised if it has bugs
        Hide
        Michael McCandless added a comment -

        Simple (non-Lucene) Java test repros the issue:

        import java.io.File;
        import java.io.IOException;
        
        import java.nio.channels.FileChannel;
        import java.nio.channels.FileLock;
        
        import java.nio.channels.OverlappingFileLockException;
        import java.nio.file.StandardOpenOption;
        
        public class TestLock {
            public static void obtain() throws Exception {
                File path = new File("/tmp/LOCK");
                FileChannel channel = FileChannel.open(path.toPath(), StandardOpenOption.CREATE, StandardOpenOption.WRITE);
                System.out.println("got channel " + channel);
                FileLock lock = null;
                try {
                    lock = channel.tryLock();
                } catch (IOException | OverlappingFileLockException e) {
                    // failed to get lock
                } finally {
                    if (lock == null) {
                        System.out.println("FAILED");
                        channel.close();
                     } else {
                        System.out.println("SUCCESS");
                    }
                }
            }
        
            public static void main(String[] foo) throws Exception {
                obtain();
                obtain();
                Thread.sleep(Integer.MAX_VALUE);
            }
        }
        
        Show
        Michael McCandless added a comment - Simple (non-Lucene) Java test repros the issue: import java.io.File; import java.io.IOException; import java.nio.channels.FileChannel; import java.nio.channels.FileLock; import java.nio.channels.OverlappingFileLockException; import java.nio.file.StandardOpenOption; public class TestLock { public static void obtain() throws Exception { File path = new File("/tmp/LOCK"); FileChannel channel = FileChannel.open(path.toPath(), StandardOpenOption.CREATE, StandardOpenOption.WRITE); System.out.println("got channel " + channel); FileLock lock = null; try { lock = channel.tryLock(); } catch (IOException | OverlappingFileLockException e) { // failed to get lock } finally { if (lock == null) { System.out.println("FAILED"); channel.close(); } else { System.out.println("SUCCESS"); } } } public static void main(String[] foo) throws Exception { obtain(); obtain(); Thread.sleep(Integer.MAX_VALUE); } }
        Hide
        Michael McCandless added a comment -

        Uwe Schindler I think the JVM is still buggy here, and we need that hacky static map back maybe?

        Show
        Michael McCandless added a comment - Uwe Schindler I think the JVM is still buggy here, and we need that hacky static map back maybe?
        Hide
        Michael McCandless added a comment -

        Also it's really too bad our tests didn't detect this.

        Show
        Michael McCandless added a comment - Also it's really too bad our tests didn't detect this.
        Hide
        Uwe Schindler added a comment -

        Uwe Schindler I think the JVM is still buggy here, and we need that hacky static map back maybe?

        The static map was there to not obtain the lock 2 times from the same JVM. It could help to redo this, but I am not yet sure, what the real bug is. Does this also release the lock of another process? If yes, its a problem of the O/S.

        Show
        Uwe Schindler added a comment - Uwe Schindler I think the JVM is still buggy here, and we need that hacky static map back maybe? The static map was there to not obtain the lock 2 times from the same JVM. It could help to redo this, but I am not yet sure, what the real bug is. Does this also release the lock of another process? If yes, its a problem of the O/S.
        Hide
        Michael McCandless added a comment -

        Somehow, the 2nd channel.close releases the first lock obtained in the same process.

        I.e., if you start that test program you'll see it print SUCCESS then FAILED, and while it's still running go run it again in another window, and that 2nd process also prints SUCCESS then FAILED.

        Show
        Michael McCandless added a comment - Somehow, the 2nd channel.close releases the first lock obtained in the same process. I.e., if you start that test program you'll see it print SUCCESS then FAILED, and while it's still running go run it again in another window, and that 2nd process also prints SUCCESS then FAILED.
        Hide
        Richard Boulton added a comment -

        I believe that Java's NativeFSLock uses fcntl on Linux. Unfortunately, the semantics of fcntl are consistent with the behaviour described above.

        Specifically:

        • a process has first to open a file descriptor to obtain a lock with fcntl
        • when that file descriptor is closed, the lock is released
        • however, the lock is also released if any file descriptor on the same underlying file is closed by the same process as is holding the lock. (I'm not certain of the behaviour when file handles are passed to other processes, either explicitly or by forking.)
        • the lock is not released, however, if a different process opens and then closes a file descriptor on the file.

        So, there are two approaches I know of to get the semantics you probably want (ie, that the only way the lock is closed is if the process holding it exits, or the lock is explicitly closed).

        1: make sure files which are locked aren't opened multiple times while locked. This probably needs some process-wide state to track which files have locks held on them. This is, of course, awkward for
        a library, since you don't have control over code in the same process which isn't part of the library.
        2: fork a subprocess to hold the lock open. This is expensive, but is the approach we took with Xapian. I'm not sure it would be workable if you lock things at all frequently, though.

        Show
        Richard Boulton added a comment - I believe that Java's NativeFSLock uses fcntl on Linux. Unfortunately, the semantics of fcntl are consistent with the behaviour described above. Specifically: a process has first to open a file descriptor to obtain a lock with fcntl when that file descriptor is closed, the lock is released however, the lock is also released if any file descriptor on the same underlying file is closed by the same process as is holding the lock. (I'm not certain of the behaviour when file handles are passed to other processes, either explicitly or by forking.) the lock is not released, however, if a different process opens and then closes a file descriptor on the file. So, there are two approaches I know of to get the semantics you probably want (ie, that the only way the lock is closed is if the process holding it exits, or the lock is explicitly closed). 1: make sure files which are locked aren't opened multiple times while locked. This probably needs some process-wide state to track which files have locks held on them. This is, of course, awkward for a library, since you don't have control over code in the same process which isn't part of the library. 2: fork a subprocess to hold the lock open. This is expensive, but is the approach we took with Xapian. I'm not sure it would be workable if you lock things at all frequently, though.
        Hide
        Richard Boulton added a comment -

        In case it wasn't clear in my previous comment, the problem is that when you try to obtain a lock on a file already locked by the same process, you open and then close a second file descriptor on the file, and when that file descriptor is closed, the lock is removed by the OS.

        Note this from the FileLock documentation (http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/nio/channels/FileLock.java#28 )

        "On some systems, closing a channel releases all locks held by the Java virtual machine on the underlying file regardless of whether the locks were acquired via that channel or via another channel open on the same file. It is strongly recommended that, within a program, a unique channel be used to acquire all locks on any given file."

        Show
        Richard Boulton added a comment - In case it wasn't clear in my previous comment, the problem is that when you try to obtain a lock on a file already locked by the same process, you open and then close a second file descriptor on the file, and when that file descriptor is closed, the lock is removed by the OS. Note this from the FileLock documentation ( http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/nio/channels/FileLock.java#28 ) "On some systems, closing a channel releases all locks held by the Java virtual machine on the underlying file regardless of whether the locks were acquired via that channel or via another channel open on the same file. It is strongly recommended that, within a program, a unique channel be used to acquire all locks on any given file."
        Hide
        Hoss Man added a comment -

        Somehow, the 2nd channel.close releases the first lock obtained in the same process.

        Isn't this a valid aspect of the API as documented?

        http://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileLock.html

        On some systems, closing a channel releases all locks held by the Java virtual machine on the underlying file regardless of whether the locks were acquired via that channel or via another channel open on the same file. It is strongly recommended that, within a program, a unique channel be used to acquire all locks on any given file.

        Show
        Hoss Man added a comment - Somehow, the 2nd channel.close releases the first lock obtained in the same process. Isn't this a valid aspect of the API as documented? http://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileLock.html On some systems, closing a channel releases all locks held by the Java virtual machine on the underlying file regardless of whether the locks were acquired via that channel or via another channel open on the same file. It is strongly recommended that, within a program, a unique channel be used to acquire all locks on any given file.
        Hide
        Simon Willnauer added a comment -

        Isn't this a valid aspect of the API as documented?

        well IMO that is not what is happening. Yes, the lock is released such that another process can acquire it but the same process can't and that is what makes this trappy and inconsistent IMO. However I think we have to bring back the static map to this and ramp up the tests otherwise this is too trappy

        Show
        Simon Willnauer added a comment - Isn't this a valid aspect of the API as documented? well IMO that is not what is happening. Yes, the lock is released such that another process can acquire it but the same process can't and that is what makes this trappy and inconsistent IMO. However I think we have to bring back the static map to this and ramp up the tests otherwise this is too trappy
        Hide
        Uwe Schindler added a comment -

        Hi,

        in that case, with the help of Robert Muir, we should re-add the process-wide static map of already aquired locks and check against it before aquiring a new lock.

        Richard Boulton: Java file handles never escape from the local process. We have only 2 use-cases:

        • A single process must prevent access to the same index. This is the issue here and can be solved by re-adding the static map of locks.
        • Another process must prevent access to the same index: This is currently no issue, because it cannot release the lock of other process.
        Show
        Uwe Schindler added a comment - Hi, in that case, with the help of Robert Muir , we should re-add the process-wide static map of already aquired locks and check against it before aquiring a new lock. Richard Boulton : Java file handles never escape from the local process. We have only 2 use-cases: A single process must prevent access to the same index. This is the issue here and can be solved by re-adding the static map of locks. Another process must prevent access to the same index: This is currently no issue, because it cannot release the lock of other process.
        Hide
        Uwe Schindler added a comment -

        well IMO that is not what is happening.

        It is waht is happening. The close() of the file handle is releasing the lock of the other file handle in the same process. This has nothing to do with the problem of the lock aquire failing, its just whats documented. So a single process should never try to aquire the lock multiple times through filesystem. We should only use the native lock between processes, for the single process case we should use the map.

        Show
        Uwe Schindler added a comment - well IMO that is not what is happening. It is waht is happening. The close() of the file handle is releasing the lock of the other file handle in the same process. This has nothing to do with the problem of the lock aquire failing, its just whats documented. So a single process should never try to aquire the lock multiple times through filesystem. We should only use the native lock between processes, for the single process case we should use the map.
        Hide
        Uwe Schindler added a comment -

        Also it's really too bad our tests didn't detect this.

        The new test, we added in Lucene 4.7, cannot ctahc this, because it only tests that locking between processes work. But we should add the failing test above to a conventional unit test. It can be done with one process, i.e. a single unit test:

        • Aquire the lock
        • Try to aquire the lock a second time
        • closae the failed lock
        • try to aquire the lock a thrid time -> it should still not work
        • relaease the master lock
        • try to aquire again -> should now work
        Show
        Uwe Schindler added a comment - Also it's really too bad our tests didn't detect this. The new test, we added in Lucene 4.7, cannot ctahc this, because it only tests that locking between processes work. But we should add the failing test above to a conventional unit test. It can be done with one process, i.e. a single unit test: Aquire the lock Try to aquire the lock a second time closae the failed lock try to aquire the lock a thrid time -> it should still not work relaease the master lock try to aquire again -> should now work
        Hide
        Simon Willnauer added a comment -

        It is waht is happening. The close() of the file handle is releasing the lock of the other file handle in the same process. This has nothing to do with the problem of the lock aquire failing, its just whats documented. So a single process should never try to aquire the lock multiple times through filesystem. We should only use the native lock between processes, for the single process case we should use the map.

        no you are not reading it right... what I am saying is that:

        • obtain lock
        • try again from same process, (this fails & closes the channel & release the nativ lock on the FS)
        • try again from same process (this fails again while another process can succeed)

        this is a problem here since it seems that the JVM prevents itself from obtaining it twice. That is what I am arguing about and we can't detect this with a unittest in a single process.

        Show
        Simon Willnauer added a comment - It is waht is happening. The close() of the file handle is releasing the lock of the other file handle in the same process. This has nothing to do with the problem of the lock aquire failing, its just whats documented. So a single process should never try to aquire the lock multiple times through filesystem. We should only use the native lock between processes, for the single process case we should use the map. no you are not reading it right... what I am saying is that: obtain lock try again from same process, (this fails & closes the channel & release the nativ lock on the FS) try again from same process (this fails again while another process can succeed) this is a problem here since it seems that the JVM prevents itself from obtaining it twice. That is what I am arguing about and we can't detect this with a unittest in a single process.
        Hide
        Simon Willnauer added a comment -

        here is a patch that adds a more elegant static map and fixed the stress test to fail without the map. It now passes but we should stress it a bit more and run it for a while.

        Show
        Simon Willnauer added a comment - here is a patch that adds a more elegant static map and fixed the stress test to fail without the map. It now passes but we should stress it a bit more and run it for a while.
        Hide
        Michael McCandless added a comment -

        +1, patch looks good

        I wouldn't use the world elegant: this is an annoying JDK trap that we have to work around!

        You changed the test to also sometimes try to acquire the lock (in the same process) when it already has the lock, and assert that 2nd acquire failed. With the bug, this would release the first lock the process had acquired, allowing the 2nd process to illegally obtain the lock, and then the server fails.

        Can you rename LOCK_MARKERS to LOCK_HELD (and clearMarkedLock to clearHeldLock) to make it clear that's the purpose of the map? Also the indent of clearMarkedLock is off.

        Show
        Michael McCandless added a comment - +1, patch looks good I wouldn't use the world elegant: this is an annoying JDK trap that we have to work around! You changed the test to also sometimes try to acquire the lock (in the same process) when it already has the lock, and assert that 2nd acquire failed. With the bug, this would release the first lock the process had acquired, allowing the 2nd process to illegally obtain the lock, and then the server fails. Can you rename LOCK_MARKERS to LOCK_HELD (and clearMarkedLock to clearHeldLock) to make it clear that's the purpose of the map? Also the indent of clearMarkedLock is off.
        Hide
        Simon Willnauer added a comment -

        updated patch including CHANGES.TXT entry.

        Show
        Simon Willnauer added a comment - updated patch including CHANGES.TXT entry.
        Hide
        Michael McCandless added a comment -

        +1, looks great. Thanks Simon!

        Show
        Michael McCandless added a comment - +1, looks great. Thanks Simon!
        Hide
        Uwe Schindler added a comment - - edited

        Hi Simon,

        looks good. Sorry for confusing responses. The issue is, as you say, the combination of 2 issues, of which one is a real bug in the JDK:

        • The lock is released if you call close on any FileChannel
        • The JDK bug where we cannot requaire the lock in the same JVM, because the JVM "thinks" its still held, but it isnt, so another process can get it.

        About the patch:
        I like it, much simplier than before Robert's cleanup. One small thing: In Java 7 we should use IOUtils only if required, for the use case here (the finally block) we can use a cool "trick". The pros for doing it like that is, that no Exceptions may be supressed, the are recorded:

        Replace:

        } finally {
          if (obtained == false) { // not successful - clear up and move out
            clearMarkedLock(path);
            final FileChannel toClose = channel;
            channel = null;
            IOUtils.closeWhileHandlingException(toClose);
          }
        }
        

        by

        } finally {
          if (obtained == false) { // not successful - clear up and move out
            try (FileChannel toClose = channel) {
              clearMarkedLock(path);
              channel = null;
            }
          }
        }
        

        I will look into the LockStressTest, but for now it looks fine.

        You can run the stress tester for very long time using some system properties (like running it the whole night).

        Show
        Uwe Schindler added a comment - - edited Hi Simon, looks good. Sorry for confusing responses. The issue is, as you say, the combination of 2 issues, of which one is a real bug in the JDK: The lock is released if you call close on any FileChannel The JDK bug where we cannot requaire the lock in the same JVM, because the JVM "thinks" its still held, but it isnt, so another process can get it. About the patch: I like it, much simplier than before Robert's cleanup. One small thing: In Java 7 we should use IOUtils only if required, for the use case here (the finally block) we can use a cool "trick". The pros for doing it like that is, that no Exceptions may be supressed, the are recorded: Replace: } finally { if (obtained == false ) { // not successful - clear up and move out clearMarkedLock(path); final FileChannel toClose = channel; channel = null ; IOUtils.closeWhileHandlingException(toClose); } } by } finally { if (obtained == false ) { // not successful - clear up and move out try (FileChannel toClose = channel) { clearMarkedLock(path); channel = null ; } } } I will look into the LockStressTest, but for now it looks fine. You can run the stress tester for very long time using some system properties (like running it the whole night).
        Hide
        ASF subversion and git services added a comment -

        Commit 1600827 from Simon Willnauer in branch 'dev/trunk'
        [ https://svn.apache.org/r1600827 ]

        LUCENE-5738: Ensure NativeFSLock prevents opening the file channel twice if lock is held

        Show
        ASF subversion and git services added a comment - Commit 1600827 from Simon Willnauer in branch 'dev/trunk' [ https://svn.apache.org/r1600827 ] LUCENE-5738 : Ensure NativeFSLock prevents opening the file channel twice if lock is held
        Hide
        ASF subversion and git services added a comment -

        Commit 1600831 from Simon Willnauer in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1600831 ]

        LUCENE-5738: Ensure NativeFSLock prevents opening the file channel twice if lock is held

        Show
        ASF subversion and git services added a comment - Commit 1600831 from Simon Willnauer in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1600831 ] LUCENE-5738 : Ensure NativeFSLock prevents opening the file channel twice if lock is held
        Hide
        Simon Willnauer added a comment -

        committed to 4x and trunk

        Show
        Simon Willnauer added a comment - committed to 4x and trunk
        Hide
        Uwe Schindler added a comment -

        Commit was a little bit too fast for me... But no problem at all, was just a comment to not use IOUtils if not really needed. The code pattern I posted before is from official Oracle JDK code (used like that now at many places). It is somehow a "misuse" of try-with-resources, but very elegant.

        Show
        Uwe Schindler added a comment - Commit was a little bit too fast for me... But no problem at all, was just a comment to not use IOUtils if not really needed. The code pattern I posted before is from official Oracle JDK code (used like that now at many places). It is somehow a "misuse" of try-with-resources, but very elegant.
        Hide
        Simon Willnauer added a comment -

        I saw your comment. I didn't want to change the RT behaviour with respect to closing the channel in this issue. I think it's fine to convert places in a different issue.

        Show
        Simon Willnauer added a comment - I saw your comment. I didn't want to change the RT behaviour with respect to closing the channel in this issue. I think it's fine to convert places in a different issue.
        Hide
        Uwe Schindler added a comment -

        Wanted to note: Had the LockVerify test running with ant test-lock-factory -Dlockverify.count=500000 running for an hour on Windows, works.

        Show
        Uwe Schindler added a comment - Wanted to note: Had the LockVerify test running with ant test-lock-factory -Dlockverify.count=500000 running for an hour on Windows, works.

          People

          • Assignee:
            Simon Willnauer
            Reporter:
            Simon Willnauer
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development