Derby
  1. Derby
  2. DERBY-5325

Checkpoint fails with ClosedChannelException in InterruptResilienceTest

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 10.9.1.0
    • Fix Version/s: 10.8.2.2, 10.9.1.0
    • Component/s: Store
    • Labels:
      None
    • Environment:
      Solaris 10 5/08 s10x_u5wos_10 X86
      Java(TM) SE Runtime Environment (build 1.7.0-b147)
      Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17 mixed mode)
    • Bug behavior facts:
      Regression Test Failure

      Description

      Seen here: http://dbtg.foundry.sun.com/derby/test/Daily/jvm1.7/testing/testlog/sol/1144688-suitesAll_diff.txt

      There was 1 error:
      1) testRAFWriteInterrupted(org.apache.derbyTesting.functionTests.tests.store.InterruptResilienceTest)java.sql.SQLException: The exception 'java.sql.SQLException: Log Record has been sent to the stream, but it cannot be applied to the store (Object null). This may cause recovery problems also.' was thrown while evaluating an expression.
      at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
      at org.apache.derby.impl.jdbc.Util.newEmbedSQLException(Unknown Source)
      at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
      at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source)
      at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source)
      at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source)
      at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown Source)
      at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(Unknown Source)
      at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source)
      at org.apache.derby.impl.jdbc.EmbedStatement.executeUpdate(Unknown Source)
      at org.apache.derbyTesting.functionTests.tests.store.InterruptResilienceTest.testRAFWriteInterrupted(InterruptResilienceTest.java:217)
      (...)
      Caused by: java.nio.channels.ClosedChannelException
      at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:94)
      at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:691)
      at org.apache.derby.impl.store.raw.data.RAFContainer4.writeFull(Unknown Source)
      at org.apache.derby.impl.store.raw.data.RAFContainer4.writeAtOffset(Unknown Source)
      at org.apache.derby.impl.store.raw.data.FileContainer.writeHeader(Unknown Source)
      at org.apache.derby.impl.store.raw.data.RAFContainer.writeRAFHeader(Unknown Source)
      at org.apache.derby.impl.store.raw.data.RAFContainer.clean(Unknown Source)
      at org.apache.derby.impl.services.cache.ConcurrentCache.cleanAndUnkeepEntry(Unknown Source)
      at org.apache.derby.impl.services.cache.ConcurrentCache.cleanCache(Unknown Source)
      at org.apache.derby.impl.services.cache.ConcurrentCache.cleanAll(Unknown Source)
      at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.checkpoint(Unknown Source)
      at org.apache.derby.impl.store.raw.log.LogToFile.checkpointWithTran(Unknown Source)
      at org.apache.derby.impl.store.raw.log.LogToFile.checkpoint(Unknown Source)
      at org.apache.derby.impl.store.raw.RawStore.checkpoint(Unknown Source)
      at org.apache.derby.impl.store.raw.log.LogToFile.performWork(Unknown Source)
      at org.apache.derby.impl.services.daemon.BasicDaemon.serviceClient(Unknown Source)
      at org.apache.derby.impl.services.daemon.BasicDaemon.work(Unknown Source)
      at org.apache.derby.impl.services.daemon.BasicDaemon.run(Unknown Source)
      at java.lang.Thread.run(Thread.java:722)

      1. derby.log
        155 kB
        Knut Anders Hatlen
      2. derby-5325a.diff
        11 kB
        Dag H. Wanvik
      3. derby-5325a.stat
        0.1 kB
        Dag H. Wanvik
      4. derby-5325-refactor.diff
        10 kB
        Dag H. Wanvik
      5. derby-5325-refactor.stat
        0.1 kB
        Dag H. Wanvik
      6. derby-5325-refactor-b.diff
        14 kB
        Dag H. Wanvik
      7. error-stacktrace.out
        9 kB
        Knut Anders Hatlen

        Issue Links

          Activity

          Hide
          Knut Anders Hatlen added a comment -

          Attaching derby.log and error-stacktrace.out from the failed run.

          Show
          Knut Anders Hatlen added a comment - Attaching derby.log and error-stacktrace.out from the failed run.
          Hide
          Dag H. Wanvik added a comment - - edited

          This seems to be a hitherto unseen corner case that is not adequately addressed: writeRAFHeader is not always interrupt safe. It should be solvable using the existing pattern.

          Show
          Dag H. Wanvik added a comment - - edited This seems to be a hitherto unseen corner case that is not adequately addressed: writeRAFHeader is not always interrupt safe. It should be solvable using the existing pattern.
          Hide
          Dag H. Wanvik added a comment -

          Uploading a patch for this issue, derby-5325a.

          With NIO, writeRAFHeader has two methods leading to interruptible IO:

          • getEmbryonicPage
          • writeHeader

          Currently, getEmbryonicPage may throw InterruptDetectedException and hence, so may writeRAFHeader.

          writeHeader may throw ClosedByInterruptException, AsynchronousCloseException and ClosedChannelException because writeHeader does not use RAFContainer4#writePage, but rather uses RAFContainer4#writeAtOffset, which does not currently attempt to recover after interrupt.

          So currently, clients of writeRAFHeader need to be prepared for all of InterruptDetectedException, ClosedByInterruptException, AsynchronousCloseException and ClosedChannelException.

          writeRAFHeader is used in three locations:

          • RAFContainer#clean
          • RAFContainer#run(CREATE_CONTAINER_ACTION)
          • RAFContainer#run(STUBBIFY_ACTION)

          RAFContainer#clean is prepared for InterruptDetectedException only. The issue shows that ClosedChannelException may also occur, and it is not prepared for that (this bug).

          RAFContainer#run(CREATE_CONTAINER_ACTION) is prepared for ClosedByInterruptException and AsynchronousCloseException. Since IO during container creation is single-threaded, this is sufficient: it should never need to handle ClosedChannelException/InterruptDetectedException, both of which signal that another thread saw interrupt on the container channel.

          RAFContainer#run(STUBBIFY_ACTION) is part of the removeContainer operation which should happen after the container is closed, so it should be single-threaded on the container as well. It should handle ClosedByInterruptException and AsynchronousCloseException and do retry, but doesn't, currently.

          If we let writeAtOffset clean up just like writePage, RAFContainer4#writeAtOffset (i.e.also writeHeader) would only only throw InterruptDetectedException, i.e. another thread saw interrupt, so retry. This would simplify logic in RAFContainer: we could remove the retry logic from RAFContainer#run(CREATE_CONTAINER_ACTION). This could also cover retry logic for RAFContainer#run(STUBBIFY_ACTION) wrt its use of writeRAFHeader.

          Next, RAFContainer#clean is already handling InterruptDetectedException and would with this change no longer see ClosedByInterruptException, AsynchronousCloseException or ClosedChannelException. This should solve DERBY-5325.

          I did not add a new test for this issue yet since I don't know how to force this scenario. We have only seen it once, I believe. I'll be running InterruptResilienceTest continuously with this patch along with the patch for DERBY-5312 on several platforms to gain more confidence.

          Show
          Dag H. Wanvik added a comment - Uploading a patch for this issue, derby-5325a. With NIO, writeRAFHeader has two methods leading to interruptible IO: getEmbryonicPage writeHeader Currently, getEmbryonicPage may throw InterruptDetectedException and hence, so may writeRAFHeader. writeHeader may throw ClosedByInterruptException, AsynchronousCloseException and ClosedChannelException because writeHeader does not use RAFContainer4#writePage, but rather uses RAFContainer4#writeAtOffset, which does not currently attempt to recover after interrupt. So currently, clients of writeRAFHeader need to be prepared for all of InterruptDetectedException, ClosedByInterruptException, AsynchronousCloseException and ClosedChannelException. writeRAFHeader is used in three locations: RAFContainer#clean RAFContainer#run(CREATE_CONTAINER_ACTION) RAFContainer#run(STUBBIFY_ACTION) RAFContainer#clean is prepared for InterruptDetectedException only. The issue shows that ClosedChannelException may also occur, and it is not prepared for that (this bug). RAFContainer#run(CREATE_CONTAINER_ACTION) is prepared for ClosedByInterruptException and AsynchronousCloseException. Since IO during container creation is single-threaded, this is sufficient: it should never need to handle ClosedChannelException/InterruptDetectedException, both of which signal that another thread saw interrupt on the container channel. RAFContainer#run(STUBBIFY_ACTION) is part of the removeContainer operation which should happen after the container is closed, so it should be single-threaded on the container as well . It should handle ClosedByInterruptException and AsynchronousCloseException and do retry, but doesn't, currently. If we let writeAtOffset clean up just like writePage, RAFContainer4#writeAtOffset (i.e.also writeHeader) would only only throw InterruptDetectedException, i.e. another thread saw interrupt, so retry. This would simplify logic in RAFContainer: we could remove the retry logic from RAFContainer#run(CREATE_CONTAINER_ACTION). This could also cover retry logic for RAFContainer#run(STUBBIFY_ACTION) wrt its use of writeRAFHeader. Next, RAFContainer#clean is already handling InterruptDetectedException and would with this change no longer see ClosedByInterruptException, AsynchronousCloseException or ClosedChannelException. This should solve DERBY-5325 . I did not add a new test for this issue yet since I don't know how to force this scenario. We have only seen it once, I believe. I'll be running InterruptResilienceTest continuously with this patch along with the patch for DERBY-5312 on several platforms to gain more confidence.
          Hide
          Dag H. Wanvik added a comment -

          Regressions passed.

          Show
          Dag H. Wanvik added a comment - Regressions passed.
          Hide
          Dag H. Wanvik added a comment -

          I ran 1000 iteration of InterruptResilienceTest on two different x86/solaris boxes with JDK 1.7 b147 with no incidents after applying both patches. Interestingly, I no longer see DERBY-5333 either whereas as before the patches I did see it typically a few times per 100 iterations.

          Show
          Dag H. Wanvik added a comment - I ran 1000 iteration of InterruptResilienceTest on two different x86/solaris boxes with JDK 1.7 b147 with no incidents after applying both patches. Interestingly, I no longer see DERBY-5333 either whereas as before the patches I did see it typically a few times per 100 iterations.
          Hide
          Knut Anders Hatlen added a comment -

          Thanks for the detailed analysis, Dag. The patch looks fine to me. It's great that the CREATE_CONTAINER_ACTION in RAFContainer could be simplified, so the checking of the name of the thrown exception was removed, since that approach would probably break down if some implementation threw a sub-class of the expected exception type, or if some later version of the specification refined the exception hierarchy.

          Two minor comments:

          • copy/paste: The debug string in writeAtOffset says that it's in writePage.
          • RAFContainer4 now has three almost identical blocks of error handling code for the interrupted case. Would it be possible to factor this out? For example just do a catch IOException in readPage/writePage/writeAtOffset and do the processing of the IOException in a separate method?
          Show
          Knut Anders Hatlen added a comment - Thanks for the detailed analysis, Dag. The patch looks fine to me. It's great that the CREATE_CONTAINER_ACTION in RAFContainer could be simplified, so the checking of the name of the thrown exception was removed, since that approach would probably break down if some implementation threw a sub-class of the expected exception type, or if some later version of the specification refined the exception hierarchy. Two minor comments: copy/paste: The debug string in writeAtOffset says that it's in writePage. RAFContainer4 now has three almost identical blocks of error handling code for the interrupted case. Would it be possible to factor this out? For example just do a catch IOException in readPage/writePage/writeAtOffset and do the processing of the IOException in a separate method?
          Hide
          Dag H. Wanvik added a comment -

          Thanks for looking at this, Knut. I am aware of the redundancy, it was painful to add it I think refactoring that would be a good next step.

          Show
          Dag H. Wanvik added a comment - Thanks for looking at this, Knut. I am aware of the redundancy, it was painful to add it I think refactoring that would be a good next step.
          Hide
          Dag H. Wanvik added a comment -

          Fixed the cut/paste typo and committed as svn 1148354. Will look to simplify the redundancy in a follow-up patch.

          Show
          Dag H. Wanvik added a comment - Fixed the cut/paste typo and committed as svn 1148354. Will look to simplify the redundancy in a follow-up patch.
          Hide
          Dag H. Wanvik added a comment -

          Attaching a patch to refactor the redundancies discussed, rerunning regressions and my back-to-back runs.

          Show
          Dag H. Wanvik added a comment - Attaching a patch to refactor the redundancies discussed, rerunning regressions and my back-to-back runs.
          Hide
          Knut Anders Hatlen added a comment -

          Thanks, Dag. The refactoring patch looks good to me. Some minor comments:

          • It would be good if the javadoc comment for the new method had a short description of the method, and not just of its parameters.
          • Typo: InterruptedDetectedException -> InterruptDetectedException
          • The assert in the else block is probably redundant since the declared type of the parameter e is ClosedChannelException. (Perhaps rename the parameter to "cce" to make this more obvious?)
          • Not directly related to the refactoring, but I noticed just now: Perhaps a short comment in writeAtOffset() explaining why we always set stealthMode to true would be in order?
          Show
          Knut Anders Hatlen added a comment - Thanks, Dag. The refactoring patch looks good to me. Some minor comments: It would be good if the javadoc comment for the new method had a short description of the method, and not just of its parameters. Typo: InterruptedDetectedException -> InterruptDetectedException The assert in the else block is probably redundant since the declared type of the parameter e is ClosedChannelException. (Perhaps rename the parameter to "cce" to make this more obvious?) Not directly related to the refactoring, but I noticed just now: Perhaps a short comment in writeAtOffset() explaining why we always set stealthMode to true would be in order?
          Hide
          Dag H. Wanvik added a comment -

          Attaching version "b" of the refactoring patch, which addresses Knut's comments, thanks!

          Show
          Dag H. Wanvik added a comment - Attaching version "b" of the refactoring patch, which addresses Knut's comments, thanks!
          Hide
          Dag H. Wanvik added a comment -

          Back-ported derby-5325a to 10.8 branch at svn 1151135.

          Show
          Dag H. Wanvik added a comment - Back-ported derby-5325a to 10.8 branch at svn 1151135.
          Hide
          Dag H. Wanvik added a comment -

          Committed the "-b" version of the refactoring patch at svn 1151612.

          Show
          Dag H. Wanvik added a comment - Committed the "-b" version of the refactoring patch at svn 1151612.
          Hide
          Dag H. Wanvik added a comment -

          Back-ported to 10.8 branch as svn 1151614, resolving.

          Show
          Dag H. Wanvik added a comment - Back-ported to 10.8 branch as svn 1151614, resolving.

            People

            • Assignee:
              Dag H. Wanvik
              Reporter:
              Knut Anders Hatlen
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development