DERBY-4741

Make embedded Derby work reliably in the presence of thread interrupts

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 10.2.1.6, 10.2.2.0, 10.3.1.4, 10.3.2.1, 10.3.3.0, 10.4.1.3, 10.4.2.0, 10.5.1.1, 10.5.2.0, 10.5.3.0, 10.6.1.0
    • Fix Version/s: 10.8.1.2
    • Component/s: Store
    • Urgency:
      Normal
    • Bug behavior facts:
      Regression

      Description

      When not executing on a small-device VM, Derby uses the Java NIO classes in java.nio.channels.* for file I/O to gain extra concurrency.
      If a thread is interrupted while executing a blocking NIO I/O operation, a ClosedByInterruptException is thrown. Unfortunately, Derby is not currently architected to retry and complete such operations (before passing on the interrupt), so the Derby database store can be left in an inconsistent state. Although no data is corrupted, Derby has to return a database-level error and perform shutdown and recovery. This means applications can no longer access the database while the shutdown and reboot, including recovery, is taking place.

      It would be nice if Derby could somehow detect and finish I/O operations underway when thread interrupts happen, before passing the exception on to the application. Embedded Derby is sometimes used in applications that use Thread.interrupt to stop threads.
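      For illustration, here is a minimal standalone sketch (not Derby code) of the NIO behavior described above: if a thread's interrupt flag is set when it performs (or while it is blocked in) a FileChannel operation, the operation throws ClosedByInterruptException and the channel is left closed, so the I/O cannot simply be retried on the same channel. The file name is arbitrary.

          import java.nio.ByteBuffer;
          import java.nio.channels.ClosedByInterruptException;
          import java.nio.channels.FileChannel;
          import java.nio.file.Paths;
          import java.nio.file.StandardOpenOption;

          public class NioInterruptDemo {
              public static void main(String[] args) throws Exception {
                  FileChannel ch = FileChannel.open(Paths.get("demo.dat"),
                          StandardOpenOption.CREATE, StandardOpenOption.READ,
                          StandardOpenOption.WRITE);

                  Thread.currentThread().interrupt();              // an interrupt arrives
                  try {
                      ch.write(ByteBuffer.wrap(new byte[16]), 0L); // blocking NIO write
                  } catch (ClosedByInterruptException e) {
                      // The write is aborted, the channel is closed, and the interrupt
                      // flag stays set; further I/O requires reopening the file.
                      System.out.println("channel open = " + ch.isOpen()
                              + ", interrupt flag = " + Thread.currentThread().isInterrupted());
                  }
              }
          }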

      1. derby.log
        4 kB
        Lily Wei
      2. derby.log
        27 kB
        Lily Wei
      3. derby-4741-a-01-api-interruptstatus.diff
        39 kB
        Dag H. Wanvik
      4. derby-4741-a-01-api-interruptstatus.stat
        0.7 kB
        Dag H. Wanvik
      5. derby-4741-a-02-api-interruptstatus.diff
        45 kB
        Dag H. Wanvik
      6. derby-4741-a-02-api-interruptstatus.stat
        1.0 kB
        Dag H. Wanvik
      7. derby-4741-a-03-api-interruptstatus.diff
        44 kB
        Dag H. Wanvik
      8. derby-4741-a-03-api-interruptstatus.stat
        1.0 kB
        Dag H. Wanvik
      9. derby-4741-a-04-api-interruptstatus.diff
        43 kB
        Dag H. Wanvik
      10. derby-4741-a-04-api-interruptstatus.stat
        1.0 kB
        Dag H. Wanvik
      11. derby-4741-all+lenient+resurrect.diff
        116 kB
        Dag H. Wanvik
      12. derby-4741-all+lenient+resurrect.stat
        2 kB
        Dag H. Wanvik
      13. derby-4741-b-01-nio.diff
        61 kB
        Dag H. Wanvik
      14. derby-4741-b-01-nio.stat
        0.7 kB
        Dag H. Wanvik
      15. derby-4741-b-02-nio.diff
        61 kB
        Dag H. Wanvik
      16. derby-4741-b-02-nio.stat
        0.6 kB
        Dag H. Wanvik
      17. derby-4741-b-03-nio.diff
        63 kB
        Dag H. Wanvik
      18. derby-4741-b-03-nio.stat
        0.6 kB
        Dag H. Wanvik
      19. derby-4741-b-04-nio.diff
        65 kB
        Dag H. Wanvik
      20. derby-4741-b-04-nio.stat
        0.6 kB
        Dag H. Wanvik
      21. derby-4741-c-01-nio.diff
        7 kB
        Dag H. Wanvik
      22. derby-4741-c-01-nio.stat
        0.1 kB
        Dag H. Wanvik
      23. derby-4741-fix-compilation.diff
        2 kB
        Dag H. Wanvik
      24. derby-4741-fix-compilation.stat
        0.2 kB
        Dag H. Wanvik
      25. derby-4741-kristians-01.diff
        4 kB
        Dag H. Wanvik
      26. derby-4741-nio-container+log.diff
        59 kB
        Dag H. Wanvik
      27. derby-4741-nio-container+log.stat
        0.8 kB
        Dag H. Wanvik
      28. derby-4741-nio-container+log+waits.diff
        63 kB
        Dag H. Wanvik
      29. derby-4741-nio-container+log+waits.stat
        0.8 kB
        Dag H. Wanvik
      30. derby-4741-nio-container+log+waits+locks.diff
        73 kB
        Dag H. Wanvik
      31. derby-4741-nio-container+log+waits+locks.stat
        1 kB
        Dag H. Wanvik
      32. derby-4741-nio-container+log+waits+locks+throws.diff
        100 kB
        Dag H. Wanvik
      33. derby-4741-nio-container+log+waits+locks+throws.stat
        2 kB
        Dag H. Wanvik
      34. derby-4741-nio-container+log+waits+locks-2.diff
        80 kB
        Dag H. Wanvik
      35. derby-4741-nio-container+log+waits+locks-2.stat
        1 kB
        Dag H. Wanvik
      36. derby-4741-nio-container-2.diff
        25 kB
        Dag H. Wanvik
      37. derby-4741-nio-container-2.log
        55 kB
        Dag H. Wanvik
      38. derby-4741-nio-container-2.stat
        0.4 kB
        Dag H. Wanvik
      39. derby-4741-nio-container-2b.diff
        39 kB
        Dag H. Wanvik
      40. derby-4741-nio-container-2b.stat
        0.4 kB
        Dag H. Wanvik
      41. derby-4741-raf-stresstest-1.diff
        17 kB
        Dag H. Wanvik
      42. derby-4741-raf-stresstest-1.stat
        0.2 kB
        Dag H. Wanvik
      43. derby-4741-raf-stresstest-2.diff
        17 kB
        Dag H. Wanvik
      44. derby-4741-raf-stresstest-2.stat
        0.2 kB
        Dag H. Wanvik
      45. derby-4741-raf-stresstest-3.diff
        20 kB
        Dag H. Wanvik
      46. derby-4741-raf-stresstest-3.stat
        0.2 kB
        Dag H. Wanvik
      47. derby-4741-raf-stresstest-4.diff
        23 kB
        Dag H. Wanvik
      48. derby-4741-raf-stresstest-4.stat
        0.4 kB
        Dag H. Wanvik
      49. derby-4741-sleeps-waits-1.diff
        5 kB
        Dag H. Wanvik
      50. derby-4741-sleeps-waits-1.stat
        0.2 kB
        Dag H. Wanvik
      51. derby-4741-sleeps-waits-2.diff
        19 kB
        Dag H. Wanvik
      52. derby-4741-sleeps-waits-2.stat
        0.8 kB
        Dag H. Wanvik
      53. derby-4741-sleeps-waits-3.diff
        19 kB
        Dag H. Wanvik
      54. derby-4741-sleeps-waits-3.stat
        0.8 kB
        Dag H. Wanvik
      55. derby-4741-sleeps-waits-more.diff
        8 kB
        Dag H. Wanvik
      56. derby-4741-sleeps-waits-more.stat
        0.5 kB
        Dag H. Wanvik
      57. derby-4741-testBatchInterrupt.diff
        10 kB
        Dag H. Wanvik
      58. derby-4741-testBatchInterrupt-b.diff
        3 kB
        Dag H. Wanvik
      59. derby-4741-testQueryInterrupt.diff
        8 kB
        Dag H. Wanvik
      60. derby-4741-testQueryInterrupt.stat
        0.2 kB
        Dag H. Wanvik
      61. InterruptResilienceTest.java
        9 kB
        Lily Wei
      62. interrupts-fs.html
        8 kB
        Dag H. Wanvik
      63. interrupts-fs.html
        8 kB
        Dag H. Wanvik
      64. interrupts-fs.html
        7 kB
        Dag H. Wanvik
      65. interrupts-fs.html
        6 kB
        Dag H. Wanvik
      66. MicroAPITest.java
        2 kB
        Dag H. Wanvik
      67. sleep-1-usages.txt
        2 kB
        Dag H. Wanvik
      68. wait-0-usages.txt
        5 kB
        Dag H. Wanvik
      69. wait-1-usages.txt
        1 kB
        Dag H. Wanvik
      70. xsbt0.log.gz
        23 kB
        Dag H. Wanvik

        Issue Links

          Activity

          Knut Anders Hatlen added a comment -

          Since Derby used to handle interrupts in a better way before we switched to NIO, would it make sense to change the issue type to bug and flag it as a regression?

          Dag H. Wanvik added a comment -

          Yes, I agree, Knut. Just did it.

          Dag H. Wanvik added a comment -

          Uploading an experimental patch which upon seeing the container channel interrupted/closed,
          closes and reopens the container to allow completion of the I/O.

          Using a modified Derby151Test, the trace on my box (OpenSolaris snv_148, Java 1.6) shows how
          the RAFContainer4.java I/O code recovers. When an interrupt is detected (in the form of an interrupted channel),
          the thread's interrupt flag is tucked away in a thread local variable for now, and the flag is reset, so the thread can continue and retry the I/O operation when the container has been resurrected.

          The idea is that the thread local variable might be checked "higher up" somewhere, where throwing an exception would not make the database go down.

          During this investigation, I have found numerous other locations at which an interrupt will make Derby go down, though, so RAFContainer4.java (or indeed NIO) is not the only weak spot we have.
          Running the test on Windows, I see Derby choke on trying to switch log files, cf. the enclosed derby.log file "derby-4741-nio-container-2.log", due to seeing a ClosedChannelException on the log file (NIO channel.force).
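          For illustration, a toy version of the reopen-and-retry idea described above; this is not the actual RAFContainer4 patch, and the class shape, the interruptNoted field, and the simplified bookkeeping are assumptions.

              import java.io.IOException;
              import java.nio.ByteBuffer;
              import java.nio.channels.AsynchronousCloseException;
              import java.nio.channels.ClosedByInterruptException;
              import java.nio.channels.FileChannel;
              import java.nio.file.Path;
              import java.nio.file.StandardOpenOption;

              // Toy "container" that reopens its file channel and retries when an
              // interrupt closes it, instead of letting the error escape.
              class RetryingContainer {
                  private final Path file;
                  private FileChannel channel;
                  private boolean interruptNoted;   // stand-in for a thread-local interrupt note

                  RetryingContainer(Path file) throws IOException {
                      this.file = file;
                      this.channel = FileChannel.open(file,
                              StandardOpenOption.READ, StandardOpenOption.WRITE);
                  }

                  void readPage(long offset, ByteBuffer page) throws IOException {
                      while (true) {
                          try {
                              page.rewind();
                              channel.read(page, offset);
                              return;                           // I/O completed normally
                          } catch (ClosedByInterruptException e) {
                              // This thread was interrupted: remember it, clear the flag so
                              // the retry is not aborted again, then resurrect the container.
                              interruptNoted = true;
                              Thread.interrupted();             // clears the flag
                              reopen();
                          } catch (AsynchronousCloseException e) {
                              // Another (interrupted) thread closed the channel under us;
                              // just resurrect the container and retry.
                              reopen();
                          }
                      }
                  }

                  private void reopen() throws IOException {
                      channel = FileChannel.open(file,
                              StandardOpenOption.READ, StandardOpenOption.WRITE);
                  }
              }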

          Dag H. Wanvik added a comment -

          Uploading a modified version of the experimental patch, nio-container-2b, which also includes a standalone test I have been using to flush out effects of interrupts on threads.

          Dag H. Wanvik added a comment -

          Uploading derby-4741-nio-container+log, which builds on the previous
          experimental patch in the following ways:

          • Adds logic to recover when switching the log file gets interrupted
            (seen on Windows using Derby151Test). The retry had to percolate up
            from the NIO code, so I use an internal exception (temporarily
            borrowing an existing one; should make a new one later)
          • Makes RAFContainer4#getEmbryonicPage use a minion of readPage, so
            getEmbryonicPage can take advantage of the recovery machinery as
            well (it previously did a direct call to readFull, which made it
            vulnerable to being interrupted; this was also seen on Windows).

          Unfortunately, the latter could lead to deadlocks, because when
          getEmbryonicPage is called from writeRAFHeader, the thread has a
          lock on "this". If another thread is has been interrupted and is
          about to do recovery, it would get stuck on waiting for the monitor
          on "this", while the getEmbryonicPage reader would get stuck on
          waiting on recovery to finish. To solve this, I had to let reads
          from getEmbryonicPage throw an internal exception so it can back out
          and release the monitor on "this" (in RAFContainer#clean), and do a
          retry from that level.

          The patch is just a snapshot of my experiments, only intended so
          people could comment on the approach.
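          For illustration, a self-contained toy showing the shape of the back-out-and-retry described above; this is not Derby code, all names are made up, and the recovery that another thread would perform is done inline so the sketch stays runnable.

              // The reader gives up the monitor on "this" by throwing an internal
              // exception, so channel recovery can take the monitor; the caller
              // then retries from a level that holds no locks.
              class BackOutExample {
                  static class RetryAfterRecovery extends Exception {}

                  private volatile boolean channelUsable = false;

                  synchronized void writeHeader() throws RetryAfterRecovery {
                      if (!channelUsable) {
                          // Can't wait for recovery here: the recovering thread needs the
                          // monitor on "this" that we are currently holding. Back out.
                          throw new RetryAfterRecovery();
                      }
                      // ... read the embryonic page and write the header ...
                  }

                  synchronized void recoverChannel() {
                      channelUsable = true;   // conceptually: reopen the container's channel
                  }

                  void clean() {
                      while (true) {
                          try {
                              writeHeader();
                              return;
                          } catch (RetryAfterRecovery e) {
                              // Monitor released now, so recovery can proceed; then retry.
                              recoverChannel();
                          }
                      }
                  }

                  public static void main(String[] args) {
                      new BackOutExample().clean();
                      System.out.println("clean() completed after back-out and retry");
                  }
              }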

          Dag H. Wanvik added a comment -

          Now, heading back to Solaris after having flushed out NIO issues on Windows, I see that Derby falls over when it gets interrupted
          during transaction abort (XSTB0), as a result of receiving an interrupt during a log flush wait (LogToFile.java:4048). I saw this using InterruptedTest in the patch. The test tries to reconnect (reboot), but eventually I can no longer get a connection, because I hit a NullPointerException in BaseDataFileFactory.java:639 (containerCache empty??). Cf. the enclosed xsbt0.log.

          Dag H. Wanvik added a comment -

          Uploading a revised experimental patch, derby-4741-nio-container+log+waits, which intercepts and saves away interrupts in

          • LogToFile#flush, when a thread is waiting for another thread to flush the log.
          • BasePage#setExclusive (2 places)
          • BasePage#setExclusiveNoWait

          It also fixes a bug in NIO getEmbryonicPage; the new implementation tried to decrypt when reading the embryonic page if the database was encrypted.

          With these changes in place, the next issue that I sometimes see is that we can get interrupted while waiting for a lock.

          java.lang.InterruptedException
          at java.lang.Object.wait(Native Method)
          at org.apache.derby.impl.services.locks.ActiveLock.waitForGrant(ActiveLock.java:118)
          at org.apache.derby.impl.services.locks.ConcurrentLockSet.lockObject(ConcurrentLockSet.java:469)

          In turn this led to continual failure because I hit this ASSERT:

          at org.apache.derby.shared.common.sanity.SanityManager.THROWASSERT(SanityManager.java:147)
          at org.apache.derby.impl.services.locks.LockControl.addLock(LockControl.java:276)

          Knut suggested to me that we try to treat this situation similarly to how we handle lock time-outs and just throw an interrupted exception at a safe place. Sounds good to me.
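          For illustration, a minimal sketch of the intercept-and-remember pattern applied to a flush wait; the thread-local note stands in for the patch's InterruptStatus bookkeeping, and the class and method names are invented.

              // Instead of letting InterruptedException escape from the wait (which
              // previously ended in a database-level error), note the interrupt and
              // keep waiting; the note can be turned into an exception at a safe place.
              class FlushWaiter {
                  private static final ThreadLocal<Boolean> interruptNoted =
                          ThreadLocal.withInitial(() -> Boolean.FALSE);

                  private final Object flushDone = new Object();
                  private boolean flushed;

                  void waitForFlush() {
                      synchronized (flushDone) {
                          while (!flushed) {
                              try {
                                  flushDone.wait();
                              } catch (InterruptedException e) {
                                  interruptNoted.set(Boolean.TRUE);   // remember, don't abort
                              }
                          }
                      }
                  }

                  void flushCompleted() {
                      synchronized (flushDone) {
                          flushed = true;
                          flushDone.notifyAll();
                      }
                  }
              }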

          Mike Matrigali added a comment -

          So far your approach seems fine to me, and I think it will help a lot of customers. I was wondering if
          you have a high-level goal. Basically, what should Derby do when it encounters an interrupt?

          I think it should always throw some sort of error back to the caller, and it would be nice if the error or
          the nested error indicated clearly that it is being thrown because of a thread interrupt. Many times
          when supporting this issue it takes a while for the user to realize that they are the ones generating
          the interrupt.

          I think the error should not be database level, and with your retry on I/O and log it seems like we
          can avoid that. I am not sure if it should be session or statement level,
          I am leaning to it being session level, but would like to see a discussion. I know often users are
          doing the interrupt to stop a thread.

          Ultimately do you plan on always reenabling the interrupt after retrying or getting to a safe place?

          We should definitely throw an error in the case of an interrupt during a lock wait; this seems like the
          one valid case to send an interrupt to Derby if a user thinks it is waiting too long. Hopefully you can
          code it to handle it, do processing, get to a safe spot and then throw an error indicating a real user
          interrupt during a wait.

          Something to think about in the error throwing is the background thread. What should we do if
          it encounters an interrupt? Throwing an error there might shut down the db; I am not sure what it
          does at top level with an error.

          Dag H. Wanvik added a comment -

          Hi Mike, thanks for looking at this!

          > I was wondering if you have a high level
          > goal. Basically what should derby do when it encounters an
          > interrupt.

          I started off merely trying to make the new I/O pieces stand up as well
          as the old I/O, but I guess my ambitions grew a bit as I found many
          things that could bring Derby down.

          My thinking so far is that Derby should:

          a) avoid having to shut down if a thread has been interrupted

          b) throw something when it finds the thread has been interrupted,
          preferably as soon as possible, but not before the code has arrived at
          a "safe place", so we can avoid Derby shutting down.

          > I think it should always throw some sort of error back to caller,
          > and would be nice if the error or the nested error indicated clearly
          > that it is being thrown because of a thread interrupt. Many times
          > when supporting this issue it takes awhile for the user to realize
          > that they are the ones generating the interrupt.

          Agreed. So far I have been thinking of just using the connection level error 08000,
          wrapping the original NIO channel exception, or InterruptedException.
          This should make it clear what has happened, and unwrapping the exception would
          show where Derby detected it.

          > I think the error should not be database level, and with your retry
          > on I/O and log it seems like we can avoid that.

          That's what I am trying to achieve, yes.

          > I am not sure if it should be session or statement level, I am
          > leaning to it being session level, but would like to see a
          > discussion. I know often users are doing the interrupt to stop a
          > thread.

          The existing code throws session level error 08000 already in many
          places, and I felt it's OK to require that the user reconnect when she
          has done something as drastic as interrupting the thread.

          > Ultimately do you plan on always reenabling the interrupt after
          > retrying or getting to a safe place?

          I am still pondering this question. I think that by the principle of
          least surprise to the user, we should set the interrupted flag of the
          thread again just before we throw the exception from the "safe place".

          Note: the existing code does not resurrect the
          thread's interrupted flag when it detects an interrupt of a wait and throws
          StandardException.interrupt. (wait would have cleared the flag).

          An immediate problem is where the "safe place" would be. It would be
          nice to avoid having to check in all JDBC API methods, before we
          return, whether Derby has been interrupted during the API call, but I
          am not yet sure I can determine conclusively which API code paths
          could lead to Derby being interrupted. One approach would be to throw
          as "soon as possible" from a "safe place" on the stack above the
          method that saw the interrupt, but it may be hard to determine
          where that would be in all cases. I am open to suggestions here.

          The current experimental patch mostly doesn't throw yet (I didn't get
          that far) - it just makes a note that an interrupt was detected. Nor
          does it resurrect the interrupted flag when it does throw - since I
          was still torn on this.

          > We should definitely throw an error in the case of an interrupt
          > during a lock wait, this seems like the one valid case to send an
          > interrupt to derby if a user thinks it is waiting too
          > long. Hopefully you can code it to handle it, do processing, get to
          > safe spot and then throw an error indicating a real user interrupt
          > during a wait.

          Yes, I am trying to do exactly that, agreed.

          I also would like interrupts on queries to be detected no later than
          at the place where we check for query time-outs already,
          cf. BasicNoPutResultSetImpl#checkCancellationFlag.

          > Something to think about in the error throwing is the background
          > thread. What should we do if it encounters an interrupt. Throwing an
          > error there might shutdown the db, i am not sure what it does at top
          > level with an error.

          Right. A priori, I thought the most common use case would be user
          threads getting interrupted, but I'll look into that. One approach
          could be to shut down, or possibly to try to make it impervious and ignore
          all interrupts.

          Dag H. Wanvik added a comment -

          Uploading a new version of the experimental patch, derby-4741-nio-container+log+waits+locks,
          which

          a) handles interrupts during lock wait
          b) retries when LogToFile fails in switchLogFile#preAllocateNewLogFile's call to syncFile, which I also saw
          (there are probably more places where logging can fail)
          c) adds a "lock" argument to InterruptTest to test safe throws for a)

          Knut Anders Hatlen added a comment (edited) -

          Thanks Dag. The proposed plan sounds good to me.

          As to what to do with the interrupt status after a retry, I'd be happy as long as we always either fail or preserve the interrupt status. I don't have any strong opinions on whether or not we should set the interrupt flag when raising the exception. I see that the pre-NIO methods that may be interrupted (wait, sleep, join) will throw an exception and clear the interrupt status, whereas the interruptible NIO methods will throw an exception and set the interrupt status, so the precedence from the class library is ambiguous. I agree, though, that the user's intent is probably to stop the thread, and then preserving the interrupt status sounds reasonable in order to prevent that the thread just runs to the next blocking call and gets blocked there.
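          For illustration, a tiny runnable demonstration of the first half of that asymmetry: Object.wait() throws InterruptedException with the interrupt status already cleared. (The NIO half, where the status stays set, needs a file channel and is omitted here.)

              public class WaitClearsInterruptFlag {
                  public static void main(String[] args) throws InterruptedException {
                      final Object lock = new Object();
                      Thread t = new Thread(() -> {
                          synchronized (lock) {
                              try {
                                  lock.wait();
                              } catch (InterruptedException e) {
                                  // Prints false: wait() cleared the flag before throwing.
                                  System.out.println("flag after wait(): "
                                          + Thread.currentThread().isInterrupted());
                              }
                          }
                      });
                      t.start();
                      Thread.sleep(100);
                      t.interrupt();
                      t.join();
                  }
              }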

          Dag H. Wanvik added a comment -

          Thanks, Knut! I agree it's a bit weird that the JRE is a little schizophrenic about the interrupt flag. I think we should try to be consistent when we throw, no matter where/after which Java API call Derby detects it. If we do retain the interrupt flag, I think it adds to the argument for making the exception session-level, since the user would need to undertake some cleanup anyway, or else the next Derby call would fail as well. Probably this would be the use case for connection pools. As for arguments for making the error statement level, I guess if one just wants to stop a run-away query, one would possibly prefer that just the statement failed. A thing to consider here is the new JDBC 4.1 method Connection#abort:
          this would be the future portable way of stopping a run-away transaction. It will also close the connection.

          This patch passed regressions, for what it's worth.
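          For illustration, a hedged usage sketch of Connection.abort(Executor), assuming a JVM and driver that implement JDBC 4.1; the connection URL and executor choice are arbitrary.

              import java.sql.Connection;
              import java.sql.DriverManager;
              import java.sql.SQLException;
              import java.util.concurrent.ExecutorService;
              import java.util.concurrent.Executors;

              public class AbortDemo {
                  public static void main(String[] args) throws SQLException {
                      Connection conn =
                              DriverManager.getConnection("jdbc:derby:demoDB;create=true");
                      ExecutorService cleanup = Executors.newSingleThreadExecutor();
                      // Abort the connection from "outside" instead of interrupting the
                      // thread that is using it; resource release runs on the executor.
                      conn.abort(cleanup);
                      cleanup.shutdown();
                      // conn is now closed; further use throws SQLException.
                  }
              }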

          Lily Wei added a comment -

          Thanks to Dag for fixing this bug. It will benefit a lot of customers. As for making the exception session level or statement level, I would prefer to have it stop a run-away query and have only the statement fail. For most web applications, it could be easier to handle the interrupt at the statement level than at the session level.
          As a test of nio+container+log+waits+locks, it also passed suites.all on my Windows 7 machine. However, when running the mailjdbc tests, the test got Error 40XD1: Container was opened in read-only mode for the Browsing thread, and Error '-1: java.lang.ArrayIndexOutOfBoundsException' for the Backup thread.
          I am not totally sure why the behavior differs with this patch. However, with the derby-4741-nio-container+log patch, mailjdbc gets a deadlock exception. I am attaching derby.log as a reference point.

          Dag H. Wanvik added a comment -

          Hi Lily, thanks for test driving the patches! Do you know if mailjdbc uses interrupts in any way? I'll try to run it myself; maybe some of the changes altered behavior even in the absence of interrupts. Btw, there are still many holes: I have found more places, especially in the logging system, that are vulnerable to
          interrupts, so I am still building out the experimental patch.

          Knut Anders Hatlen added a comment -

          Hi Dag,

          There's some discussion about mailjdbc and interrupts on DERBY-3746. The comments there seem to indicate that the test sometimes interrupts the worker threads.

          Lily Wei added a comment -

          Hi Dag:
          Thank you. I am so excited to help out with testing and hope I can do more. Like Knut said, mailjdbc interrupts in Browse.java (more detail on DERBY-3746 and DERBY-4166). The portion of the code in Browse.java that calls interrupt is:
          // Checking whether Refresh thread is running after doing Browse work
          // If Refresh is not running, interrupt the thread
          if (ThreadUtils.isThreadRunning("Refresh Thread")) {
              MailJdbc.logAct.logMsg("******** Refresh is running");
          } else {
              Refresh th = (Refresh) ThreadUtils.getThread("Refresh Thread");
              th.interrupt();
          }

          The Refresh thread and the Browse thread are dealing with the same tables in the mailjdbc application. I am also reading Derby151Test to see whether we should simplify the mailjdbc test to better fit testing this fix, which will benefit a lot of customers. Thanks again. By the way, the read-only error on the Browse thread and the ArrayIndexOutOfBoundsException from the Backup thread are from running mailjdbc with the embedded server.

          Dag H. Wanvik added a comment -

          Thanks, Knut and Lily.
          Ok, good to know that there is an interrupt involved when mailjdbc fails. Lily, also please note that Derby151Test in its present form will probably change when I start to actually throw Derby exceptions when interrupts are detected. So far, they are mostly only noted in InterruptStatus (a thread local) and normal operation continues. The latest patch has started to throw exceptions for interrupts while waiting for locks, though, cf. the mods in LockSet.java/ConcurrentLockSet.java, since this seemed the obvious place to throw anyway.

          Dag H. Wanvik added a comment -

          Lily, I looked at derby.log: The stack trace you see first: "ERROR 40XD1: Container was opened in read-only mode." seems to happen
          while Derby is opening the conglomerate in the normal fashion and finds for some reason that it (or the directory) is not writable. It is not obvious to me that
          this has to do with the patch, but it is not impossible, I suppose, since the patch does touch RAFContainer(4), e.g. by reopening the container if the channel has been closed by interrupts. But this seems to be a normal open, not a reopen called from RAFContainer4 (which is not on the stack).
          Cf. RAFContainer#run, action OPEN_CONTAINER_ACTION, ca line 1447.

          I'll see if I can run the test myself and reproduce the error.

          Dag H. Wanvik added a comment -

          Attaching derby-4741-nio-container+log+waits+locks-2, which adds more interrupt recovery logic, mainly to LogToFile.java and LogAccessFile. With this version I have not yet been able to crash the database (i.e. provoke a database-level error) using the three run modes of InterruptedTest (default, i.e. write, read and lock), so we are getting more resilient.

          Only interrupted lock waits have been added as new Derby exception cases so far; I'll start looking more seriously now into where to throw. Indeed, some existing cases where we did throw in LogToFile don't throw in the current patch; we just take a note and retry.

          Dag H. Wanvik added a comment -

          Lily, I ran mailjdbc test with my latest patch on Windows Vista for about 20 hours; I saw deadlocks, but not the read-only container error.
          I did see several mysterious lines in derby.log, e.g.

          "not switching from an empty log file (870)"

          similar to what you had in your derby.log. I have no experience with this test, so I don't know if this is normal for this test or a sign that something is wrong; I'll try to run the test without the patch and compare.

          Dag H. Wanvik added a comment -

          Re mailjdbc: It seems "not switching from an empty log file" in derby.log is normal for this test, at least I see it on trunk without the patch.

          Lily Wei added a comment -

          Thanks Dag for spending time looking at mailjdbc. Yes, it is normal, and I see it on trunk without the patch too. However, I haven't seen the deadlock with the patch on my Windows 7 machine; I have been running it for 16 hours. It would be great for Derby if the deadlock is not normal. I will run more to see.

          Dag H. Wanvik added a comment -

          Uploading another experimental patch, derby-4741-nio-container+log+waits+locks+throws.

          The change now is

          • some further generalizing of recovery logic in RAFContainer4; the readPage logic was missing some things I had added for writePage
          • introduced InterruptStatus#throwIf, which is a helper method to check & reset the thread's interrupt status and throw an SQLException (08000 for now) if set (a hedged sketch of such a helper follows at the end of this comment)
          • added throwIf next to the query timeout logic in BasicNoPutResultSetImpl#checkCancellationFlag
          • added throwIf before all returns in impl/jdbc/*.java in methods which call (directly or indirectly) handleException(t), typically this pattern:
                    ...
                    InterruptStatus.throwIf();
                    return;
                } catch (Throwable t) {
                    throw handleException(t);
                }

          This will make sure Derby throws if it ever did see an interrupt during operation, which was intercepted for safety and postponed till a "safe place". This is a "catch-all", though, and I'll look for other places closer to the
          originating sites (container I/O, log I/O) so Derby could throw earlier.

          For example, commit can now throw if we see an interrupt during writing to log, cf. call to throwIf in EmbedConnection#commit.

          Currently all interrupts throw 08000, which is session level, and none yet set the thread's interrupt flag. I will go over all usages of 08000 and set the flag next if we agree that's the right thing to do. Note that Derby151Test fails now, but InterruptTest should work and report all interrupts Derby sees (some interrupts happen while the test is in application code; this is called out in the test's results). Btw, InterruptTest is not yet a JUnit test.

          The patch is getting a bit large and unwieldy now, but not to worry: it is just a proof of concept, and I'll break it down into smaller pieces before any proposed commit. Rerunning regressions.

          I see Lily would prefer a statement level error, rather than session level. I am sticking to session level for now since that's what the existing 08000 currently is. I'll go over the remaining existing usages of 08000 which I didn't already replace with "InterruptStatus.setInterrupted" + retry logic. If all are OK to replace, we could probably choose to change the exception to statement level rather than session level. If not, it might be confusing to have two kinds of SQLExceptions for interrupts, with different severity...
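          For illustration, a hedged sketch of what a throwIf-style helper could look like, following the description above (check a saved thread-local note, reset it, and throw session-level 08000); the real Derby class will differ in detail.

              import java.sql.SQLException;

              final class InterruptStatusSketch {
                  private static final ThreadLocal<Boolean> noted =
                          ThreadLocal.withInitial(() -> Boolean.FALSE);

                  // Called where an interrupt is intercepted and postponed.
                  static void setInterrupted() {
                      noted.set(Boolean.TRUE);
                  }

                  // Called at a "safe place", e.g. just before returning from a JDBC method.
                  static void throwIf() throws SQLException {
                      if (noted.get()) {
                          noted.set(Boolean.FALSE);
                          throw new SQLException(
                                  "Operation was interrupted by Thread.interrupt()", "08000");
                      }
                  }
              }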

          Dag H. Wanvik added a comment -

          Uploading the last patch again (derby-4741-nio-container+log+waits+locks+throws) - it had a small bug that broke LogSwitchFail.unit; fixed now.

          Dag H. Wanvik added a comment -

          A concern with adding the "catch-all" check is performance of fast API operations, like ResultSet#next, which usually just fetches the next prefetched row (no I/O). I'll run some micro benchmarks to study whether adding the interrupt check is too expensive in such a case.
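          For illustration, a rough sketch of such a micro benchmark in the spirit of the attached MicroAPITest.java (the actual test may differ); the table, row count, and timing approach are arbitrary.

              import java.sql.Connection;
              import java.sql.DriverManager;
              import java.sql.PreparedStatement;
              import java.sql.ResultSet;
              import java.sql.SQLException;
              import java.sql.Statement;

              public class NextBench {
                  public static void main(String[] args) throws SQLException {
                      Connection c = DriverManager.getConnection("jdbc:derby:benchDB;create=true");
                      c.setAutoCommit(false);
                      try (Statement s = c.createStatement()) {
                          s.executeUpdate("create table t(i int)");
                          try (PreparedStatement ins =
                                  c.prepareStatement("insert into t values (?)")) {
                              for (int i = 0; i < 100000; i++) {
                                  ins.setInt(1, i);
                                  ins.executeUpdate();
                              }
                          }
                          c.commit();

                          long rows = 0;
                          long start = System.nanoTime();
                          try (ResultSet rs = s.executeQuery("select i from t")) {
                              while (rs.next()) {    // hot path: mostly prefetched rows, no I/O
                                  rows++;
                              }
                          }
                          long elapsed = System.nanoTime() - start;
                          System.out.println(rows + " rows, " + (elapsed / rows) + " ns per next()");
                      }
                  }
              }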

          Lily Wei added a comment -

          Thanks Dag for giving such detail about each change. It definitely helps people like me learn and understand more. The patch is getting big and complicated. However, I ran mailjdbc against the latest derby-4741-nio-container+log+waits+locks+throws, and the test ran into ERROR 40001: A lock could not be obtained due to a deadlock. derby.log is attached for reference. Do we need InterruptedException handling on synchronized (slaveRecoveryMonitor) in LogToFile.recover(...)? Will changing the interrupt exception to statement level rather than session level require more code changes, such as in code under impl/store/raw/xact/*? I was just thinking statement level is more user friendly to application users.

          Dag H. Wanvik added a comment -

          Hi Lily,

          thanks again for looking at this

          > However, I run mailjdbc again the latest
          > derby-4741-nio-container+log+waits+locks+throws, the test run into
          > Error ERROR 40001: A lock could not be obtained due to a
          > deadlock. derby.log is attached for reference.

          Yes, I saw such lock errors too after about 18 hours or so. I am
          not yet sure they have anything to do with the patch changes, but I
          can't rule it out; I don't know the test app well enough. If you have
          any insight indicating that such lock errors should not occur
          under normal operation, please share it. We could try running the
          patch without the changes I made to ActiveLock, ConcurrentLockSet and
          LockSet, and see whether the changes there have anything to do with any
          change of behaviour.

          > Do we need
          > InterruptException on synchronized (slaveRecoveryMonitor) in
          > LogToFile.recover(...)?

          I don't think I changed that part of the code. I see that an interrupt
          is just ignored there, so I think we can safely add a call to
          "InterruptStatus.setInterrupted" there as well, if that's what you
          mean.

          Generally I have been concentrating on "normal operation" so far; I
          haven't yet looked into things like replication, backup,
          export/import, or even LOB/CLOB streams. I suspect there may be
          issues with interrupts happening during I/O in those areas as well.

          > Will change interrupt exception to statement
          > level rather than session level require more code changes? Such as
          > code in impl/store/raw/xact/*

          I don't know that for sure, but I suspect it might; the cleanup on
          session termination just seemed safer and in line with what we already do
          for interrupt exceptions. More investigation is needed.

          > I was just thinking statement level is more user friendly to
          > application users.

          Yes, I see that point. If we decide that's the way we would ultimately
          like to go, I think I will still defer that to a next increment (maybe
          a new JIRA issue?) of this work just to keep things simpler for now.

          Dag H. Wanvik added a comment -

          Running a micro benchmark to test the effect of adding InterruptStatus.throwIf to the APIs
          was a bit inconclusive. With the latest patch, using the attached MicroAPITest.java (ca 6-7 runs), I saw a ca 2.5% degradation of performance with just calls to ResultSet.next for a database that fits in memory.

          However, just adding the InterruptStatus.throwIf calls to the API without the rest of the patch seemed to have no significant effect, so perhaps there are other parts of the patch that made the operation somewhat slower. I'll try to incrementally remove parts of the patch and see what I find.

          Knut Anders Hatlen added a comment -

          I've also tried the micro benchmark, but got somewhat different results. In 20 runs with a clean trunk and 20 runs with the derby-4741-nio-container+log+waits+locks+throws.diff patch in random order, clean trunk was on average 8.5% faster than the patched version (3.64 s/query vs 3.95 s/query).

          I ran the experiment with insane jars on OpenSolaris 2010.05 snv_134b and Java 1.6.0_18, and I had upped derby.storage.pageCacheSize to 12500.

          Lily Wei added a comment -

          Thanks for the important finding. This kind of result makes it harder for Derby to stand out from other Java databases; I humbly believe Derby will keep on shining if performance keeps improving. Do you have any suggestions for fixing thread interrupts while keeping performance no more than 3% slower than the current clean trunk?

          Knut Anders Hatlen added a comment -

          Oops... Clean trunk wasn't 8.5% faster than the patched version; it was 7.8% faster. What I should have said was that the patched version was 8.5% slower than clean trunk.

          I ran the micro benchmark through a profiler, but I didn't see anything there to explain a ~8% degradation. I did see more time spent in ThreadLocal.get() (no surprise there...), but only enough to explain one or two percent, roughly what Dag saw. I'll try to dig a little more.

          Dag H. Wanvik added a comment - edited

          Thanks for doing more measurements, Knut! It seems that even in the best case, the movePosition operation is negatively impacted by 1-2% due to the ThreadLocal access, which is not desirable.

          Knut mentioned that ContextService is already based on a ThreadLocal. I checked around a bit and found that we might be able to get rid of the InterruptStatus thread local check in the API methods in this way:

          If we were able to retrieve the lcc at the point where we detect the interrupts and save this fact in the lcc at that time, the cost of accessing a ThreadLocal variable would not be incurred unless AN INTERRUPT ACTUALLY HAPPENED.

          In the API methods, lcc is already available, and checking a boolean flag in the lcc instead of calling InterruptStatus#throwIf (which accesses a thread local) would be much cheaper. I see we already sometimes do dig out the lcc in store level code, e.g. in store.access.DiskHashtable, in this way:

          LanguageConnectionContext lcc = (LanguageConnectionContext)
              ContextService.getContextOrNull(   // this call accesses a ThreadLocal
                  LanguageConnectionContext.CONTEXT_ID);

          I'll look into this approach a bit.
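
          A rough sketch of that idea; the flag accessors on the lcc are
          illustrative only and not part of any patch yet:

              // Rare path: where store code detects the interrupt, pay the
              // ThreadLocal cost once to find the lcc and record the fact there.
              static void noteInterrupt() {
                  LanguageConnectionContext lcc = (LanguageConnectionContext)
                      ContextService.getContextOrNull(
                          LanguageConnectionContext.CONTEXT_ID);
                  if (lcc != null) {
                      lcc.setInterruptedFlag(true);      // hypothetical setter
                  }
              }

              // Hot path: in a JDBC API method the lcc is already at hand, so
              // checking a plain boolean is much cheaper than a ThreadLocal get.
              static void checkInterrupt(LanguageConnectionContext lcc)
                      throws StandardException {
                  if (lcc.getInterruptedFlag()) {        // hypothetical getter
                      InterruptStatus.throwIf(lcc);
                  }
              }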

          Dag H. Wanvik added a comment -

          The approach seems to work. This should take care of the performance worries for using InterruptStatus#throwIf in API methods.

          Sometimes there is no lcc, e.g. when creating a new database; we could then fall back on the dedicated thread local variable, cf. the EmbedConnection constructor's call to createDatabase.

          Another note: undo is still vulnerable, e.g. an interrupt during rollback hitting the use of StorageRandomAccessFile in Scan.java, cf. this stack:

          ------------ BEGIN SHUTDOWN ERROR STACK -------------

          ERROR XSLA3: Log Corrupted, has invalid data in the log stream.
          at org.apache.derby.iapi.error.StandardException.newException(StandardException.java:279)
          at org.apache.derby.impl.store.raw.log.Scan.getNextRecord(Scan.java:216)
          at org.apache.derby.impl.store.raw.log.FileLogger.undo(FileLogger.java:939)
          at org.apache.derby.impl.store.raw.xact.Xact.abort(Xact.java:949)
          at org.apache.derby.impl.store.raw.xact.XactContext.cleanupOnError(XactContext.java:119)
          at org.apache.derby.iapi.services.context.ContextManager.cleanupOnError(ContextManager.java:333)
          at org.apache.derby.impl.jdbc.TransactionResourceImpl.cleanupOnError(TransactionResourceImpl.java:419)
          at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(TransactionResourceImpl.java:337)
          at org.apache.derby.impl.jdbc.EmbedConnection.handleException(EmbedConnection.java:2277)
          at org.apache.derby.impl.jdbc.ConnectionChild.handleException(ConnectionChild.java:81)
          at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(EmbedStatement.java:1325)
          at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeStatement(EmbedPreparedStatement.java:1683)
          at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeUpdate(EmbedPreparedStatement.java:305)
          at InterruptTest$WorkerThread.run(InterruptTest.java:263)
          Caused by: java.io.InterruptedIOException
          at java.io.RandomAccessFile.read(Native Method)
          at java.io.RandomAccessFile.readInt(RandomAccessFile.java:721)
          at java.io.RandomAccessFile.readLong(RandomAccessFile.java:758)
          at org.apache.derby.impl.store.raw.log.Scan.getNextRecordBackward(Scan.java:394)
          at org.apache.derby.impl.store.raw.log.Scan.getNextRecord(Scan.java:204)

          (roll-back happens here due to the test trying to insert a duplicate after a commit has been interrupted)

          Dag H. Wanvik added a comment -

          And we are vulnerable during (re)boot:

          java.sql.SQLException: Failed to start database 'interrupttest' with class loader sun.misc.Launcher$AppClassLoader@2bbd86, see the next exception for details.
          :
          Caused by: java.sql.SQLException: Failed to start database 'interrupttest' with class loader sun.misc.Launcher$AppClassLoader@2bbd86, see the next exception for details.
          :
          Caused by: java.sql.SQLException: Java exception: ': java.io.InterruptedIOException'.
          :
          Caused by: java.io.InterruptedIOException
          at java.io.RandomAccessFile.readBytes(Native Method)
          at java.io.RandomAccessFile.read(RandomAccessFile.java:322)
          at java.io.RandomAccessFile.readFully(RandomAccessFile.java:381)
          at java.io.RandomAccessFile.readFully(RandomAccessFile.java:361)
          at org.apache.derby.impl.store.raw.log.LogToFile.readControlFile(LogToFile.java:2693)
          at org.apache.derby.impl.store.raw.log.LogToFile.boot(LogToFile.java:3398)
          at org.apache.derby.impl.services.monitor.BaseMonitor.boot(BaseMonitor.java:2020)
          at org.apache.derby.impl.services.monitor.TopService.bootModule(TopService.java:333)
          at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(BaseMonitor.java:572)
          at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Monitor.java:427)
          at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.bootLogFactory(BaseDataFileFactory.java:1747)
          at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.setRawStoreFactory(BaseDataFileFactory.java:1235)
          at org.apache.derby.impl.store.raw.RawStore.boot(RawStore.java:223)
          at org.apache.derby.impl.services.monitor.BaseMonitor.boot(BaseMonitor.java:2020)
          at org.apache.derby.impl.services.monitor.TopService.bootModule(TopService.java:333)
          at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(BaseMonitor.java:572)
          at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Monitor.java:427)
          at org.apache.derby.impl.store.access.RAMAccessManager.boot(RAMAccessManager.java:1019)
          at org.apache.derby.impl.services.monitor.BaseMonitor.boot(BaseMonitor.java:2020)
          at org.apache.derby.impl.services.monitor.TopService.bootModule(TopService.java:333)
          at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(BaseMonitor.java:572)
          at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Monitor.java:427)
          at org.apache.derby.impl.db.BasicDatabase.bootStore(BasicDatabase.java:760)
          at org.apache.derby.impl.db.BasicDatabase.boot(BasicDatabase.java:193)
          at org.apache.derby.impl.services.monitor.BaseMonitor.boot(BaseMonitor.java:2020)
          at org.apache.derby.impl.services.monitor.TopService.bootModule(TopService.java:333)
          at org.apache.derby.impl.services.monitor.BaseMonitor.bootService(BaseMonitor.java:1857)
          at org.apache.derby.impl.services.monitor.BaseMonitor.startProviderService(BaseMonitor.java:1723)
          at org.apache.derby.impl.services.monitor.BaseMonitor.findProviderAndStartService(BaseMonitor.java:1601)
          at org.apache.derby.impl.services.monitor.BaseMonitor.startPersistentService(BaseMonitor.java:1020)
          at org.apache.derby.iapi.services.monitor.Monitor.startPersistentService(Monitor.java:550)
          at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(EmbedConnection.java:2679)

          Dag H. Wanvik added a comment -

          I feel pretty confident by now that the main approach will improve things, so I'll upload a final version of the experimental patch, and then start producing incremental patches for commit, roughly in the following order:

          a) hooks in the APIs to resurrect interrupt flags before we return to the user's app, on the basis of info collected during operation, plus the new util class InterruptStatus, which is a collection of methods to save away and resurrect interrupt flags and to throw an interrupt SQLException. I intend not to throw any exception if the API method ran to completion, just resurrect the flag.
          b) the modified RAFContainer4 stuff for reading and writing db cache pages safely with NIO.
          c) the modified DirRandomAccessFile4 stuff + log retries for safe operation of log writing
          d) stop execution if intr seen within a batch, when checking query timeout, and when seeing intr while waiting for locks
          e) make undo safe
          f) time permitting, make boot and recovery interrupt safe.

          Please let me know if there are any objections or suggestions to this plan! Whether all this gets ready for 10.7 is an open question, but hopefully we should not behave worse than before at any step.

          Dag H. Wanvik added a comment -

          A final version of the experimental patch, derby-4741-all+lenient+resurrect uploaded.
          Main changes:

          • use lcc to store interrupt status for a session thread. If not available fall back on a thread local.
          • always resurrect the interrupt flag before returning to the user app, whether we throw or not (see next item)
          • don't throw if an API operation ran to completion without trouble, even if we did see an interrupt flag

          The ad hoc InterruptTest and JUnit Derby151Test should run OK on both Solaris and Windows now.
          The internal behavior is somewhat different since the JRE on Solaris throws InterruptedIOException in more cases, cf. http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4385444.

          Dag H. Wanvik added a comment - edited

          Note to self: I have found a problem with the interrupt recovery of
          RAFContainer4. Its call to openContainer needs to be protected by the
          monitor on FileContainer#allocCache, because opening a container
          eventually leads to a call on AllocationCache#reset. The AllocationCache
          javadoc states that callers need to synchronize themselves since the
          class is itself not MT safe.

          Threads inside RAFContainer4#{readPage, writePage} do not necessarily
          own this monitor when recovery is attempted.

          I did once see a race condition due to this. In the race condition, a
          thread was trying to write to a new page and got an array out of bounds
          exception inside AllocationCache.validate (numExtents was suddenly back to 0)
          because another thread was doing interrupt recovery by calling
          RAFContainer4#recoverContainerAfterInterrupt ->
          openContainer -> ... -> AllocationCache#reset (unprotected).

          [edit add]: Simply wrapping recoverContainerAfterInterrupt's call to openContainer in synchronized(allocCache) won't work: it can lead to deadlock.

          Knut Anders Hatlen added a comment -

          I ran MicroAPITest with derby-4741-all+lenient+resurrect.diff (20 runs with each configuration, insane jars, pageCacheSize=12500). Now I only see a 1% degradation with the patch (p-value=0.07, according to the student t calculator at http://www.bio.miami.edu/rob/Students_t.html), so that definitely looks like a good improvement from the previous patch. I also modified the test so that it reused existing test data instead of creating it on every run. Then I didn't see any degradation. In fact, the results with the patch were 0.5% better than trunk when I ran the test that way.

          I also saw something similar with the previous patch. I saw an 8% degradation with the unmodified test, but the degradation was only about 2% with the modified test. This is also consistent with the findings Dag posted in the comment dated 15/Oct/10 12:06 PM, where he saw that the full patch gave a 2.5% degradation, whereas if he only applied the parts of the patch that were actually exercised during the data collection phase of the test, he didn't see any degradation.

          I don't know why the changes in the test setup code alters the result this way. The part of the test where we collect results should exercise the exact same code path with the two variants of the test. This reminds me of the effects we saw in DERBY-4233, which were attributed to JIT at that time. I think the theory at that time was something like this: The just-in-time compiler collects statistics before it compiles the code and uses the statistics to optimize the generated native code. The more time we spend in the test setup code, the more the statistics will be biased towards the setup code. For this test, that could mean that the JIT-compiled native code is optimized for the insert code path, whereas we're really only interested in the select code path in the test. Furthermore, since the patch changes code on the insert code path, the collected statistics and the chosen optimizations may change, and that may ultimately affect how the select code is optimized, even if the select code path barely has been touched by the patch.

          Lily Wei added a comment -

          I can confirm Knut's numbers for MicroAPITest.java: a 1-3% degradation, or in some cases better, with the derby-4741-all+lenient+resurrect.diff patch. Thanks for all the great work, Dag. A lot of customers will benefit from this. Wow!!!

          It seems like, with this new approach, we will throw an error back to the caller to indicate they are the one generating the interrupt, and we will also throw an error in the case of an interrupt during a lock wait. Will you elaborate on the new patch behavior please? I am still having trouble understanding it from a data flow perspective and would like to see a write-up of exactly what errors I will see from Derby.

          This is one of the comments you wrote to me. Sorry for not getting back to you promptly.
          >> Do we need InterruptException on synchronized (slaveRecoveryMonitor) in
          >> LogToFile.recover(...)?
          >I don't think I changed that part of the code. I see that an interrupt
          >is just ignored there, so I think we can safely add a call to
          >"InterruptStatus.setInterrupted" there as well, if that's what you
          >mean.
          Yes, this is what I mean. I added it in my client workspace and ran some tests; it does not seem to harm anything. I know you are concentrating on "normal operation" and it is really a lot of work. Will we add a note somewhere about the code paths this interrupt handling does not yet cover, such as backup,
          export/import, LOB/CLOB streams and replication, where I/O can also be interrupted?

          I ran suites.All on Windows and it is looking better than ever!!!

          Dag H. Wanvik added a comment -

          Hei Lily, thanks for your continued interest in this issue!

          > Will you elaborate on the new patch behavior please?

          My present thinking is that the goal should be:

          a) The best user experience will be that for operations (technically
          JDBC API calls) which detect an interrupt but manage to retry and
          continue the operation, and for which we do not cut the operation
          short (because there is no natural point at which to interrupt it, or
          we deem it to be so short in duration that we might as well see it
          through), we:

          • do not throw an exception
          • restore the interrupted status flag just prior to return from
            the operation (JDBC API call)

          b) For operations that we actually cut short due to the interrupt,
          see the enumeration in d) below, we:

          • throw SQLException 08000, session level (we might change this
            severity later as discussed)
          • also resurrect the interrupt status flag just prior to return
            from the operation (JDBC API call)

          c) For operations which detect an interrupt and manage to retry and
          continue the operation, but eventually throw an SQLException (not
          08000, some other), we do as in a):

          • restore the interrupted status flag just prior to return from
            the operation (JDBC API call)

          d) Operations that we may currently cut short if we see an interrupt
          early enough:

          • executeBatch (we check between each statement)
          • executeQuery (we check in the present
            BasicNoPutResultSetImpl#checkCancellationFlag in the same
            location that we check for timeout currently)
          • while waiting for a lock, cf. LockSet.java,
            ConcurrentLockSet.java

          Further suggestions for good places to check and throw are welcome.
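
          From an application's point of view, case b) should look roughly like
          this (illustrative only; stmt is an ordinary java.sql.Statement, and
          08000 is the SQL state mentioned above):

              try {
                  stmt.executeUpdate("INSERT INTO t VALUES (1)");
              } catch (SQLException e) {
                  if ("08000".equals(e.getSQLState())) {
                      // The operation was cut short because this thread was
                      // interrupted; the interrupt flag is resurrected before we
                      // throw, so the application still sees it.
                      boolean stillInterrupted =
                          Thread.currentThread().isInterrupted();
                  }
              }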

          > .. synchronized (slaveRecoveryMonitor): And, I add it to my client
          > space and run some tests. It seems not harming anything.

          Good, we'll add it then.

          > Will we add a note somewhere about the code paths this interrupt
          > handling does not yet cover?

          For the time being, we can make notes in this JIRA issue, but
          eventually I suspect we will want to make a release note to describe
          the improvements, and what remains to be done. You are right that
          there is a lot of work; it is also hard to test, unfortunately.

          Dag H. Wanvik added a comment - edited

          Uploading a first rev of a patch for part a): derby-4741-a-01-api-interruptstatus
          Regressions ran ok modulo a hang in BootLockTest.

          This patch contains the new helper class InterruptStatus and inserts calls to
          restoreIntrFlagIfSeen before API methods' return and in the exception handling (TransactionResourceImpl#handleException).

          EmbedStatement#executeBatch now checks for interrupts between each statement in the batch and throws 08000 if one is seen.

          Note: the machinery of InterruptStatus isn't yet used to save any interrupts; that follows in a later patch, so this patch doesn't change behavior. The focus here is on the correct placement of calls to
          restoreIntrFlagIfSeen in the API. Reviews appreciated!

          ------------------------ Patch details:

          • iapi.util.InterruptStatus: new helper class for keeping track of detected and
            postponed interrupts. See also its Javadoc.
          • LanguageConnectionContext, GenericLanguageConnectionContext:

          Methods to set/reset the thread's interrupt status flag, to allow us to clear the thread's actual flag temporarily and to reset it when we return to the user application, possibly by throwing an interrupt SQLException (SQL state 08000) if we cut something short. It also contains methods to save and retrieve the stack frame at the point an interrupt was detected; this is used when we throw the interrupt exception.

          • TransactionResourceImpl: resurrect the thread's interrupt status flag on SQL error (called from handleException). Use lcc if available, else thread local variable.
          • EmbedBlob, for methods that call handleException inside synchronized on connection: use InterruptStatus.restoreIntrFlagIfSeen(lcc) before return
          • EmbedResultSet: for methods that call handleException inside synchronized on connection: use InterruptStatus.restoreIntrFlagIfSeen(lcc) before return
          • EmbedStatement: for methods that call handleException inside synchronized on connection: use InterruptStatus.restoreIntrFlagIfSeen(lcc) before return

          In executeBatch, use throwIf between each statement in the batch: if we saw an interrupt, stop execution of the batch (a sketch follows after this list). Note that if we do throw, restoreIntrFlagIfSeen will be called in the finally block, but it is idempotent, so no harm is done. It is necessary in case we saw an interrupt in the last statement of the batch.

          • EmbedDatabaseMetaData: for methods that call handleException inside synchronized on connection: use InterruptStatus.restoreIntrFlagIfSeen(lcc) before return
          • EmbedPreparedStatement: for methods that call handleException inside synchronized on connection: use InterruptStatus.restoreIntrFlagIfSeen(lcc) before return
          • EmbedConnection.java: for methods that call handleException inside synchronized on connection: use InterruptStatus.restoreIntrFlagIfSeen(lcc) before return

          For createDatabase, use InterruptStatus.restoreIntrFlagIfSeen() which inspects the thread local variable. This should catch I/O interrupts seen during the database creation phase.

          Special handling in #close to make sure we access lcc while it is still available.
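
          As promised above, a rough sketch of the executeBatch behavior; this is
          not the actual EmbedStatement code, and the helper names are made up:

              // one iteration per batched statement
              for (int i = 0; i < batchSize; i++) {
                  InterruptStatus.throwIf(lcc);   // 08000 if an interrupt was recorded
                  executeBatchElement(i);         // execute the i'th batched statement
              }
              // The enclosing finally block calls restoreIntrFlagIfSeen(lcc); it is
              // idempotent, so an interrupt seen in the last element is still restored.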

          Lily Wei added a comment -

          Hi Dag:
          Thank you so much for the write-up. The a, b, c, d points are all very clear and straightforward. Personally, I like the fact that we are checking interrupts at executeBatch, executeQuery, BasicNoPutResultSetImpl (same place as the timeout check) and while waiting for locks. Thank you for adding the synchronized (slaveRecoveryMonitor) call and all the necessary release note information.

          I applied derby-4741-a-01-api-interruptstatus. All of suites.All passed and the timing for MicroAPITest is close to clean trunk. It also passed Derby151Test with the interrupt check. Mailjdbc is still having the deadlock issue; however, like you said, it might or might not be caused by the new change. I notice some of the interrupt code is in slightly different places in Embed*.java, with a local initialized to null, and some of the interrupt handling code from the previous patch is not in this one. I am assuming that code is the machinery of InterruptStatus that saves interrupts and it will come in a later patch.

          I view this patch as something that will benefit a lot of customers and would love to help out if I can. Thank you so much for working so hard on this.

          Dag H. Wanvik added a comment - edited

          Hi Lily, thanks for test driving the patch; I really appreciate your help looking at this work. Having another set of eyes on this is good, since the changes are a bit cross-cutting and affect a lot of existing code, so it's good to have someone double checking them. Two comments:

          • if the mailjdbc test still deadlocks, I doubt it can be due to this patch, since it doesn't really change any behavior, it just installs the machinery for the behavior changes to come in later patches, as you presumed.
          • the difference in EmbedBlob, with a local initialized to null, is just some cruft from an experiment I did with trying to place restoreIntrFlagIfSeen in a finally block. This didn't work out since error handling may close the connection and with it the lcc, so the check would happen too late. (The current method is to check before return and in the error handling code itself.) I'll revert that change, thanks for noticing.

          My main concern with this patch currently is to verify that I have found all the ways control can return to the user app after Derby code has had control and possibly set the interrupt status flag. A case in point is the Blob methods that don't require a connection, e.g. EmbedBlob#getBytes: if the lob is materialized, we call control.read, and if we see an interrupt inside the read, the flag would get stored in a thread local variable since there is no lcc, so I'll need to add a call to the no-args version of restoreIntrFlagIfSeen() there. EmbedClob needs analysis, too.

          Dag H. Wanvik added a comment -

          Uploading derby-4741-a-02-api-interruptstatus, an improved version of
          a-01.

          I went over the usages of restoreIntrFlagIfSeen in the API and
          reevaluated where it is possible to supply the lcc and where it is not
          safe: the intention is that if we stored the interrupt status flag in
          an lcc, we should recover it from there before returning to the user
          app and, similarly, if we stored the interrupt status flag in a thread
          local variable because an lcc was not available, we should recover it
          from there. It turns out it is not always obvious to determine where to
          look in the API methods.

          API methods vary with respect to whether they:

          a) synchronize on the connection or not (cf. possible bugs, see
          DERBY-4129), even when accessing store.

          b) call setupContextStack or not (this is what lets us find the
          lcc down in store using ContextService.getContextOrNull)

          c) clean up when errors occur by calling handleInterrupt(t), cf. my
          comment on DERBY-4129, e.g. EmbedClob. The cleanup can lead to
          a connection closing if severity is high, setting its lcc to
          null.

          d) even if an API method doesn't call setupContextStack, it may be
          called indirectly from a context that does set it up, e.g. as
          part of a trigger on an insert.

          When a connection (or database) is being closed down, we need to be
          careful so that we catch the lcc while it is still available. When
          cleaning up errors in TransactionResourceImpl#handleException we do
          not directly know if an lcc is available or not.

          The upshot of all this is that unless we are sure we have a valid lcc
          (i.e. we see statically that setupContextStack has been called), we
          instead call restoreIntrFlagIfSeen without an lcc argument. The
          zero-arg variant calls getContextOrNull to see if an lcc is available
          for the thread, falling back to the thread local if not. So, in
          effect, the lcc variant is an optimization, but it does cover the
          methods for which performance could be an issue, I think.
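
          A sketch of how the zero-arg variant might look; the field and overload
          names are assumptions here, not the final code:

              public static void restoreIntrFlagIfSeen() {
                  LanguageConnectionContext lcc = (LanguageConnectionContext)
                      ContextService.getContextOrNull(
                          LanguageConnectionContext.CONTEXT_ID);

                  if (lcc != null) {
                      restoreIntrFlagIfSeen(lcc);         // cheap flag on the session
                  } else if (Boolean.TRUE.equals(receivedInterruptDuringIO.get())) {
                      receivedInterruptDuringIO.set(Boolean.FALSE);
                      Thread.currentThread().interrupt(); // resurrect for the app
                  }
              }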

          The new patch covers LOB and CLOB as well, and is ready for review.

          Regressions ran ok.

          Dag H. Wanvik added a comment -

          Uploading version a-03, minor and clerical changes only relative to a-02.

          Lily Wei added a comment -

          When running MicroAPITest against the a-03 patch, it takes 45-48 seconds vs 27-28 seconds for clean trunk. I noticed it is not using the lcc but the earlier, slower approach. Are you seeing the same slow Windows performance against the a-03 patch?

          Dag H. Wanvik added a comment -

          Thanks for test driving it, Lily. The new patch does use the lcc if available; it just falls back on the TL approach if not, so the performance drop was unexpected to me. I'll have a look, thanks for noticing.

          Dag H. Wanvik added a comment -

          Lily, did you run with a sane or an insane build? I also see a large difference (on Solaris, though) with sane (due to an expensive assert), but with insane it is very small (~<1%).

          Lily Wei added a comment -

          Hi Dag: That is interesting. I was using a sane build. Shouldn't a sane build have performance similar to an insane one, or do they usually differ?

          Knut Anders Hatlen added a comment -

          The sane jars are not intended for production use, so I wouldn't worry too much about a performance loss that we only see with them. Unless it's getting so slow that it takes an unreasonably long time to run the regression tests with them, that is.

          I'm guessing that the extra performance loss seen with sane jars is caused by the assert in InterruptStatus.restoreIntrFlagIfSeen(LanguageConnectionContext), which makes us look up the lcc in the context stack. Twice, actually. Perhaps it would help a little to store the result from getContextOrNull() in a local variable so that we get away with a single call. Or perhaps it's the unconditional string concatenation for the message text that's costly. In any case, that assert is the only place I can see that the patch would have a different performance impact on sane and insane.
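
          Something along these lines, assuming the current assert really does look
          the lcc up twice (sketch only, and the message text is illustrative):

              if (SanityManager.DEBUG) {
                  // look up the context lcc once and reuse it in the assert
                  LanguageConnectionContext ctxLcc = (LanguageConnectionContext)
                      ContextService.getContextOrNull(
                          LanguageConnectionContext.CONTEXT_ID);

                  SanityManager.ASSERT(
                      lcc == ctxLcc,
                      "lcc=" + lcc + " but context lcc=" + ctxLcc);
              }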

          Knut Anders Hatlen added a comment -

          Hi Dag,

          I went through the a-03-api patch, and I couldn't spot anything that
          you've missed.

          There were quite a number of places to check the interrupt
          status. More than what I had expected. But I cannot think of a better
          place to short-circuit it, so... Except perhaps in EmbedConnection. It
          looks to me like most of the calls to restoreIntrFlagIfSeen() happen
          within try blocks right after calls to setupContextStack(). Did you
          consider it as an option to move the restoreIntrFlagIfSeen() calls
          into restoreContextStack(), which is already called in the finally
          clauses at those places? That may make the code more contained, but
          perhaps the LCC isn't easily available there?

          Apart from that, I only have some minor comments to the code in the
          InterruptStatus class. Very minor, so feel free to ignore.

          • throwIf: Unnecessary return statement.
          • stackTrace: The initialValue() override does the same as the default
            implementation. So by just using a vanilla ThreadLocal object, we'd
            get the exact same behaviour, and one less class in derby.jar +
            permgen. (If the footprint issue is important, the same might be
            said for receivedInterruptDuringIO if we change its usage from a
            TRUE/FALSE check to a null/non-null check. Actually, the presence of
            a stack trace is probably a good enough indicator in itself.)
          • Instead of creating an exception, fetching its stack trace and
            storing the trace, perhaps it would be easier just to store the
            exception? Then throwIf() could throw the stored exception instead
            of creating a new one and modifying its stack trace.
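
          A rough sketch of the last two suggestions combined (hypothetical names, not the actual InterruptStatus code): keep the exception itself in a plain ThreadLocal and let throwIf() rethrow it.

          // a non-null value doubles as the "interrupt seen" indicator
          private static final ThreadLocal exception = new ThreadLocal();

          static void noteInterrupt(StandardException e) {
              exception.set(e);
          }

          static void throwIf() throws StandardException {
              StandardException e = (StandardException) exception.get();
              if (e != null) {
                  exception.set(null); // reset for the next operation
                  throw e;
              }
          }
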
          Dag H. Wanvik added a comment -

          Knut is right, that was the expensive assert I mentioned before (in sane builds). Since this is a micro benchmark that mainly measures the CPU needed to execute ResultSet.next when data is cached in memory, the effect of the assert shows up heavily. Running a profiler, I saw the string concatenation there being dominant. No, I did not notice the regression tests become slower, but if that turns out to be an issue, it is easy enough to comment out the assert. For now, I'll change the assert code to just call getContextOrNull once.

          Dag H. Wanvik added a comment -

          Thanks for looking at the patch, Knut!

          > Except perhaps in EmbedConnection. It looks to me like most of the calls to restoreIntrFlagIfSeen() happen within try blocks right after calls to setupContextStack(). Did you consider it as an option to move the restoreIntrFlagIfSeen() calls into restoreContextStack(), which is already called in the finally clauses at those places?

          I did try to put the call into the finally, although not into restoreContextStack. I found that didn't always work, since during error handling before we get to the finally block, calls to handleException(throwable) will, depending on severity, potentially take down the connection (and database), causing us to lose the lcc. Hence I had to move it back to just before the normal return, and use the no-args variant of restoreIntrFlagIfSeen inside the error handling code (cf. changes in TransactionResourceImpl#handleException).

          > - throwIf: Unnecessary return statement.

          Agreed, will fix that

          > - stackTrace: The initialValue() override does the same as the default implementation. So by just using a vanilla ThreadLocal object, we'd get the exact same behaviour, and one less class in derby.jar + permgen.

          Agreed.

          > (If the footprint issue is important, the same might be said for receivedInterruptDuringIO if we change its usage from a TRUE/FALSE check to a null/non-null check. Actually, the presence of a stack trace is probably a good enough indicator in itself.)

          Yes and yes. Originally I used the stacktrace just for debugging with sane, I think.

          > - Instead of creating an exception, fetching its stack trace and storing the trace, perhaps it would be easier just to store the exception? Then throwIf() could throw the stored exception instead of creating a new one and modifying its stack trace.

          That's a good simplification, yes.

          Dag H. Wanvik added a comment -

          Uploading a revised patch, derby-4741-a-04-api-interruptstatus, which hopefully takes care of the comments. Rerunning regressions.

          Dag H. Wanvik added a comment -

          Regressions ran ok. Will commit this version soon, if no misgivings emerge.

          Knut Anders Hatlen added a comment -

          Thanks for the changes and the explanations, Dag. I have no further comments to the a-04-api patch.

          One observation: After looking at the changes in EmbedBlob.getBytes(long,int) and EmbedBlob.position(), I understand why some people advocate that all methods should have a single exit point.

          Lily Wei added a comment -

          I test drove the a-04-api patch as well. The insane MicroAPITest result is very good: I got 5s-7s. The sane result is similar to a-03-api, around 45s-52s vs 27s-28s on clean trunk. However, as we discussed previously, the MicroAPITest benchmark focuses only on CPU, and the slowdown is mostly in the ASSERT. With so many things to take care of for the 10.7 release, +1 to commit.

          Dag H. Wanvik added a comment -

          Committed derby-4741-a-04-api-interruptstatus as svn 1030630.

          Dag H. Wanvik added a comment - - edited

          Uploading patch derby-4741-b-01-nio. This patch contains changes to
          FileContainer/RAFContainer/RAFContainer4 to allow page IO to recover
          from interrupts seen in RAFContainer4 (the NIO[1] version
          specialization of RAFContainer). Additionally, some waits are changed
          to retry after interrupts (yes, there are more of those in other
          classes, that's for a later patch, but since I was in there anyway..)

          The main thrust of the patch is in RAFContainer4. Upon seeing
          exceptions from IO (readPage/writePage), a thread will determine
          whether it is
          a) the thread that caused the channel to become closed, or
          b) a thread that suffered an exception because another thread caused
          the channel to become closed.

          If a), the thread will recover by reopening the container before
          reattempting IO, cf. the method recoverContainerAfterInterrupt. If b),
          the thread will yield, wait, and retry until the container has been
          recovered or a minute has passed. After a minute, we give up and
          throw FILE_IO_INTERRUPTED. I chose a minute somewhat arbitrarily; it
          is hopefully sufficient.

          The recovery thread and other waiting threads normally synchronize the
          operations via variables protected by the channelCleanupMonitor
          monitor (but see use of volatile and associated comments).
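
          Conceptually, the waiting side might look roughly like this (a hedged
          sketch with assumed field and method names, not the actual
          awaitRestoreChannel code):

          private final Object channelCleanupMonitor = new Object();
          private boolean restoreChannelInProgress; // toggled by recovery thread

          private void waitForChannelRecovery() throws StandardException {
              final long giveUpAt = System.currentTimeMillis() + 60 * 1000L;
              boolean sawInterrupt = false;

              synchronized (channelCleanupMonitor) {
                  while (restoreChannelInProgress) {
                      if (System.currentTimeMillis() > giveUpAt) {
                          throw StandardException.newException(
                                  SQLState.FILE_IO_INTERRUPTED);
                      }
                      try {
                          channelCleanupMonitor.wait(1000L);
                      } catch (InterruptedException e) {
                          sawInterrupt = true; // remember it, keep waiting
                      }
                  }
              }

              if (sawInterrupt) {
                  Thread.currentThread().interrupt(); // re-establish the flag
              }
              // the caller retries the page IO against the reopened channel
          }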

          The retry logic happens in one of three places, in increasing
          closeness to RAFContainer4:

          a) if the thread owns the monitor on allocCache and it is not the
          one that is doing recovery, it will back up to FileContainer
          before retrying the IO, since it has to release the lock on
          allocCache, because that monitor is needed by the thread doing
          recovery. Cf. changes in FileContainer. If the thread holds
          allocCache, it goes into stealth mode in readPage/writePage:
          this means that the recovery thread doesn't see it. We can't
          let a stealth mode thread grab channelCleanupMonitor, as this
          could cause a deadlock. In c) the recovery thread keeps track
          of other threads, waking them up when recovery is done.

          b) if the thread owns the monitor on "this", again the thread goes
          into stealth mode. If it is not the one to clean up, it will
          back up to RAFContainer#clean's loop, since it has to release
          the lock on "this", because that monitor is also needed by the
          thread doing recovery. This only ever happens for readPage (not
          writePage), cf. call to getEmbryonicPage in writeRAFHeader
          called from RAFContainer#clean.

          c) in other cases, the thread will call awaitRestoreChannel before
          reattempting IO.

          The logic of RAFContainer4#getEmbryonicPage has been folded into
          readPage in order for it to be covered by the interrupt treatment.

          The test Derby151Test has been removed, since it is no longer
          relevant. A new test, InterruptResilienceTest, fails without the
          patch, but should work with the patch in place. It interrupts the
          single app thread before doing IO, which will trip NIO and cause
          recovery. Even though we have only one app thread, the test still
          revealed concurrency issues during the development of this patch,
          since RawStoreThread will also do IO on the container.

          A note on the implementation: Sun Java NIO code (including the
          latest 1.6) has bugs which are worked around by the patch in two
          places, cf. code comments:

          • readFull/writeFull: if a thread has been interrupted,
            occasionally it closes the channel but does not throw,
            cf. http://bugs.sun.com/view_bug.do?bug_id=6979009
          • readPage/writePage: sometimes the interrupted thread throws
            the wrong exception: AsynchronousCloseException instead of
            the more specialized ClosedByInterruptException.

          [1] Note: DirRandomAccessFile4 also contains uses of NIO; I'll address
          that in a later patch.

          I will also be adding more tests. A problem for making tests for
          interrupts is to get coverage without making tests unstable, so I'll
          wait until I have more holes plugged.

          Regressions worked earlier tonight in a slightly different version, rerunning now.

          Lily Wei added a comment -

          Hi Dag: Thank you so much for the continued effort to fix this issue. It will benefit a lot of customers, and this is truly a hard bug/feature to fix. I am still trying to process the effect of FileContainer/RAFContainer/RAFContainer4; I don't quite understand it yet. The performance result for MicroAPITest is 5s-7s for the insane build and 45s-47s for the sane build. That's great. Regressions can fail in my Windows environment: testPartialRowRTStats (Java exception: 'PermGen space: java.lang.OutOfMemoryError') or
          testReplicationt_Local_1_Indexing (java.sql.SQLException: DERBY SQL error: SQLCODE: -1, SQLSTATE: XJ041, SQLERRMC: Failed to create database 'c:\derby5\trunk\testall\db_master\wombat', see the next exception for details.::SQLSTATE: XBM0J Directory c:\derby5\trunk\testall\db_master\C:\derby5\trunk\testall\db_master\wombat already exists). I ran Suites.All three times, and they all had similar failures. I cannot reproduce them if I run the tests individually. Will this patch affect memory consumption? Will the timing change cause the java.sql.SQLException: DERBY SQL error: SQLCODE: -1, SQLSTATE: XJ041, SQLERRMC? I am not 100% sure; or they were just existing issues. For more multi-threading coverage, I added to the InterruptResilienceTest suite. If it is in the right direction, I can keep exploring to make more positive interrupt cases. Thanks Kristian for the AbstractMTThread class. I am so happy to work on this. Yeah!!!

          Dag H. Wanvik added a comment -

          Thanks, Lily, for looking at the latest patch. I agree the code is a bit tricky, so there could be bugs. Feel free to ask detailed questions, I'll try to answer.

          I did not see any issue with testPartialRowRTStats when I ran the regressions, nor would I expect this patch to increase perm gen space since it doesn't add new classes. But we have seen that perm gen space can be an issue when running regressions. I think I recently upped my permgen space setting for suitesAll to 192m. I have also seen failures in the replication tests, there are some instabilities left, but I don't think they are due to this patch.

          You are right that the current test, InterruptResilienceTest, is pretty thin right now, but it does exercise the RAF recovery machinery with two concurrent threads (user thread and the raw store daemon).

          I intend to make more tests; the trick is to make sure the tests actually exercise the code paths we want: just shooting interrupts at threads can land anywhere during execution, and Derby doesn't stand up to it in all places yet. I plan to use the SanityManager + a debug flag to simulate interrupts at selected places in the code. If you have suggestions for good tests, please let's have them.

          Lily Wei added a comment - - edited

          As for testing ideas, SanityManager + a debug flag sounds good. I am trying to picture how it can cover all the code we want to hit. I am assuming we want to hit code like RAFContainer4.readPage. A simple insert case with Blob and Clob data types could hit it, like InterruptResilienceTest does. Will BlobSetMethodTest or LobStreamTest do the trick? A far stretch would be utilizing CorruptRandomAccessFile. I am thinking out loud here.

          Knut Anders Hatlen added a comment -

          I had a look at the b-01-nio patch, and it looked correct to me. The
          stealth mode part wasn't entirely obvious, but the comments were
          helpful and the code appears to be doing the right thing.

          As far as I can see, the code will behave the same way as before if no
          interrupt is detected. So if there are problems with the code, they
          will hopefully be limited to the case where the thread is
          interrupted. There's an additional container-wide synchronization on
          every page read/write, though, which may possibly have a negative
          effect on performance in multi-threaded environments. If that turns
          out to be an issue, would it be possible to change the code to only
          check restoreChannelInProgress after an exception has been thrown,
          similar to what we currently do in stealth mode?

          Some minor issues found in the patch:

          1) Thread.holdsLock() is a static method, but all calls to it are done
          on Thread.currentThread().

          2) I think the checks for SQLState.INTERRUPT_DETECTED should use
          getMessageId() instead of getSQLState(), since the SQLState interface,
          despite its name, contains message ids and not SQLStates (although the
          message id and the SQLState happen to be the same for this error).

          And since INTERRUPT_DETECTED is only intended to be used internally,
          would it make sense, instead of giving INTERRUPT_DETECTED a proper
          SQLState and also an entry in messages.xml (which will be picked up
          and translated unnecessarily later), to create a subclass of
          StandardException for this error? Something like what's done with
          NoSpaceOnPage, I mean. That could also make the catch blocks used to
          handle this condition a little bit tidier.

          3) Javadoc for RAFContainer4.readPage() should explain how the offset
          parameter is supposed to be used (-1 for normal reads, actual offset
          for embryonic pages)

          4) The variable "whence" in awaitRestoreChannel() does not seem to be
          used. Nor is its "pageNumber" parameter, I think.

          Knut Anders Hatlen added a comment -

          And a note to others who consider reviewing the patch: I found that it was easier to see exactly what was being changed by running svn diff with the extra option "--extensions=-bw" to disregard whitespace changes in the parts of the code where an extra level of nesting was added.

          Dag H. Wanvik added a comment -

          Thanks for the review, Knut. Uploading a new version of the patch,
          derby-4741-b-02-nio, details below, rerunning regressions.

          • In RAFContainer4, folded the cases for pageNumber == -1 (call from
            getEmbryonicPage) with the case Thread.holdsLock(this). This allows
            awaitRestoreChannel to throw the retry exception for all cases. I
          did both static and dynamic analysis to establish the invariant, and
            added a sane check for it.

          > As far as I can see, the code will behave the same way as before if no
          > interrupt is detected. So if there are problems with the code, they
          > will hopefully be limited to the case where the thread is
          > interrupted. There's an additional container-wide synchronization on
          > every page read/write, though, which may possibly have a negative
          > effect on performance in multi-threaded environments. If that turns
          > out to be an issue, would it be possible to change the code to only
          > check restoreChannelInProgress after an exception has been thrown,
          > similar to what we currently do in stealth mode?

          Yes, I believe we could do that, since it should do no harm to attempt
          the IO even when recovery is in progress, because all calls to
          getChannel are synchronized on "this" already, so either the thread
          sees the old channel (closed), or the new reopened channel. I didn't
          make this change yet, though, since the monitor hold should be
          short-lived (one boolean check and an integer increment) compared to
          the IO.

          • Thread.holdsLock(): done
          • Created a new subclass of StandardException:
            InterruptDetectedException, good suggestion, indeed cleaner!
          • Javadoc for RAFContainer4.readPage: done
          • The variable "whence": removed
          • Removed some commented out debugging cruft
          • Tuned the number of iterations in InterruptResilienceTest to make
            sure we see a concurrent thread (RawStoreDaemon) having to wait
            for cleanup before proceeding, at least on my box. Cf. the debug
            trace for derby.debug.true=RAF4Recovery, which I also added.
          Dag H. Wanvik added a comment - - edited

          Found that Derby uses interrupts to stop threads at shutdown[1]. The present logic in RAFContainer4 doesn't
          take this into account, with the consequence that an application thread got stuck waiting for
          a container recovery that never took place. We get an exception during execution of
          closeContainer();
          openContainer(currentIdentity);
          If this exception is not due to an interrupt, we throw FILE_IO_INTERRUPTED. Unfortunately, in this case,
          the present patch neglects to release any threads waiting for recovery to finish, so they are stuck waiting for "restoreChannelInProgress".

          Saw this during suitesAll.

          [1] BaseMonitor#shutdown > notifyAllActiveThreads and in DatabaseContextImpl#cleanupOnError>notifyAllActiveThreads

          Lily Wei added a comment -

          This might sound weird. I am so happy you got this. I woke up to a hanging test for Suites.All and now I know why. Thank you so much for posting this and looking into this.

          Lily Wei added a comment -

          I added the following code in the ClosedChannelException block in RAFContainer4.java->readPage and did not see any hang for Suites.All in my Windows 7 environment:
          if (pageNumber != -1L) {
              // Recovery is in progress, wait for another interrupted thread
              // to clean up.
              awaitRestoreChannel(e, stealthMode);
          } else {
              // Called from this.getEmbryonicPage.
              //
              // Retry handled in RAFContainer#clean. This is
              // necessary since recovery needs the monitor on "this"
              // held by clean when it calls getEmbryonicPage via
              // writeRAFHeader, so never call awaitRestoreChannel
              // here, since then might be stuck on
              // channelCleanupMonitor held by another thread
              // cleaning up, while still hold onto "this".

              if (stealthMode) {
                  // don't touch threadsInPageIO
              } else {
                  synchronized (channelCleanupMonitor) {
                      threadsInPageIO--;
                  }
              }
          }

          Dag H. Wanvik added a comment - - edited

          Uploading patch b-03, which adds logic to RAFContainer4 to properly
          handle the case where a container can't be reopened:

          // In case the recovering thread can't successfully recover the container,
          // it will throw, so other waiting threads need to give up as well. This
          // can happen at shutdown time when interrupts are used to stop threads.
          private boolean giveUpIO = false;
          private final Object giveUpIOm = new Object(); // its monitor

          We make sure to notify any waiting threads when the recovering thread gives up and throws FILE_IO_INTERRUPTED. The waiting threads will next see giveUpIO == true, and give up as well.

          This should make sure that threads not doing the recovery will detect that recovery will not happen, and throw as well. SuitesAll finished successfully.
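
          Roughly, the handshake described above could look like this (a hypothetical sketch using the fields quoted above plus the channel recovery monitor; not the committed code):

          // recovering thread, when reopening the container fails:
          private void signalGiveUp() {
              synchronized (giveUpIOm) {
                  giveUpIO = true;
              }
              synchronized (channelCleanupMonitor) {
                  channelCleanupMonitor.notifyAll(); // wake all waiting threads
              }
          }

          // a waiting thread, once it wakes up:
          private void throwIfGivenUp() throws StandardException {
              synchronized (giveUpIOm) {
                  if (giveUpIO) {
                      throw StandardException.newException(
                              SQLState.FILE_IO_INTERRUPTED);
                  }
              }
          }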

          Dag H. Wanvik added a comment -

          Lily, it seems you reintroduced some of the logic I folded into awaitRestoreChannel, but without throwing to get a retry for the case pageNumber == -1 (throw INTERRUPT_DETECTED). I don't think this is correct; I'm not sure why this made a difference. Can you try the latest patch on Windows 7?

          Dag H. Wanvik added a comment -

          I have now run regressions with patch b-03 on Windows Vista SP2 without errors. Since the 10.7 branch has been cut, I intend to commit this patch soon, and move on to the next piece of the puzzle.

          Dag H. Wanvik added a comment -

          For reviewers of b-03, it may be interesting to note that running InterruptResilienceTest with the flag -Dderby.debug.true=RAF4Recovery
          will likely give one or more instances of this trace in derby.log:

          DEBUG RAF4Recovery OUTPUT: derby.rawStoreDaemon thread needs to wait for container recovery: 0

          showing that this part of the multi-threading synchronization is being exercised: the user thread that was interrupted is doing recovery on the channel, so the
          thread printing this (i.e. RawStoreDaemon) has to wait for recovery to complete before it can move on.

          Lily Wei added a comment -

          Hi Dag:
          Thank you so much for all the detailed messages. The code looks good to me; I don't see a potential issue. When I run an insane build, I get 5.8 seconds for MicroAPITest and 51.x seconds for a sane build, vs 5.07s (insane) and 47.xs (sane) on clean trunk. It could be just a Windows platform issue.
          +1 to check in to trunk.

          Dag H. Wanvik added a comment - - edited

          Hi Lily! Thanks! I'll commit it soon after I have verified a debug trace for observing the effects of the giveUpIO logic.

          Dag H. Wanvik added a comment -

          Uploading b-04, which cleans up and adds some more debug tracing to RAFContainer4.

          Dag H. Wanvik added a comment -

          Committed b-04 as svn 1038440. Note: I changed the debug trace flag to "RAF4" to accommodate an existing trace not related to recovery.

          Dag H. Wanvik added a comment - - edited

          This patch (derby-4741-c-01-nio) closes two corner cases I have
          observed when stress testing the RAFContainer4 recovery mechanism. It
          does some other small cleanups. Regressions ran OK.

          RAFContainer:

          If we receive an interrupt when the container is first being opened
          (i.e. during RAFContainer.run (OPEN_CONTAINER_ACTION) ->
          getEmbryonicPage), recovery will fail because currentIdentity needed
          in RAFContainer4#recoverContainerAfterInterrupt hasn't yet been
          set.

          RAFContainer4:

          If a stealthMode read is interrupted and is recovering the container,
          it erroneously increments threadsInPageIO just before exiting to retry
          IO. This leads to a break in the invariant that threadsInPageIO be 0
          when all threads are done, causing an issue (a hang) down the line.

          If, when we are reopening the container, the read being done during
          that operation (getEmbryonicPage) is itself interrupted, that stealth
          mode read will also lead to a (recursive) recovery. We have to catch
          this case by adding a "catch (InterruptDetectedException e)" just
          after the call to openContainer, not by testing the interrupt flag as
          presently done, since the recovery inside the recursive call to
          getEmbryonicPage/readPage will already have cleared the flag and done the
          recovery.
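
          As a sketch (hypothetical and simplified, not the actual patch hunk), the recovery path therefore ends up shaped roughly like:

          try {
              closeContainer();
              openContainer(currentIdentity); // may recurse into getEmbryonicPage/readPage
          } catch (InterruptDetectedException e) {
              // recovery already done inside the recursive readPage; just retry
          }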

          When giving up reopening the container for some reason, we also forgot
          to decrement threadsInPageIO.

          To guard against other hangs, I will make the while-true loops max out
          in all cases. But before I commit that change, it would be nice to see
          if this patch has any impact on DERBY-4920 (I suspect not). The reason
          I'd like to hold off with that is that since DERBY-4920 occurs during
          shutdown, that may mask the fact that recovery failed, cf. my comment
          https://issues.apache.org/jira/browse/DERBY-4920?focusedCommentId=12936016&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12936016
          So, I'd rather wait with that till I get DERBY-4920 out of the way.

          Dag H. Wanvik added a comment -

          Committed derby-4741-c-01-nio as svn 1040086. Please let me know if you still see the hang of DERBY-4920 and, if possible, during execution of which test it happens.

          Kristian Waagan added a comment -

          Happened to look at parts of the new code, and have a few questions/comments on code in RAFContainer4:
          o in writeFull the variables beforeOpen and beforeInterrupt are unused. Can they be removed?
          o the same issue in readFull.
          o would it make sense to make the monitor object channelCleanupMonitor final?
          o redundant call to Thread.currentThread when invoking holdsLock.
          o missing JavaDoc for recoverContainerAfterInterrupt

          A more theoretical question at the end.
          Is there a reason to explicitly define instance variables with the default value for the data type?
          I.e., int i = 0, boolean b = false.
          Just asking, as I find it annoying to step through these in the debugger

          Dag H. Wanvik added a comment - - edited

          Thanks for looking at the new code, Kristian, I'll make a new patch to cater for your comments.

          > o in writeFull the variables beforeOpen and beforeInterrupt are unused. Can they be removed?
          > o the same issue in readFull.

          Absolutely, debugging cruft. Thanks for spotting those.

          > o would it make sense to make the monitor object channelCleanupMonitor final?

          Absolutely.

          > o redundant call to Thread.currentThread when invoking holdsLock.

          Thanks, thought I had cleaned out those, but there were two left!

          > o missing JavaDoc for recoverContainerAfterInterrupt

          Added Javadoc for recoverContainerAfterInterrupt and awaitRestoreChannel.

          > A more theoretical question at the end. Is there a reason to
          > explicitly define instance variables with the default value for the
          > data type? I.e., int i = 0, boolean b = false.

          No. I tend to use explicit initialization often, due to my paranoid
          streak I guess. I see the data flow analysis barks if you use a local
          variable uninitialized, although not for a class member; it is not
          clear to me why Java differentiates here.

          public class Foo {
              public boolean bb;

              public static void main(String[] args) {
                  boolean b;

                  if (args.length > 0 && args[0].equals("foo")) {
                      b = true;
                  }

                  if (b) { // compiler barks
                      System.err.println("b is true");
                  }

                  Foo f = new Foo();

                  if (f.bb) { // OK
                      System.err.println("bb is true");
                  }
              }
          }

          I guess I have tended to treat them the same. I'll try to improve

          Dag H. Wanvik added a comment -

          Uploading derby-4741-kristians-01, running regressions.

          Kristian Waagan added a comment -

          Thanks, Dag.

          > I guess I have tended to treat them the same. I'll try to improve

          Not sure there is something to improve here, I guess this is a question about personal preference.
          Explicit initialization of instance variables used to result in two assignments (first the default value for the type, and then the explicit value), which is kind of unnecessary if both assignments use the same value. I don't know if this is still the case.

And I have become quite accustomed to using "Step over" or "Run to cursor" in the debugger.
          Finally, if the initial value isn't the default for the data type, then you have to assign the correct value anyway - either in the constructor or as an instance variable initializer (i.e. private int i = 3).
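To make the point concrete, here is a minimal sketch (a hypothetical class, not Derby code) contrasting the two styles; whether the explicit initializers still cost an extra assignment on top of the JVM's default zeroing depends on the compiler, as noted above:

public class InitStyles {
    static class DefaultInit {
        int i;               // relies on the JVM default value (0)
        boolean b;           // relies on the JVM default value (false)
    }

    static class ExplicitInit {
        int i = 0;           // explicit initializer; traditionally compiled as an
        boolean b = false;   // assignment in the constructor, even though the
                             // value equals the default
    }

    public static void main(String[] args) {
        // Both variants observe the same values; the difference, if any,
        // is only in the generated code.
        System.out.println(new DefaultInit().i + " " + new ExplicitInit().b);
    }
}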

          Thanks for working on the interrupt issue! I think it is of great value for Derby to be able to handle interrupts better. This is especially important when/if other software products used together with Derby use Thread.interrupt() internally.

          Dag H. Wanvik added a comment - - edited

          While I am looking for the Heisenbug in DERBY-4920, I am considering the next increment of this patch from the
          prototype patch regarding the issues seen when interrupt hits the logging machinery (mostly not NIO, except for the case of DirRandomAccessFile#sync).
          Now, it turns out that the vulnerability here only applies for Solaris. Other platforms' "non-NIO" IO is not interruptible, so for those other platforms there is no need to do anything special. For Solaris, the behavior can be tweaked to be impervious to interrupts as well using the special switch: -XX:-UseVMInterruptibleIO on JVMs >= 1.5.0_12. This behavior will be default in 1.7 also on Solaris[1].
          So the question becomes, would it be sufficient to advise users to run with this switch set, or should we make Derby tolerate
          the interrupts on Solaris here also?

I do think we should be able to run correctly with the default settings on all relevant platforms, but I would appreciate your opinions.

          [1] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4385444

          Dag H. Wanvik added a comment -

          Committed derby-4741-kristians-01 as svn 1043802 (suitesAll ran ok).

          Knut Anders Hatlen added a comment -

          > So the question becomes, would it be sufficient to advise users to run with this switch set, or should we make Derby tolerate
          > the interrupts on Solaris here also?

          How much code do you think will be needed to handle that case? Handling interrupts sent to the engine is somewhat of an edge case in itself. Since this particular problem only affects one platform, and that platform will start behaving the same way as the other platforms in the not so distant future, it sounds acceptable to just advise those who come across the problem to set the flag. Of course it would be nice if Derby was well-behaved by default on all platforms, but given the easy workaround and the expectation that the problem will soon go away by itself, I'd say that it's probably not worth putting a lot of work into and/or adding more complexity to already complex code.

          Dag H. Wanvik added a comment -

          Thanks for your thoughts on this, Knut.

          Here is another issue: The NIO Channel#force call is used to implement sync when we flush the Derby log, cf. this call:

          LogToFile#flush -> LogAccessFile#syncLogAccessFile -> StorageRandomAccessFile#sync -> getChannel().force(false)

          The "false" here allows the skipping of flushing metadata, which makes this faster (at least on Solaris) than the non-NIO
          implementation of StorageRandomAccessFile#sync, which uses getFD().sync().

          Unfortunately, an interrupt will make the getChannel().force call throw ClosedChannelException, and I can't see an easy way
          to recover from this, except assuming the log is corrupt and shutting down the database instance, so we can recover on reboot.

          If we switch back to using "getFD().sync()", we would lose the performance optimization, but be resilient to interrupts.

So, unless somebody can see a way to finesse this problem, should we offer a knob to allow a user to trade this performance optimization for interrupt resilience?

          Aside: we also use StorageRandomAccessFile#sync in many other places, but in those situations I find I am able to recover by closing the RAF, reopening and writing over again, since the data to be written is known, e.g. in LogToFile#initLogFile.
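For illustration only (these are not Derby's actual StorageRandomAccessFile implementations), the two sync variants under discussion differ roughly as follows:

import java.io.IOException;
import java.io.RandomAccessFile;

public class SyncVariants {
    // NIO variant: force(false) may skip flushing file metadata, which can be
    // faster, but an interrupt closes the channel and later use surfaces as
    // ClosedByInterruptException/ClosedChannelException.
    static void nioSync(RandomAccessFile raf) throws IOException {
        raf.getChannel().force(false);
    }

    // Non-NIO variant: FileDescriptor.sync() flushes data and metadata and is
    // not interruptible on most platforms, so it tolerates Thread.interrupt()
    // at the cost of some performance.
    static void classicSync(RandomAccessFile raf) throws IOException {
        raf.getFD().sync();
    }
}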

          Dag H. Wanvik added a comment - - edited

          I find that explicit file sync (i.e. not "write sync") is only used if:

          a) NIO is not supported (cf. WritableStorageFactory#supportsWriteSync), in which case we never use FileChannel#force
          b) derby.storage.fileSyncTransactionLog is set to true
c) we are running on platforms/VMs that don't support opening a file with mask "rws" (write sync with metadata) and "rwd" (write sync without metadata), cf. the check in LogToFile#openLogFileInWriteMode, which calls checkJvmSyncError to determine this. The platforms mentioned are, e.g., early versions of 1.4.2 and 1.5 on Mac OS and FreeBSD.

If this holds, it seems to me we can just skip using the NIO version of RAF#sync for those platforms. Other platforms do not use the explicit file sync for the log, cf. the test in LogToFile#flush (snip):

if (isWriteSynced) {
    // LogAccessFile.flushDirtyBuffers() will allow only one write
    // sync at a time, flush requests will get queued
    logOut.flushDirtyBuffers();
} else {
    if (!logNotSynced)
        logOut.syncLogAccessFile();
}

On most platforms, isWriteSynced is true (cf. above), meaning syncLogAccessFile is not used, so reverting to an interrupt-safe, less performant non-NIO sync in syncLogAccessFile seems acceptable to me. Opinions?

          Mike Matrigali added a comment -

          I agree with the tradeoff you suggest concerning syncing the log file. It seems to make the code easier and will not really affect most users running on modern platforms while still allowing those few to still work correctly on back releases. This especially seems to make sense as I don't see us backporting this change so should only affect 10.8.

          Dag H. Wanvik added a comment -

          Thanks, Mike. Logged and fixed DERBY-4963 to revert NIO version of RAF#sync.

          Dag H. Wanvik added a comment -

          Committed a followup patch as svn 1057733. Quoting the commit entry:

          > Followup to patch derby-4741-b-04-nio committed as svn 1038440:
          > changed the string internal exception string "nospc.U" to "intrp.U".

          Dag H. Wanvik added a comment - - edited

          Uploading a patch derby-4741-sleeps-waits-1 which adds handling of some cases of interrupted monitor wait calls and Thread.sleep calls which I found needed changes to make my preliminary tests work. Running regressions.

There are remaining places (i.e. usages of wait & sleep) in the code base which need to be inspected for modification. I will be making one or more follow-up patches to handle those. With the current patch this work is approaching a state where Derby embedded(**) is sufficiently robust against interrupts that I will move to building systematic as well as randomized tests, and plug remaining holes as we find them.

          After commits of DERBY-4967 and DERBY-4968
          (**) the server code needs consideration too, as well as the client code and replication.

          Dag H. Wanvik added a comment -

          suites.All passed.

          Dag H. Wanvik added a comment - - edited

          Some details on where patch derby-4741-sleeps-waits-1 changes things.

At the following locations, where we earlier threw 08000 (CONN_INTERRUPT),
we now just note that an interrupt occurred:

• GenericStatement#prepare:
  • while waiting for another thread to complete the compilation of
    the statement
• LogToFile#checkpointWithTran:
  • while waiting for another thread to complete a checkpoint
• LogToFile#switchLogFile:
  • while waiting for a semaphore needed to make sure the log switch
    can complete "without having to give up the semaphore to a backup
    or another flusher"
• LogToFile#flush:
  • while waiting for the database to stop being frozen
  • while waiting for another thread flushing
• LogToFile#syncFile:
  • while waiting a bit before retrying a failed sync call
• LogToFile#startLogBackup:
  • while waiting for a thread doing checkpointing before we start
    backup of the log as part of online backup

The following locations previously silently swallowed the interrupt; we
now (also) note that an interrupt occurred:

• LogToFile#recover:
  • in slave replication mode, while waiting for fail-over (this code
    swallowed the interrupt)
• LogToFile#getLogFileAtBeginning:
  • in slave replication mode, while waiting for SlaveFactory to tell
    the thread to do recovery (this code swallowed the interrupt)
• BasePage#setExclusive:
  • while waiting for an exclusive latch
  • while waiting for the cleaner to finish
• BasePage#setExclusiveNoWait:
  • while waiting for the cleaner to finish
          Lily Wei added a comment -

          #syncFile

          • while waiting a bit before retrying a failed sync call
Should the comment be changed a little in terms of the "does not matter..." wording?

I ran suites.All against the IBM 1.6 JVM along with the LockInterruptTest.java (derby-4967-locking-2.diff) change; the tests ran clean.

          Dag H. Wanvik added a comment -

          Thanks for test running the two patches, Lily. Good to know these run on the IBM VM as well!
As for the comment, agreed. I have removed it in my next version of this patch, because it no longer seems to carry important information: earlier practice was to always throw, so the comment highlighted "unusual" treatment. Now we mostly do not throw, just make a note, so this code is no longer unusual as I see it.

          Dag H. Wanvik added a comment -

          Uploading a second version of derby-4741-sleeps-waits, rerunning regressions.

          Details on where patch derby-4741-sleeps-waits-2 adds things relative to version 1:

The following locations previously silently swallowed the interrupt; we now (also) note that an interrupt occurred:

• LogAccessFile#flushDirtyBuffers:
  • while waiting for another thread flushing
• LogAccessFile#syncLogAccessFile:
  • while sleeping a bit before retrying after a failed sync

With the above, all the waits and sleeps in the package
org.apache.derby.impl.store.raw.log have been made safe.

At the following locations, where we earlier threw 08000 (CONN_INTERRUPT), we now just note that an interrupt occurred:

• XactFactory#blockBackup:
  • while doing backup blocking operations, we wait for the backup to
    finish
• XactFactory#blockBackupBlockingOperations:
  • while waiting for all backup blocking operations to complete
• BaseDataFileFactory#freezePersistentStore:
  • while waiting for writes in progress to finish before freezing
• BaseDataFileFactory#writeInProgress:
  • while waiting for db thaw before we can start writing
• CachedPage#clean:
  • while waiting for another thread cleaning it
  • while waiting for the state to become unlatched when attempting to
    clean it
• AsynchronousLogShipper#run:
  • if interrupted during the shipping interval wait; shipping sooner
    doesn't hurt
• AsynchronousLogShipper#forceFlush:
  • while waiting a bit after having flushed the master, to ship log
    records to the slave to free up buffer space
• ReplicationMessageReceive#isConnectedToMaster:
  • while waiting for the result from the ping thread (the "pong");
    since we may have been interrupted before the result is ready, we
    try again unless the connection is confirmed up
• ReplicationMessageReceive.SlavePingThread#run:
  • while waiting for a wakeup to do a ping; I had to introduce an
    extra state variable here to decide whether, after we receive an
    interrupt, a ping request has also been lodged
          Knut Anders Hatlen added a comment -

          Second revision of sleeps-waits looks good to me. A couple of questions:

          XactFactory.blockBackupBlockingOperations(): Remove RuntimeException from throws clause? The explanation added to the javadoc is probably enough for documentation purposes.

          RAFContainer4.java:

          • Why the change from noteAndClearInterrupt() to setInterrupted()?
          • Whitespace diff @@ -825,7 +819,7 @@ looks unintended

          ReplicationMessageReceive.java: The while loops are only retried if the wait() calls are interrupted, and not if the wait() calls experience spurious wakeups. Any reason for that? (Probably still an improvement compared to the original code, which didn't retry the wait call in any event.)

          Dag H. Wanvik added a comment -

          > XactFactory.blockBackupBlockingOperations(): Remove RuntimeException from throws clause? The explanation added to the javadoc is probably enough for documentation purposes.

          I'm ok with that, I was torn on this.

          > RAFContainer4.java: Why the change from noteAndClearInterrupt() to setInterrupted()?

          Just simplification, since in these two locations we know that an interrupt actually did happen, there is no need to test for the flag as noteAndClearInterrupt does.

          > - Whitespace diff @@ -825,7 +819,7 @@ looks unintended

          Indeed, thanks.

          >
          > ReplicationMessageReceive.java: The while loops are only retried if the wait() calls are interrupted, and not if the wait() calls experience spurious wakeups. Any reason for that? (Probably still an improvement compared to the original code, which didn't retry the wait call in any event.)

Yes, I agree the original code is deficient in this respect; I have changed the code to handle this as well.
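A minimal sketch of the pattern being discussed (hypothetical names, not the actual ReplicationMessageReceive code): loop on the condition itself so that interrupts and spurious wakeups both lead to re-checking, and record the interrupt instead of abandoning the wait:

public class WaitPattern {
    private final Object monitor = new Object();
    private boolean ready;   // the condition the waiter is interested in

    void awaitReady() {
        boolean wasInterrupted = false;
        synchronized (monitor) {
            // Re-check the condition after every wakeup, whether it came from
            // notify(), an interrupt, or a spurious wakeup.
            while (!ready) {
                try {
                    monitor.wait();
                } catch (InterruptedException ie) {
                    wasInterrupted = true;   // note the interrupt and keep waiting
                }
            }
        }
        if (wasInterrupted) {
            Thread.currentThread().interrupt();   // restore the flag for the caller
        }
    }

    void signalReady() {
        synchronized (monitor) {
            ready = true;
            monitor.notifyAll();
        }
    }
}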

          Dag H. Wanvik added a comment -

          Uploading rev #3 of sleeps-waits, rerunning regressions.

          Dag H. Wanvik added a comment -

          Committed rev #3 as svn 1061516. This completes the cleanup of interrupt handling in Object#wait and Thread.sleep inside store.

          Myrna van Lunteren added a comment -

          I think the latest check-in has caused the following tinderbox failure:

          derbyall/storeall/storeall.fail:unit/T_RawStoreFactory.unit

          see: http://dbtg.foundry.sun.com/derby/test/tinderbox_trunk16/jvm1.6/testing/testlog/SunOS-5.10_i86pc-i386/1061516-derbyall_diff.txt:

                          • Diff file derbyall/storeall/storeunit/T_RawStoreFactory.diff
              • Start: T_RawStoreFactory jdk1.6.0_18 storeall:storeunit 2011-01-20 23:22:23 ***
                2 del
                < – Unit Test T_RawStoreFactory finished
                2 add
                > ran out of time
                > Exit due to time bomb
                Test Failed.
              • End: T_RawStoreFactory jdk1.6.0_18 storeall:storeunit 2011-01-21 00:22:54 ***

          It failed in the nightly runs with ibm 1.6 also (and 1.4.2 and 1.5).
          Do we need a separate jira?

          Dag H. Wanvik added a comment -

          Uploading *-raf-stresstest-1, which adds a multi-threaded read/write test under an interrupt shower.
          This exercises primarily the random access file recovery (RAFContainer4#recoverContainerAfterInterrupt), but since the interrupt can arrive at any time during query execution, other parts of the code are also exposed.

          The new test case is InterruptResilienceTest#testRAFReadWriteMultipleThreads.

          Running regressions.

The patch is not for commit yet, since I worry that the test takes too long to be included in the regression suite. During development I had to let it run this long to expose all the corner cases I have seen, but adding 250 seconds may be too much. I could comment out the client/server suite, since it doesn't add much value: the interrupts all happen on the server side. That would cut the time down to about half. Also, the test may not be entirely reliable yet, since there are remaining parts of the code yet to be vetted wrt interrupt handling, but it runs reliably in my environment. I will at least run it on some more platforms before I suggest any commit.

          Please note that due to DERBY-4431, the IBM VMs will not yet be covered by these additional tests.
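As an aside, here is a minimal sketch (hypothetical, not the actual InterruptResilienceTest code) of the kind of "interrupt shower" such a test applies to its worker threads:

class InterruptShower extends Thread {
    private final Thread[] workers;
    private volatile boolean stop;

    InterruptShower(Thread[] workers) {
        this.workers = workers;
    }

    @Override
    public void run() {
        while (!stop) {
            for (Thread w : workers) {
                w.interrupt();          // workers must survive interrupts at arbitrary points
            }
            try {
                Thread.sleep(10);       // small pause between showers
            } catch (InterruptedException e) {
                // ignore; this thread is the one doing the interrupting
            }
        }
    }

    void shutdown() {
        stop = true;
    }
}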

          Dag H. Wanvik added a comment - - edited

          Myrna, I will look into it. Filed DERBY-4982 for this.

          Dag H. Wanvik added a comment -

Regressions ran for *-raf-stresstest-1, and I verified that the modified InterruptResilienceTest works on Windows as well. For some reason it executed much more quickly under Windows, ca. 50 seconds (the 250 reported above was on Solaris). I'll see if I can figure out why.

          Knut Anders Hatlen added a comment -

          Perhaps the difference is whether or not the OS will force changes to disk when doing synchronous writes?

          Knut Anders Hatlen added a comment -

          The use of the static field thisTest may have some consequences that are not entirely obvious:

          The field is set when the test suite is created, so when the tests start running it will always point to the last test case that was created, not necessarily to the running test case. That is, it will always point to a client test case. Because of this, one may be led to believe that the test that calls thisTest.openDefaultConnection() always uses client connections. However, BaseJDBCTestCase.openDefaultConnection() calls TestConfiguration.getCurrent() to get the current test configuration, so the connection type should be chosen dynamically based on the active decorators. But this assumption is also not quite true. The decorators only affect the TestConfiguration for the main thread, so when running inside a stored procedure in a server thread, the decorators will have no effect, and an embedded connection will be opened even if the test case is actually running within a client/server decorator.

          I think the intention is to always use embedded connections inside the stored procedure, and that's also what the code does (so, as mentioned in an earlier comment, there's probably not much point in running this test in client/server mode?). But it takes some effort to convince oneself that that's what it actually does. Do you think it's possible to make it more explicit that the worker threads always use embedded connections?

          And some little nits:

+ if (w.e != null) {
+     fail("WorkerThread " + i + " saw exception " + w.e);
+ }

          Use the fail() method that takes an exception to preserve the stack trace?

          + ResultSet rs = s.executeQuery("select count from mtTab");
          + rs.next();
          + assertEquals("too few rows inserted",
          + rs.getLong(1), NO_OF_THREADS * NO_OF_MT_OPS);
          + rs.close();

          How do you know it must be too few and not too many? JDBC.assertSingleValueResultSet() could simplify this code.
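For reference, the suggested simplification might look roughly like this (a sketch, assuming the test utility JDBC.assertSingleValueResultSet compares the single result value against its string representation):

+ ResultSet rs = s.executeQuery("select count(*) from mtTab");
+ JDBC.assertSingleValueResultSet(rs,
+     Long.toString(NO_OF_THREADS * NO_OF_MT_OPS));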

          Dag H. Wanvik added a comment -

          Thanks for looking at this, Knut!
You are right, it got a bit convoluted. The intention is, as you suggest, to always use embedded connections inside the stored procedure.
I'll see if I can simplify this a bit!

I always agonize about whether to include c/s tests or not when the changes are seemingly embedded/server-side only; maybe we should try to establish some criteria to help us decide when to, and when not to, do that.

Thanks also for spotting the nits. I converted this from a stand-alone test outside JUnit, so it can be further simplified as you suggest.

          Dag H. Wanvik added a comment -

          Uploaded *raf_stresstest-2 which supersedes #1. In addition to fixing Knut's nits,
          the server side code now uses a thread's default test configuration directly, and I also added a comment explaining this. I think I will commit the patch with the long running time for now. I'd rather flush out any bugs here sooner rather than later.

          Knut Anders Hatlen added a comment -

          > I think I will commit the patch with the long running time for now. I'd rather flush out any bugs here sooner rather than later.

          Sounds good. +1

          > I always agonize about whether to include c/s tests or not when the changes are seemingly embedded/server-side only

          I don't have a general rule, but in this particular case it looks like we run the entire test inside a stored procedure in order to make sure that only embedded is tested. Sounds like a good indication that embedded only would suffice... We could probably simplify the test somewhat and remove the stored procedure indirection too if we made it embedded only.

          Dag H. Wanvik added a comment -

Uploading a revised version #3 which makes use of derby.system.durability=test to cut down on the running time on Solaris; I found it still gets to the corner cases.

          Dag H. Wanvik added a comment -

          Uploading version #4 (derby-4741-raf-stresstest-4). This lets the test run with its own database since we use durability=test: just to be on the safe side, so other tests don't get impacted by the fact that the database is thus marked, cf. message in derby.log:

          "WARNING: The database is booted with derby.system.durability=test. In this mode, it is possible that database may not be able to recover, committed transactions may be lost, database may be in an inconsistent state. Please use this mode only when these consequences are acceptable" (sic: no final period)

I had to add a Class.forName to DriverManagerConnector#getConnectionByAttributes and a new public method BaseJDBCTestCase#openDefaultConnection(TestConfiguration). The latter makes it possible to use the main thread's test configuration in the server threads, cf. the "thisConfig" member in InterruptResilienceTest.

          I also added derby.infolog.append=true to SystemPropertyTestSetup, so the c/s run doesn't clobber the embedded run's derby.log.

          Rerunning regressions.

          Dag H. Wanvik added a comment -

          Regressions ran ok, committed derby-4741-raf-stresstest-4 as svn 1064174.

          Dag H. Wanvik added a comment -

          Uploading a functional specification for this work to summarize what we have so far: "interrupts-fs.html"

          Dag H. Wanvik added a comment -

          Uploading a patch, *-testQueryInterrupt, which:

          • adds a new test case: InterruptResilienceTest#testLongQueryInterrupt

which tests that a query will check for the interrupt flag and throw 08000 (CONN_INTERRUPT) at the same time it checks for query time-out.

          • adds a missing test in InterruptStatus#throwIf

          I also adjusted an existing test (for RAF recovery) to handle the case that we could see 08000 (CONN_INTERRUPT) when performing a query as part of that test, depending on when the interrupt happens.

          Regressions passed.

          Dag H. Wanvik added a comment -

          Uploading a new version of the draft functional specification, version 0.2. Please have a look!

          Dag H. Wanvik added a comment -

          Uploading another patch *-testBatchInterrupt, which adds another fixture to InterruptResilienceTest: #testInterruptBatch, rerunning regressions.

          It tests that an interrupt will stop a batch of statements by throwing just before we execute the next statement in the batch.

Note that this patch also includes all of *-testQueryInterrupt for logistical reasons (that patch is not yet committed), but I will post a slimmed-down version of this as soon as that one is in.

          Dag H. Wanvik added a comment -

          Committed patch derby-4741-testQueryInterrupt as svn 1066701.

          Dag H. Wanvik added a comment -

Uploading a slimmed-down version of *-testBatchInterrupt: derby-4741-testBatchInterrupt-b.

          Dag H. Wanvik added a comment -

          Committed derby-4741-testBatchInterrupt-b as svn 1066707.

          Knut Anders Hatlen added a comment -

          Thanks for writing up the functional spec, Dag. I found that it clearly explained the behaviour, and as far as I could see it matched what was actually implemented. It doesn't mention anything about how to handle the Solaris interruptible I/O issue on JVMs later than 1.6 (in which case there won't be an issue, right?), but that's probably intentional since 1.7 is still only available as early access?

          Dag H. Wanvik added a comment -

Yes, you are right, Knut. I did omit reference to the 1.7 behavior on Solaris, but it might be good to include it in any case for completeness. After all, Java 1.7 should be available before too long. Thanks for reviewing the specification.

          Rick Hillegas added a comment -

          Hi Dag,

          Thanks for the functional spec. I am trying to wrap my mind around what application writers can expect. Is the following a fair summary?

          1) If the Thread is interrupted while running outside Derby (that is, not inside a JDBC call), then Derby does not field the interrupt. The Connection remains alive and Thread.isInterrupted() will not be touched by Derby.

          2) On the other hand, if the Thread is interrupted while running inside a Derby JDBC call...

          a) The Connection may be killed and have to be restarted. If the Connection is killed, then Thread.isInterrupted() will be true.

          b) Alternatively, it is also possible that the interrupt will not kill the Connection. In this case, the state of Thread.isInterrupted() is unclear to me. I am having a hard time reconciling "will be still be set...when exiting from the Derby API method" with "it is not deterministic whether the flag will remain set or be cleared on return".

          Thanks!

          Dag H. Wanvik added a comment -

          Thanks for reading the spec, Rick.

1) Not true. If the flag is set at the time of the Derby API call, Derby may throw. Derby can't know exactly when the thread was interrupted; it can only notice it by either a) seeing Java API methods throw exceptions, or b) testing the flag explicitly. We do not test the flag on Derby API method entry. We heed (i.e. throw on) interrupts in the situations enumerated in the spec.

2b) The flag will remain set when exiting, in all cases. The note about determinism only applies if another thread attempts to clear the flag while the Derby API call is still executing. If we ever saw an interrupt and had to clear it to make execution continue, e.g. for NIO, we set it again at Derby API method exit, whether we throw or not. This means that any attempt to clear it while Derby is executing could be lost, depending on exactly when the clearing attempt was made, hence "non-deterministic".

          Rick Hillegas added a comment -

          Thanks, Dag. Some follow-on questions:

          1) Suppose my application has two threads T1 and T2, each with its own Connection. T1 is busy interacting with the user and is NOT stalled inside a JDBC call. T2 is running inside a JDBC call. Now T2 is interrupted. Can this kill T1's Connection?

          2) Does this give rise to any advice for application writers?

          Thanks,
          -Rick

          Dag H. Wanvik added a comment -

          Hi Rick,
1) The answer is no. Only T2 risks seeing 08000. Nor would T1 risk it if it were inside a JDBC call, since that thread is not being interrupted.
          2) Applications that expect to interrupt threads while they are executing JDBC calls should be prepared to receive a session level error with SQLState 08000, and so would need to reconnect in order to get more work done.

          Rick Hillegas added a comment -

          Thanks, Dag. Here is another attempt to summarize the user-visible behavior after these improvements.

          If an application experiences Thread interrupts, its designer should plan for the following behavior:

          1) If a Thread is interrupted inside a Derby JDBC call, then the Connection experiencing the interrupt MAY be terminated. If the Connection is terminated, the application will experience the following consequences:

          a) The JDBC call will raise a 08000 (CONN_INTERRUPT) exception.

          b) Outstanding transactional work on that Connection will be rolled back and all of its locks will be released.

          c) The Connection will not execute any further JDBC calls.

          d) On return from the interrupted JDBC call, the Thread's isInterrupted() method will report "true".

          2) All other Connections will remain active. This includes other Connections which the interrupted Thread may be using.

          3) Application designers are advised to catch these 08000 exceptions, discard the dead Connection, and restart their transaction in a new Connection.

          Dag H. Wanvik added a comment -

          > 1) If a Thread is interrupted inside a Derby JDBC call, then the Connection experiencing the interrupt MAY be terminated. If the Connection is terminated, the application will experience the following consequences:

          To be even more precise: "If a Thread is interrupted and the flag is not cleared before entering a Derby JDBC call or the thread is interrupted while inside a Derby JDBC call, then the Connection experiencing the interrupt MAY be terminated. If the Connection is terminated, the application will experience the following consequences:"

          > a) The JDBC call will raise a 08000 (CONN_INTERRUPT) exception.

          Yes

          >
          > b) Outstanding transactional work on that Connection will be rolled back and all of its locks will be released.

          Yes

          > c) The Connection will not execute any further JDBC calls.

          Yes

          >
          > d) On return from the interrupted JDBC call, the Thread's isInterrupted() method will report "true".

          Yes.

          If Derby does not throw, the Thread's isInterrupted() method will also report "true".

          > 2) All other Connections will remain active. This includes other Connections which the interrupted Thread may be using.

          Yes, at least until such time as the thread, still having its interrupted flag set, tries to use another connection, in which case, this connection is also subject to termination.

          >
          > 3) Application designers are advised to catch these 08000 exceptions, discard the dead Connection, and restart their transaction in a new Connection.

          And clear the interrupted flag of the thread before proceeding with another Derby connection, preferably!

          The "MAY" presently means:
          if a query fetches rows from a base table after the interrupt has occurred, or
          a new element in a batched statement is attempted executed after the interrupt has occurred, or
          an interrupt is received while waiting for a lock.

          In other cases, it is ignored.
          .
          Note the slight difference above, between after and while. I have clarified this in a new revision.
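
          To illustrate the advice in 3) and the note about clearing the interrupt flag, a minimal application-side sketch could look like the following. The connection URL, table name and helper method are made-up assumptions; only the 08000 handling reflects the behavior described here.

          import java.sql.Connection;
          import java.sql.DriverManager;
          import java.sql.SQLException;
          import java.sql.Statement;

          public class InterruptAwareWorker {
              // Assumed database URL; replace with the application's own.
              private static final String URL = "jdbc:derby:sampleDB";

              void runWork() throws SQLException {
                  Connection conn = DriverManager.getConnection(URL);
                  try {
                      doTransaction(conn);
                  } catch (SQLException e) {
                      if ("08000".equals(e.getSQLState())) {
                          // The interrupt terminated this connection; the
                          // transaction is already rolled back and its locks
                          // released, so discard it, clear the flag, reconnect.
                          conn.close();
                          Thread.interrupted();            // clear the interrupt flag
                          conn = DriverManager.getConnection(URL);
                          doTransaction(conn);             // retry the work once
                      } else {
                          throw e;
                      }
                  } finally {
                      conn.close();
                  }
              }

              // Made-up placeholder for the application's real transaction.
              private void doTransaction(Connection conn) throws SQLException {
                  Statement s = conn.createStatement();
                  try {
                      s.executeUpdate("UPDATE app_table SET c = c + 1");
                  } finally {
                      s.close();
                  }
              }
          }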

          Dag H. Wanvik added a comment -

          Uploading a new version of the func spec.

          Rick Hillegas added a comment -

          Thanks for the extra clarification, Dag. That will be useful for programmers who want to understand the code. I think we have a clear enough picture of the user experience that we can document this feature now. Thanks.

          Mike Matrigali added a comment -

          triaged for 10.8.

          A lot of progress has been made on this issue, and it looks ready for 10.8. Work is being done incrementally; we should probably soon close this one based on the checked-in work and log a new issue to track possible work post 10.8.

          Dag H. Wanvik added a comment -

          Uploading patch derby-4741-sleeps-waits-more. This patch contains the
          rest of the code locations containing calls to Object#wait and
          Thread#sleep that I intend to make interrupt proof for this issue. It
          is not ready for commit, but please have a look. (I am still
          considering a few more, including occurrences in BasicDaemon.java.)

          Also uploading the lists of occurrences I used as input:
          wait-0-usages.txt, wait-1-usages.txt, sleep-1-usages.txt, annotated as
          follows: "v" fixed or not relevant (e.g. test code), "-" I will not
          change (not embedded Derby, cf. functional specification), and "?"
          still pondering what to do (BasicDaemon).

          Regressions ran ok, although it doesn't say much, since most of the
          changes only apply when interrupts happen.
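
          For readers skimming the patch, the general shape of the interrupt-proofing idiom (retry the wait, remember that an interrupt arrived, and re-assert the thread's interrupt flag afterwards) might look roughly like the sketch below. This is a generic illustration only, not Derby's actual code or API; the class name, the conditionMet() hook and the monitor parameter are made up.

          abstract class InterruptSafeWait {
              // Hypothetical condition supplied by the caller.
              protected abstract boolean conditionMet();

              final void awaitCondition(Object monitor) {
                  boolean wasInterrupted = false;
                  synchronized (monitor) {
                      while (!conditionMet()) {
                          try {
                              monitor.wait();
                          } catch (InterruptedException e) {
                              // Note the interrupt but keep waiting so the
                              // operation completes instead of aborting.
                              wasInterrupted = true;
                          }
                      }
                  }
                  if (wasInterrupted) {
                      // Re-assert the flag so the application (or a later
                      // Derby entry point) still sees the interrupt.
                      Thread.currentThread().interrupt();
                  }
              }
          }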

          Dag H. Wanvik added a comment -

          Uploading version 0.4 of the functional specification. Adds items to the list of unmodified behavior.

          Dag H. Wanvik added a comment -

          Dev guide:

          > Do not use interrupt calls to notify threads that are accessing a
          > database, because Derby will catch the interrupt call and close the
          > connection to the database. Use wait and notify calls instead.

          This advice still stands, I think. The modal verb "will" is slightly
          inaccurate here; it would be more correct to use "could" or
          "might". The functional specification spells out in detail when
          this can happen.

          Interrupts can now be used to stop long-running queries, batches and
          statements that wait for database locks. In all cases, the connection
          will be closed along with the exception seen (state 08000), cf. the specification.
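
          As an illustration of that use case, a controller thread stopping a long-running query might look roughly like the sketch below; the connection URL, query and five-second delay are made-up assumptions, and 08000 is the state described above.

          import java.sql.Connection;
          import java.sql.DriverManager;
          import java.sql.ResultSet;
          import java.sql.SQLException;
          import java.sql.Statement;

          public class StopLongQuery {
              public static void main(String[] args) throws Exception {
                  Thread worker = new Thread(new Runnable() {
                      public void run() {
                          try {
                              Connection conn =
                                  DriverManager.getConnection("jdbc:derby:sampleDB");
                              Statement s = conn.createStatement();
                              ResultSet rs =
                                  s.executeQuery("SELECT * FROM big_table");
                              while (rs.next()) {
                                  // process rows ...
                              }
                              conn.close();
                          } catch (SQLException e) {
                              if ("08000".equals(e.getSQLState())) {
                                  // Expected: the interrupt terminated the
                                  // connection while fetching rows.
                              } else {
                                  e.printStackTrace();
                              }
                          }
                      }
                  });
                  worker.start();
                  Thread.sleep(5000);    // let the query run for a while
                  worker.interrupt();    // worker's connection sees 08000
                  worker.join();
              }
          }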

          Kim Haase added a comment -

          Thanks! Does "long-running" apply to all three nouns ("queries, batches and statements that...") or just the first?

          Dag H. Wanvik added a comment -

          Committed derby-4741-sleeps-waits-more as svn 1073595:

          Commit message:
          Patch derby-4741-sleeps-waits-more, which "regularizes" a few more
          instances of interrupt handling to follow the idiom established in
          this issue's patches.

          This leaves a few instances in BasicDaemon.java (as far as embedded
          code is concerned), which will need more consideration. In any case,
          interrupting the daemon threads is less of a valid use case, I believe,
          i.e. Derby's ability to tolerate that is less crucial than tolerating
          interrupts to the user's connection threads.

          Dag H. Wanvik added a comment -

          (reposting my answer from email here, since it didn't seem to make it to the JIRA issue)

          Kim> Does "long-running" apply to all three nouns ("queries, batches and statements that...") or just the first?

          Good question. All three: I guess the definition of "long" here is relative to
          the application's expectations. The third one (locks) is bounded above
          by the lock timeout for a single lock grab, but that may not be the
          whole story, since the statement may need further locks to complete. So
          "longer than expected" is perhaps more to the point.

          Knut Anders Hatlen added a comment -

          The last commit (#1073595) seems to have introduced some dependency that breaks the build in the Tinderbox: http://dbtg.foundry.sun.com/derby/test/tinderbox_trunk16/UpdateInfo/1073598-buildDetails.txt

          I am able to reproduce it if I explicitly point the jsr169compile.classpath property to Foundation + JSR-169 jars. I'm not sure exactly what's causing it, but it looks like the iapi.types package now is compiled via the compile_mbeans target. That target uses compile.classpath, which is set to jsr169compile.classpath in build.xml, whereas the target that normally compiles SqlXmlUtil requires some extra XML libraries on the compile classpath.

          Dag H. Wanvik added a comment -

          I backed out a change to TopService.java until we understand how to make it build correctly in those circumstances.
          svn 1073846.

          Dag H. Wanvik added a comment - - edited

          Uploading a patch which reintroduces the fix that was backed out, with changes to the build. Essentially it moves the compilation of the MBeans till after the iapis have been built. Thanks to Knut for pinpointing the fact that compilation of the MBeans depends on Monitor, but currently happens before the iapi is compiled. Setting the sourcepath for javac to "" for mbeans-compile highlighted the issue, cf. this comment: https://issues.apache.org/jira/browse/DERBY-4845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12998479#comment-12998479

          Still, exactly why this makes it compile with explicit setting of jsr169compile.classpath now, whereas it did not before, is a bit unclear to me. It seems to work now, though. Running regressions.

          Lily Wei added a comment -

          It is not totally clear to me why the settings for the sourcepath and jsr169compile.classpath make the build work on all the platforms. More explanation would help people like me. :-
          Suites.all and derbyall run clean with the derby-4741-fix-compilation.diff patch on my Windows 7. +1 from me to commit. Cheers and Thanks!

          Dag H. Wanvik added a comment -

          Hi Lily, it is not clear to me either. The dependencies in our
          present build are partly hard-coded in the build.xml files, but that
          is not the whole story. It seems that if the compiler notices it lacks
          a class, it can check for it in the source path and compile it outside
          the normal order to fulfill the dependency. Setting the sourcepath to
          "" denies the compiler task this ability, forcing the hard-coded
          dependencies to be the unique determinants of compilation order. Knut
          did an experiment to do that for most compiles, but it quickly turned
          messy, as far as I understood.

          Cf this quote from the javac task doc (http://www.jajakarta.org/ant/ant-1.6.1/docs/en/manual/CoreTasks/javac.html):

          "If you wish to compile only files explicitly specified and disable
          javac's default searching mechanism then you can unset the sourcepath
          attribute:

          <javac sourcepath="" srcdir="${src}"
                 destdir="${build}">
            <include name="**/*.java"/>
            <exclude name="**/Example.java"/>
          </javac>

          That way the javac will compile all java source files under the "${src}"
          directory but skip the examples. The compiler will even produce errors
          if some of the non-example files refers to them."

          So far so good. I can't explain why setting jsr169compile.classpath
          explicitly made things break, though. I just fiddled with the placement
          of the compilation of the mbeans till it worked... But in general, it
          matters under which javac task a class is compiled in Derby, since we
          use so many tweaks for classpath handling to enable separate Java
          versions, jsr169, JDBC 3, 4... I think it would be good if we could
          move to having the dependencies better understood and encoded in the
          build files, cf. Knut's experiment, so we can untangle all these
          intricate dependencies and suffer less every time anything changes.

          Dag H. Wanvik added a comment -

          Committed 1074789 (derby-4741-fix-compilation). I don't plan more work on this feature now, resolving.

          Lily Wei added a comment -

          Thank you so much, Dag. I couldn't agree more. We should define the dependencies better, move toward finer-grained rules in the build files, and have an easier life every time we need to change things to make Derby better. A lot of things appear to be easy but take a lot of time once you get into all the details. :- The devil is always in the details.

          Show
          Lily Wei added a comment - Thank you so much, Dag. I couldn’t agree more. We should define the dependencies better and move toward to more fine rule in the build files and have easier life every time we need to change thing to make Derby better. A lot of things appear to be easy but take a lot of time to do once you get all the details. :- The devil is always in the detail.
          Hide
          Knut Anders Hatlen added a comment -

          [bulk update] Close all resolved issues that haven't been updated for more than one year.


            People

            • Assignee:
              Dag H. Wanvik
              Reporter:
              Dag H. Wanvik
            • Votes:
              1
              Watchers:
              2
