Derby
  1. Derby
  2. DERBY-5552

Derby threads hanging when using ClientXADataSource and a deadlock or lock timeout occurs

    Details

    • Urgency:
      Urgent
    • Issue & fix info:
      Patch Available, Repro attached
    • Bug behavior facts:
      Crash, Seen in production

      Description

      The issue arrives when multiple XA transactions are done in parallel and there is either a lock timeout or a lock deadlock detected. When this happens the connection is leaked in the Glassfish connection pool and the client thread hangs in "org.apache.derby.client.netReply.fill(Reply.java:172)".

      Shutting down the app server fails because the thread has a lock in "org.apache.derby.client.net.NetConnection40" and another task is calling "org.apache.derby.client.ClientPooledConnection.close(ClientPooledConnection.java:214)" which is waiting for the lock.

      Killing the appsever using "kill" and then attempting to shutdown Derby network server causes the Network Server to hang. One of the threads hangs waiting for a lock at "org.apache.derby.impl.drda.NeworkServerControlImpl.removeFromSessionTable(NetworkServerControlImpl.java:1525)" and the "main" thread has this locked at "org.apache.derby.impl.drda.NetworkServerControlImpl.executeWork(NetworkServerControlImpl.java:2242)" and it itself is waiting for a lock which belongs to a thread that is stuck at "org.apache.derby.impl.services.locks.ActiveLock.waitForGrant(ActiveLock.java:118) which is in the TIMED_WAITING state.

      Only by killing the Network Server using "kill" is possible at this point.

      There are transactions left even though all clients have been removed.

      1. appserverstack.txt
        63 kB
        Brett Bergquist
      2. client.tar.Z
        69 kB
        Brett Bergquist
      3. derby.log
        320 kB
        Brett Bergquist
      4. derby-5552_withexpanded_test_diff.txt
        9 kB
        Kathey Marsden
      5. derby-5552_withtest_diff.txt
        5 kB
        Kathey Marsden
      6. derby-5552_withtest_diff.txt
        5 kB
        Kathey Marsden
      7. DERBY-5552-p1.patch
        2 kB
        Brett Bergquist
      8. DERBY-5552-p2.patch
        0.5 kB
        Brett Bergquist
      9. derbystackatshutdown.txt
        29 kB
        Brett Bergquist
      10. execute.patch
        1 kB
        Brett Bergquist
      11. ReproDerby5552DB2.java
        4 kB
        Kathey Marsden
      12. ReproDerby5552LockTimeout.java
        5 kB
        Kathey Marsden
      13. transactionsleft.txt
        4 kB
        Brett Bergquist
      14. utilXid.java
        2 kB
        Kathey Marsden

        Issue Links

          Activity

          Hide
          Brett Bergquist added a comment -

          Derby log file of showing the deadlock reporting and lock timeout reporting

          Show
          Brett Bergquist added a comment - Derby log file of showing the deadlock reporting and lock timeout reporting
          Hide
          Brett Bergquist added a comment -

          Client side trace files

          Show
          Brett Bergquist added a comment - Client side trace files
          Hide
          Brett Bergquist added a comment -

          The transactions left over when there is no client connected (other than IJ querying this)

          Show
          Brett Bergquist added a comment - The transactions left over when there is no client connected (other than IJ querying this)
          Hide
          Brett Bergquist added a comment -

          The application server stack trace from jstack when the application server is hung at shutdown.

          Show
          Brett Bergquist added a comment - The application server stack trace from jstack when the application server is hung at shutdown.
          Hide
          Brett Bergquist added a comment -

          The Network Server stack trace using jstack when hung at shutdown

          Show
          Brett Bergquist added a comment - The Network Server stack trace using jstack when hung at shutdown
          Hide
          Brett Bergquist added a comment -

          I found one problem. In BasicNoPutResultImpl.java around line 250 is the code:

          try {
          LanguageConnectionContext lcc = getLanguageConnectionContext();
          if(lcc.getRunTimeStatisticsMode() && lcc.getXplainOnlyMode())

          { // do nothing }

          else

          { openCore(); }

          } catch (StandardException se)

          { activation.checkStatementValidity(); throw se; }

          For whatever reason (looks like something in my code), a lock timeout is being detected while executing in “openCore” and a StandardException is being thrown with SQLState “40XL2” and a report = 2 (REPORT_ALWAYS I Believe). The exception is caught and then the

          Activation.checkStatementValidity()

          Is being called. The statement appears not to be valid at that point and a new StandardExceptis thrown with SQLState.LANG_STATEMENT_NEEDS_RECOMPILE and report = REPORT_NEVER.

          The effect is that the Lock timeout exception is swallowed and the statement is recompiled and executed again. This occurs over and over and the loop never ends. This looks like the cause that the Network Server cannot shutdown in the bug.

          It seems to me that if the exception being caught above is to be “REPORT_ALWAYS”, then the code should never check the statement validity and should just rethrow the original exception.

          Show
          Brett Bergquist added a comment - I found one problem. In BasicNoPutResultImpl.java around line 250 is the code: try { LanguageConnectionContext lcc = getLanguageConnectionContext(); if(lcc.getRunTimeStatisticsMode() && lcc.getXplainOnlyMode()) { // do nothing } else { openCore(); } } catch (StandardException se) { activation.checkStatementValidity(); throw se; } For whatever reason (looks like something in my code), a lock timeout is being detected while executing in “openCore” and a StandardException is being thrown with SQLState “40XL2” and a report = 2 (REPORT_ALWAYS I Believe). The exception is caught and then the Activation.checkStatementValidity() Is being called. The statement appears not to be valid at that point and a new StandardExceptis thrown with SQLState.LANG_STATEMENT_NEEDS_RECOMPILE and report = REPORT_NEVER. The effect is that the Lock timeout exception is swallowed and the statement is recompiled and executed again. This occurs over and over and the loop never ends. This looks like the cause that the Network Server cannot shutdown in the bug. It seems to me that if the exception being caught above is to be “REPORT_ALWAYS”, then the code should never check the statement validity and should just rethrow the original exception.
          Hide
          Brett Bergquist added a comment -

          I patched the files to look like this and am running my test setup now. So far so good. In fact I am not seeing the deadlocks or lock timeouts at all now. Not sure why but it is all running smoothly with these changes:

          Index: BasicNoPutResultSetImpl.java
          ===================================================================
          — BasicNoPutResultSetImpl.java (revision 1222480)
          +++ BasicNoPutResultSetImpl.java (working copy)
          @@ -256,7 +256,9 @@
          }

          } catch (StandardException se) {

          • activation.checkStatementValidity();
            + if (se.report() != StandardException.REPORT_ALWAYS) { + activation.checkStatementValidity(); + }

            throw se;
            }

          Index: DeleteResultSet.java
          ===================================================================
          — DeleteResultSet.java (revision 1222480)
          +++ DeleteResultSet.java (working copy)
          @@ -201,7 +201,9 @@
          source.reopenCore();
          }
          } catch (StandardException se) {

          • activation.checkStatementValidity();
            + if (se.report() != se.REPORT_ALWAYS) { + activation.checkStatementValidity(); + }

            throw se;

          }
          Index: RowChangerImpl.java
          ===================================================================
          — RowChangerImpl.java (revision 1222480)
          +++ RowChangerImpl.java (working copy)
          @@ -372,7 +372,7 @@
          }

          } catch (StandardException se)

          { - if (activation != null) + if (se.report() != StandardException.REPORT_ALWAYS && activation != null) activation.checkStatementValidity(); throw se; }
          Show
          Brett Bergquist added a comment - I patched the files to look like this and am running my test setup now. So far so good. In fact I am not seeing the deadlocks or lock timeouts at all now. Not sure why but it is all running smoothly with these changes: Index: BasicNoPutResultSetImpl.java =================================================================== — BasicNoPutResultSetImpl.java (revision 1222480) +++ BasicNoPutResultSetImpl.java (working copy) @@ -256,7 +256,9 @@ } } catch (StandardException se) { activation.checkStatementValidity(); + if (se.report() != StandardException.REPORT_ALWAYS) { + activation.checkStatementValidity(); + } throw se; } Index: DeleteResultSet.java =================================================================== — DeleteResultSet.java (revision 1222480) +++ DeleteResultSet.java (working copy) @@ -201,7 +201,9 @@ source.reopenCore(); } } catch (StandardException se) { activation.checkStatementValidity(); + if (se.report() != se.REPORT_ALWAYS) { + activation.checkStatementValidity(); + } throw se; } Index: RowChangerImpl.java =================================================================== — RowChangerImpl.java (revision 1222480) +++ RowChangerImpl.java (working copy) @@ -372,7 +372,7 @@ } } catch (StandardException se) { - if (activation != null) + if (se.report() != StandardException.REPORT_ALWAYS && activation != null) activation.checkStatementValidity(); throw se; }
          Hide
          Brett Bergquist added a comment -

          Here is the patch that I did to the 10.8 branch. Basically all I did is to make sure not to swallow the exceptions if the

          StandardException.report() == StandardException.REPORT_ALWAYS

          Show
          Brett Bergquist added a comment - Here is the patch that I did to the 10.8 branch. Basically all I did is to make sure not to swallow the exceptions if the StandardException.report() == StandardException.REPORT_ALWAYS
          Hide
          Kathey Marsden added a comment -

          Hi Brett,

          I am glad you are making progress as I was clearly misunderstanding your scenario.

          Although the residual xa transactions are there in the database, I looked at the client traces and don't see any indication that XA transactions are being used in the test in fact even the names of the the trace files driver? indicate a driver manager connection and not an xa datasource connection which would have been created in files xads?. Since there were no XA transactions, there was not the SYNCCTL with possible missing SYNCCRD exchanges that I was looking for.

          Your logic on the above change sounds reasonable to me, but I don't think I am the best person to review that change. After you run tests, post a patch and mark patch available and I am sure the community will review and expedite commit of the change if it is the right thing. I think it would definitely be good to get a test checked in too so we don't regress.

          Show
          Kathey Marsden added a comment - Hi Brett, I am glad you are making progress as I was clearly misunderstanding your scenario. Although the residual xa transactions are there in the database, I looked at the client traces and don't see any indication that XA transactions are being used in the test in fact even the names of the the trace files driver ? indicate a driver manager connection and not an xa datasource connection which would have been created in files xads ?. Since there were no XA transactions, there was not the SYNCCTL with possible missing SYNCCRD exchanges that I was looking for. Your logic on the above change sounds reasonable to me, but I don't think I am the best person to review that change. After you run tests, post a patch and mark patch available and I am sure the community will review and expedite commit of the change if it is the right thing. I think it would definitely be good to get a test checked in too so we don't regress.
          Hide
          Brett Bergquist added a comment -

          I guess I am confused as well Kathey as I had the debugger attached and do see it going through the XA code in Derby on the client side. The application server is setup with the ClientXADataSource and I do see it calling xa.commit and xa.end for example. The ClientXADataSource is required otherwise the error:

          Local transaction already has 1 non-XA Resource: cannot add more resources.

          occurs. So although there is one database (Derby), it is using XA. The database is being accessed through EJB's and through Eclipselink and also through a custom JCA interface driving Message Driven Beans.

          For the test case, I had to limit things to get my sanity. So I stopped as much access to the database as I could but still trigger the problem. Eventually I got down to one thread of control being processed by EJB's which do start new transactions. Even with this one access going on, I hit the lockup issue that I posted. That is when I found the issue that I mention. So whether or not this is the real issue, I don't know but when I tried to get as simple a condition as possible, I ran into this.

          Thinking now, I don't understand why this would not be hit in a normal case of a lock timeout being thrown. The only thing that I can think of is that the Activation.checkStatementValidity() is seeing the statement as valid and not going to try to recompile it. Why it occurred in my case where I see the "isValid" member set to false, I don't know. I will try to hitch up the debugger and try to determine the difference so that I can understand it better.

          I do believe that the code should not swallow and exception such as a lock timeout being reported regardless if the statement is no longer reporting to be valid. This is definitely a condition that will cause an infinite loop of processing.

          Again, I appreciate the help and your time. If I gain an understanding of how the condition is triggered, I will look to write a test case for it. I am reading the Derby testing docs that are relating to use JUnit which I assume is the correct path for newer test cases, correct?

          Show
          Brett Bergquist added a comment - I guess I am confused as well Kathey as I had the debugger attached and do see it going through the XA code in Derby on the client side. The application server is setup with the ClientXADataSource and I do see it calling xa.commit and xa.end for example. The ClientXADataSource is required otherwise the error: Local transaction already has 1 non-XA Resource: cannot add more resources. occurs. So although there is one database (Derby), it is using XA. The database is being accessed through EJB's and through Eclipselink and also through a custom JCA interface driving Message Driven Beans. For the test case, I had to limit things to get my sanity. So I stopped as much access to the database as I could but still trigger the problem. Eventually I got down to one thread of control being processed by EJB's which do start new transactions. Even with this one access going on, I hit the lockup issue that I posted. That is when I found the issue that I mention. So whether or not this is the real issue, I don't know but when I tried to get as simple a condition as possible, I ran into this. Thinking now, I don't understand why this would not be hit in a normal case of a lock timeout being thrown. The only thing that I can think of is that the Activation.checkStatementValidity() is seeing the statement as valid and not going to try to recompile it. Why it occurred in my case where I see the "isValid" member set to false, I don't know. I will try to hitch up the debugger and try to determine the difference so that I can understand it better. I do believe that the code should not swallow and exception such as a lock timeout being reported regardless if the statement is no longer reporting to be valid. This is definitely a condition that will cause an infinite loop of processing. Again, I appreciate the help and your time. If I gain an understanding of how the condition is triggered, I will look to write a test case for it. I am reading the Derby testing docs that are relating to use JUnit which I assume is the correct path for newer test cases, correct?
          Hide
          Kathey Marsden added a comment -

          Yes, definitely you want your test to be the new Junit style so that is the place to look.

          I think I have solved the mystery of the missing XA client traces and filed DERBY-5553 for that issue.

          I am trying to think of a scenario where we we execute a valid statement would get a lock timeout and then at the time of processing the exception the statement would have become invalid. One reason for a statement becoming invalid would be for instance if a table had been altered, for example if we had a prepared statement "SELECT * FROM TAB" and then a column was added or dropped. In that scenario the statement would be recompiled long before the lock timeout occurred. If the problem statement is the one with the lock timeout it would seem to be:

          SELECT DISTINCT t3.ID, t3.OPLOCK FROM PKG_9145E10G.PLPM_PAIR t3, PKG_9145E10G.PLPM_CHASSIS_9145E10G_USER t2, CORE_V1.DEVICE_ENTITY_USER t1, CORE_V1.DEVICE_ENTITY t0 WHERE ((t0.ID = CAST (? AS INTEGER )) AND (((t2.PLPM_PAIR_ID = t3.ID) AND ((t2.ID = t1.ID) AND (t1.DTYPE = 'PlpmChassis9145E10GUser'))) AND (t0.ID = t1.DEVICE_ENTITY_ID))) with 1 parameters begin parameter #1: 249360 :end parameter
          ERROR 40XL2: A lock could not be obtained within the time requested. The lockTable dump is:
          Thu Dec 22 07:03:07 PST 2011

          I am not sure how that could have gone invalid between the initial execution and the lock time out.

          Show
          Kathey Marsden added a comment - Yes, definitely you want your test to be the new Junit style so that is the place to look. I think I have solved the mystery of the missing XA client traces and filed DERBY-5553 for that issue. I am trying to think of a scenario where we we execute a valid statement would get a lock timeout and then at the time of processing the exception the statement would have become invalid. One reason for a statement becoming invalid would be for instance if a table had been altered, for example if we had a prepared statement "SELECT * FROM TAB" and then a column was added or dropped. In that scenario the statement would be recompiled long before the lock timeout occurred. If the problem statement is the one with the lock timeout it would seem to be: SELECT DISTINCT t3.ID, t3.OPLOCK FROM PKG_9145E10G.PLPM_PAIR t3, PKG_9145E10G.PLPM_CHASSIS_9145E10G_USER t2, CORE_V1.DEVICE_ENTITY_USER t1, CORE_V1.DEVICE_ENTITY t0 WHERE ((t0.ID = CAST (? AS INTEGER )) AND (((t2.PLPM_PAIR_ID = t3.ID) AND ((t2.ID = t1.ID) AND (t1.DTYPE = 'PlpmChassis9145E10GUser'))) AND (t0.ID = t1.DEVICE_ENTITY_ID))) with 1 parameters begin parameter #1: 249360 :end parameter ERROR 40XL2: A lock could not be obtained within the time requested. The lockTable dump is: Thu Dec 22 07:03:07 PST 2011 I am not sure how that could have gone invalid between the initial execution and the lock time out.
          Hide
          Brett Bergquist added a comment -

          It is the IndexStatisticsDaemonImpl that is invalidating the statement. To verify this, I got my test case where it was in this loop. With my printouts I was seeing that the isValid was being set by the thread that was stuck right after the timeout occurred and then I immediately saw makeInvalid being called with action = 40 (UPDATE_STATISTICS). So I attached the debugger to the database engine and set a breakpoint in IndexStatisticsDaemonImpl and then set "daemonDisabled" to be "false". Immediately the lock timeout got reported and the loop terminated and my test case continued on the way it should.

          So the IndexStatisticsDaemonImpl has the ability to invalidate statements in use by other threads. In this case, it is being invoked right after lock timeout occurs and before the lock timeout exception is being processed. The code sees that the statement is invalid and swallows the lock exception and sets up to recompile and run the statement again.

          Show
          Brett Bergquist added a comment - It is the IndexStatisticsDaemonImpl that is invalidating the statement. To verify this, I got my test case where it was in this loop. With my printouts I was seeing that the isValid was being set by the thread that was stuck right after the timeout occurred and then I immediately saw makeInvalid being called with action = 40 (UPDATE_STATISTICS). So I attached the debugger to the database engine and set a breakpoint in IndexStatisticsDaemonImpl and then set "daemonDisabled" to be "false". Immediately the lock timeout got reported and the loop terminated and my test case continued on the way it should. So the IndexStatisticsDaemonImpl has the ability to invalidate statements in use by other threads. In this case, it is being invoked right after lock timeout occurs and before the lock timeout exception is being processed. The code sees that the statement is invalid and swallows the lock exception and sets up to recompile and run the statement again.
          Hide
          Brett Bergquist added a comment -

          If I disable the istat daemon, I still the the timeout but now the error is reported as it should and the transaction is aborted and my code continues. With the itstat deamon enabled, then the lock timeout is discovered and signaled by exception being thrown but between the time it gets caught and processed, the istat daemon has set the "isValid" property of the statement to "false". The original lock timeout exception is caught but then the

          Activation.checkStatementValidity()

          is called in the exception handler. This finds that the statement is not valid (isValid == false) and throws its own exception indicating that a recompilation is required, swallowing the original lock timeout exception in the process. This causes the existing statement to be rebuilt and re-executed but the locks are still present, the lock timeout will occur again, the lock timeout exception will be thrown, the istat daemon will set the "isValid" to false again, etc. The process just continues forever.

          The patch that I attached to the bug only calls

          Activation.checkStatementValidity()

          in the exception handler if the caught exception is not set to REPORT_ALWAYS. This seem okay and to make sense because if the code is handling an exception that is to be REPORT_ALWAYS, there is no sense looking for a statement that is invalid; just report the exception.

          Show
          Brett Bergquist added a comment - If I disable the istat daemon, I still the the timeout but now the error is reported as it should and the transaction is aborted and my code continues. With the itstat deamon enabled, then the lock timeout is discovered and signaled by exception being thrown but between the time it gets caught and processed, the istat daemon has set the "isValid" property of the statement to "false". The original lock timeout exception is caught but then the Activation.checkStatementValidity() is called in the exception handler. This finds that the statement is not valid (isValid == false) and throws its own exception indicating that a recompilation is required, swallowing the original lock timeout exception in the process. This causes the existing statement to be rebuilt and re-executed but the locks are still present, the lock timeout will occur again, the lock timeout exception will be thrown, the istat daemon will set the "isValid" to false again, etc. The process just continues forever. The patch that I attached to the bug only calls Activation.checkStatementValidity() in the exception handler if the caught exception is not set to REPORT_ALWAYS. This seem okay and to make sense because if the code is handling an exception that is to be REPORT_ALWAYS, there is no sense looking for a statement that is invalid; just report the exception.
          Hide
          Brett Bergquist added a comment -

          I have found the cause of the problem. When a lock timeout or deadlock is detected, the server calls XATransactionState.cleanupOnError. This looks like:

          public void cleanupOnError(Throwable t) {

          if (t instanceof StandardException) {

          StandardException se = (StandardException) t;

          if (se.getSeverity() >= ExceptionSeverity.SESSION_SEVERITY)

          { popMe(); return; }

          if (se.getSeverity() == ExceptionSeverity.TRANSACTION_SEVERITY) {

          synchronized (this)

          { // disable use of the connection until it is cleaned up. conn.setApplicationConnection(null); notifyAll(); associationState = TRO_FAIL; if (SQLState.DEADLOCK.equals(se.getMessageId())) rollbackOnlyCode = XAException.XA_RBDEADLOCK; else if (SQLState.LOCK_TIMEOUT.equals(se.getMessageId())) rollbackOnlyCode = XAException.XA_RBTIMEOUT; else rollbackOnlyCode = XAException.XA_RBOTHER; }

          }
          }
          }

          The problem is the line of code:

          conn.setApplicationConnection(null);

          The problem that occurs is on the client side, when the SQLException is received, it ends up calling Sqlca.getMessage() to retrieve the formatted exception message. This makes a call back down to the server on the connection and ends up calling EmbedStatement.checkStatus() and now the EmbedConnection has a null "applicationConnection" and a noCurrentConnection is throw. DRDA code that receives this exception in processing of Sqlca.getMessage() determines that there is a protocol error and disconnects from the server.

          The XA transaction that was in process never has "end" called on it and the XA transaction on the client side is now lost. Derby now has a XA transaction that will never end causing all kinds of havoc such as logging all new transactions in case the one lost ever does get rolled back. The file system fill up with transaction logs, restarting the database engine takes days, etc.

          I have commented out the above line and now the proper lock error is actually reported at the client. I don't know if there are any ramifications of doing so at this point however.

          Show
          Brett Bergquist added a comment - I have found the cause of the problem. When a lock timeout or deadlock is detected, the server calls XATransactionState.cleanupOnError. This looks like: public void cleanupOnError(Throwable t) { if (t instanceof StandardException) { StandardException se = (StandardException) t; if (se.getSeverity() >= ExceptionSeverity.SESSION_SEVERITY) { popMe(); return; } if (se.getSeverity() == ExceptionSeverity.TRANSACTION_SEVERITY) { synchronized (this) { // disable use of the connection until it is cleaned up. conn.setApplicationConnection(null); notifyAll(); associationState = TRO_FAIL; if (SQLState.DEADLOCK.equals(se.getMessageId())) rollbackOnlyCode = XAException.XA_RBDEADLOCK; else if (SQLState.LOCK_TIMEOUT.equals(se.getMessageId())) rollbackOnlyCode = XAException.XA_RBTIMEOUT; else rollbackOnlyCode = XAException.XA_RBOTHER; } } } } The problem is the line of code: conn.setApplicationConnection(null); The problem that occurs is on the client side, when the SQLException is received, it ends up calling Sqlca.getMessage() to retrieve the formatted exception message. This makes a call back down to the server on the connection and ends up calling EmbedStatement.checkStatus() and now the EmbedConnection has a null "applicationConnection" and a noCurrentConnection is throw. DRDA code that receives this exception in processing of Sqlca.getMessage() determines that there is a protocol error and disconnects from the server. The XA transaction that was in process never has "end" called on it and the XA transaction on the client side is now lost. Derby now has a XA transaction that will never end causing all kinds of havoc such as logging all new transactions in case the one lost ever does get rolled back. The file system fill up with transaction logs, restarting the database engine takes days, etc. I have commented out the above line and now the proper lock error is actually reported at the client. I don't know if there are any ramifications of doing so at this point however.
          Hide
          Brett Bergquist added a comment -

          This patch makes sure that if a StandardException is set to REPORT_ALWAYS, no check is made to see if the statement is invalid and rebuild; the exception is just thrown. This stops lock timeouts and deadlocks from being swallowed.

          Show
          Brett Bergquist added a comment - This patch makes sure that if a StandardException is set to REPORT_ALWAYS, no check is made to see if the statement is invalid and rebuild; the exception is just thrown. This stops lock timeouts and deadlocks from being swallowed.
          Hide
          Brett Bergquist added a comment -

          This patch removed the null'ing out of the application connection during cleanup from a lock timeout or lock deadlock. The application connection is needed because the client code calls back on the connection to retrieve the formatted exception message and if this application connection is null, then a NoCurrentConnection SQLException is thrown during this. This causes the client to see a DRDA protocol exception and terminate without ever ending an in progress XA transaction. The XA transaction is now hanging around in the database, the client side has lost track of it, and all manner of things go wrong from that point.

          Show
          Brett Bergquist added a comment - This patch removed the null'ing out of the application connection during cleanup from a lock timeout or lock deadlock. The application connection is needed because the client code calls back on the connection to retrieve the formatted exception message and if this application connection is null, then a NoCurrentConnection SQLException is thrown during this. This causes the client to see a DRDA protocol exception and terminate without ever ending an in progress XA transaction. The XA transaction is now hanging around in the database, the client side has lost track of it, and all manner of things go wrong from that point.
          Hide
          Kathey Marsden added a comment -

          I'll work on a test case for this issue and run the patch through regression tests. I wouldn't think a lock timeout wouldn't be a transaction level error but a deadlock would be.

          if (se.getSeverity() == ExceptionSeverity.TRANSACTION_SEVERITY) {

          synchronized (this) {
          // disable use of the connection until it is cleaned up.
          conn.setApplicationConnection(null);

          It makes sense that the connection is needed to retrieve the error message text, but clearly someone at some point thought it was important not to let anything happen on this connection until the transaction was cleaned up. It would be good to understand the scenarios for which this this code was introduced.

          Show
          Kathey Marsden added a comment - I'll work on a test case for this issue and run the patch through regression tests. I wouldn't think a lock timeout wouldn't be a transaction level error but a deadlock would be. if (se.getSeverity() == ExceptionSeverity.TRANSACTION_SEVERITY) { synchronized (this) { // disable use of the connection until it is cleaned up. conn.setApplicationConnection(null); It makes sense that the connection is needed to retrieve the error message text, but clearly someone at some point thought it was important not to let anything happen on this connection until the transaction was cleaned up. It would be good to understand the scenarios for which this this code was introduced.
          Hide
          Brett Bergquist added a comment -

          I checked the original code and this has been in there since the initial import as far as I can tell from Subversion's history.

          I also just check the code more. The cleanup is synchronous (I believe this is true), so that thread of control will not use the connection until it returns.

          As for other threads, each of the other methods are synchronized on the XATransactionState instance so no other thread will be allowed on the XATransactionState instance while the cleanup is in progress as it also synchronizes on the XATransactionState instance.

          There are 3 other places where conn.setApplicationConnection(null) is called but all three are in XATransactionState.end(), which again, is synchronized on the XATransactionState instance.

          Show
          Brett Bergquist added a comment - I checked the original code and this has been in there since the initial import as far as I can tell from Subversion's history. I also just check the code more. The cleanup is synchronous (I believe this is true), so that thread of control will not use the connection until it returns. As for other threads, each of the other methods are synchronized on the XATransactionState instance so no other thread will be allowed on the XATransactionState instance while the cleanup is in progress as it also synchronizes on the XATransactionState instance. There are 3 other places where conn.setApplicationConnection(null) is called but all three are in XATransactionState.end(), which again, is synchronized on the XATransactionState instance.
          Hide
          Dag H. Wanvik added a comment -

          What about getApplicationState which reads applicationConnection? For example, it seems this is called unsynchronized, e.g from EmbedResultSet#movePosition -> checkExecIfClosed -> getEmbedConnection().getApplicationConnection(). If applicationConnection is no nulled, could that break the logic in checkExecIfClosed? There are other cases of calls to getApplicationState which I didn't look at yet.

          Show
          Dag H. Wanvik added a comment - What about getApplicationState which reads applicationConnection? For example, it seems this is called unsynchronized, e.g from EmbedResultSet#movePosition -> checkExecIfClosed -> getEmbedConnection().getApplicationConnection(). If applicationConnection is no nulled, could that break the logic in checkExecIfClosed? There are other cases of calls to getApplicationState which I didn't look at yet.
          Hide
          Brett Bergquist added a comment -

          I guess where I am coming from is that the cleanup for a lock timeout or deadlock is NOT closing the connection nor should it be. The connection IS going to be used again right after the cleanup returns the error to retrieve the exception message text. So nulling out the connection is wrong the way it is. The other places where the connection is nulled out are when the XA end is being processed. Now the application level connection is no longer being used because XA has been signaled that there is no more interaction using the connection, so to me this make sense.

          One thing to remember is that the change is only affecting the XA transactions, not the general usagen.

          As for the other calls not being synchronized, maybe they should be but I don't know.

          Show
          Brett Bergquist added a comment - I guess where I am coming from is that the cleanup for a lock timeout or deadlock is NOT closing the connection nor should it be. The connection IS going to be used again right after the cleanup returns the error to retrieve the exception message text. So nulling out the connection is wrong the way it is. The other places where the connection is nulled out are when the XA end is being processed. Now the application level connection is no longer being used because XA has been signaled that there is no more interaction using the connection, so to me this make sense. One thing to remember is that the change is only affecting the XA transactions, not the general usagen. As for the other calls not being synchronized, maybe they should be but I don't know.
          Hide
          Kathey Marsden added a comment -

          The attached reproduction ReproDerby5552LockTimeout.java shows the loss of connection with a lock timeout and resulting protocol error. It doesn't show the hang, but as I understand it that might be a separate synchronization issue.
          $ java ReproDerby5552LockTimeout
          Wed Jan 04 15:34:10 PST 2012 : Apache Derby Network Server - 10.9.0.0 alpha - (1227344) started and ready to accept conn
          ections on port 1597
          Expected Exception java.sql.SQLTransactionRollbackException: DERBY SQL error: SQLCODE: -1, SQLSTATE: 40XL1, SQLERRMC: 40
          XL1
          Connection ok. got right value
          Exception in thread "main" java.sql.SQLNonTransientConnectionException: A network protocol error was encountered and the
          connection has been terminated: the requested command encountered an unarchitected and implementation-specific conditio
          n for which there was no architected message (additional information may be available in the derby.log file on the serve
          r)
          at org.apache.derby.client.am.SQLExceptionFactory40.getSQLException(SQLExceptionFactory40.java:71)
          at org.apache.derby.client.am.SqlException.getSQLException(SqlException.java:364)
          at org.apache.derby.client.am.Connection.createStatement(Connection.java:379)
          at org.apache.derby.client.am.LogicalConnection.createStatement(LogicalConnection.java:164)
          at ReproDerby5552LockTimeout.checkConn(ReproDerby5552LockTimeout.java:85)
          at ReproDerby5552LockTimeout.main(ReproDerby5552LockTimeout.java:66)
          Caused by: org.apache.derby.client.am.SqlException: A network protocol error was encountered and the connection has been
          terminated: the requested command encountered an unarchitected and implementation-specific condition for which there wa
          s no architected message (additional information may be available in the derby.log file on the server)
          at org.apache.derby.client.am.SqlException.copyAsUnchainedSQLException(SqlException.java:511)
          at org.apache.derby.client.am.Sqlca.chainDeferredExceptionsToAgentOrAsConnectionWarnings(Sqlca.java:346)
          at org.apache.derby.client.am.Sqlca.getJDBCMessage(Sqlca.java:308)
          at org.apache.derby.client.am.SqlException.getMessage(SqlException.java:408)
          at org.apache.derby.client.am.SqlException.getSQLException(SqlException.java:364)
          at org.apache.derby.client.am.Statement.executeQuery(Statement.java:486)
          at ReproDerby5552LockTimeout.main(ReproDerby5552LockTimeout.java:57)

          Show
          Kathey Marsden added a comment - The attached reproduction ReproDerby5552LockTimeout.java shows the loss of connection with a lock timeout and resulting protocol error. It doesn't show the hang, but as I understand it that might be a separate synchronization issue. $ java ReproDerby5552LockTimeout Wed Jan 04 15:34:10 PST 2012 : Apache Derby Network Server - 10.9.0.0 alpha - (1227344) started and ready to accept conn ections on port 1597 Expected Exception java.sql.SQLTransactionRollbackException: DERBY SQL error: SQLCODE: -1, SQLSTATE: 40XL1, SQLERRMC: 40 XL1 Connection ok. got right value Exception in thread "main" java.sql.SQLNonTransientConnectionException: A network protocol error was encountered and the connection has been terminated: the requested command encountered an unarchitected and implementation-specific conditio n for which there was no architected message (additional information may be available in the derby.log file on the serve r) at org.apache.derby.client.am.SQLExceptionFactory40.getSQLException(SQLExceptionFactory40.java:71) at org.apache.derby.client.am.SqlException.getSQLException(SqlException.java:364) at org.apache.derby.client.am.Connection.createStatement(Connection.java:379) at org.apache.derby.client.am.LogicalConnection.createStatement(LogicalConnection.java:164) at ReproDerby5552LockTimeout.checkConn(ReproDerby5552LockTimeout.java:85) at ReproDerby5552LockTimeout.main(ReproDerby5552LockTimeout.java:66) Caused by: org.apache.derby.client.am.SqlException: A network protocol error was encountered and the connection has been terminated: the requested command encountered an unarchitected and implementation-specific condition for which there wa s no architected message (additional information may be available in the derby.log file on the server) at org.apache.derby.client.am.SqlException.copyAsUnchainedSQLException(SqlException.java:511) at org.apache.derby.client.am.Sqlca.chainDeferredExceptionsToAgentOrAsConnectionWarnings(Sqlca.java:346) at org.apache.derby.client.am.Sqlca.getJDBCMessage(Sqlca.java:308) at org.apache.derby.client.am.SqlException.getMessage(SqlException.java:408) at org.apache.derby.client.am.SqlException.getSQLException(SqlException.java:364) at org.apache.derby.client.am.Statement.executeQuery(Statement.java:486) at ReproDerby5552LockTimeout.main(ReproDerby5552LockTimeout.java:57)
          Hide
          Dag H. Wanvik added a comment -

          Tried the repro, I had to replace "new utilXid" with XATestUtil.getXid calls, then it worked. FYI, the timeout property is misspelt as "derby.locks.locktimeout", I had to change it to "derby.locks.waitTimeout", not that it matters, I just was suprprised I had to wait to long (default is 60s) for the timeout to happen.

          Interestingly, if I use EmbeddedXADataSource instead of client ditto, the following call throws:

          checkConn(conn2);

          again, due to the error cleanup having nulled out
          "applicationConnection", cf. in this stack trace the frame
          EmbedStatement#getConnection which has this code:

          java.sql.Connection appConn = getEmbedConnection().getApplicationConnection();

          if ((appConn != applicationConnection) || (appConn == null)) {
          throw Util.noCurrentConnection();

          ---------------------------------------------------------------- repro output
          Expected Exception java.sql.SQLTransactionRollbackException: A
          lock could not be obtained within the time requested Connection
          ok. got right value Exception in thread "main"
          java.sql.SQLNonTransientConnectionException: No current connection.
          at
          org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(SQLExceptionFactory40.java:77)
          at org.apache.derby.impl.jdbc.Util.newEmbedSQLException(Util.java:142)
          at org.apache.derby.impl.jdbc.Util.newEmbedSQLException(Util.java:158)
          at org.apache.derby.impl.jdbc.Util.noCurrentConnection(Util.java:262)
          at
          org.apache.derby.impl.jdbc.EmbedStatement.getConnection(EmbedStatement.java:1039)
          at
          org.apache.derby.iapi.jdbc.BrokeredStatement.controlCheck(BrokeredStatement.java:525)
          at
          org.apache.derby.iapi.jdbc.BrokeredStatement.getResultSetHoldability(BrokeredStatement.java:469)
          at
          org.apache.derby.iapi.jdbc.BrokeredStatement.<init>(BrokeredStatement.java:63)
          at
          org.apache.derby.iapi.jdbc.BrokeredStatement40.<init>(BrokeredStatement40.java:37)
          at
          org.apache.derby.iapi.jdbc.BrokeredConnection40.newBrokeredStatement(BrokeredConnection40.java:260)
          at
          org.apache.derby.jdbc.XAStatementControl.<init>(XAStatementControl.java:64)
          at
          org.apache.derby.jdbc.EmbedXAConnection.wrapStatement(EmbedXAConnection.java:192)
          at
          org.apache.derby.iapi.jdbc.BrokeredConnection.createStatement(BrokeredConnection.java:100)
          at derby5552repro.Derby5552Repro.checkConn(Derby5552Repro.java:96) at
          derby5552repro.Derby5552Repro.main(Derby5552Repro.java:76) Caused by:
          java.sql.SQLException: No current connection. at
          org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(SQLExceptionFactory.java:42)
          at
          org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(SQLExceptionFactory40.java:122)
          at
          org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(SQLExceptionFactory40.java:71)
          ... 14 more Java Result: 1 BUILD SUCCESSFUL (total time: 21 minutes 51
          seconds)

          Show
          Dag H. Wanvik added a comment - Tried the repro, I had to replace "new utilXid" with XATestUtil.getXid calls, then it worked. FYI, the timeout property is misspelt as "derby.locks.locktimeout", I had to change it to "derby.locks.waitTimeout", not that it matters, I just was suprprised I had to wait to long (default is 60s) for the timeout to happen. Interestingly, if I use EmbeddedXADataSource instead of client ditto, the following call throws: checkConn(conn2); again, due to the error cleanup having nulled out "applicationConnection", cf. in this stack trace the frame EmbedStatement#getConnection which has this code: java.sql.Connection appConn = getEmbedConnection().getApplicationConnection(); if ((appConn != applicationConnection) || (appConn == null)) { throw Util.noCurrentConnection(); ---------------------------------------------------------------- repro output Expected Exception java.sql.SQLTransactionRollbackException: A lock could not be obtained within the time requested Connection ok. got right value Exception in thread "main" java.sql.SQLNonTransientConnectionException: No current connection. at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(SQLExceptionFactory40.java:77) at org.apache.derby.impl.jdbc.Util.newEmbedSQLException(Util.java:142) at org.apache.derby.impl.jdbc.Util.newEmbedSQLException(Util.java:158) at org.apache.derby.impl.jdbc.Util.noCurrentConnection(Util.java:262) at org.apache.derby.impl.jdbc.EmbedStatement.getConnection(EmbedStatement.java:1039) at org.apache.derby.iapi.jdbc.BrokeredStatement.controlCheck(BrokeredStatement.java:525) at org.apache.derby.iapi.jdbc.BrokeredStatement.getResultSetHoldability(BrokeredStatement.java:469) at org.apache.derby.iapi.jdbc.BrokeredStatement.<init>(BrokeredStatement.java:63) at org.apache.derby.iapi.jdbc.BrokeredStatement40.<init>(BrokeredStatement40.java:37) at org.apache.derby.iapi.jdbc.BrokeredConnection40.newBrokeredStatement(BrokeredConnection40.java:260) at org.apache.derby.jdbc.XAStatementControl.<init>(XAStatementControl.java:64) at org.apache.derby.jdbc.EmbedXAConnection.wrapStatement(EmbedXAConnection.java:192) at org.apache.derby.iapi.jdbc.BrokeredConnection.createStatement(BrokeredConnection.java:100) at derby5552repro.Derby5552Repro.checkConn(Derby5552Repro.java:96) at derby5552repro.Derby5552Repro.main(Derby5552Repro.java:76) Caused by: java.sql.SQLException: No current connection. at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(SQLExceptionFactory.java:42) at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(SQLExceptionFactory40.java:122) at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(SQLExceptionFactory40.java:71) ... 14 more Java Result: 1 BUILD SUCCESSFUL (total time: 21 minutes 51 seconds)
          Hide
          Kathey Marsden added a comment -

          Here is Brett's patch with a regression test. I am still not clear on the reason that the application connection was being set to null. There are other instances in the same files where this same approach is used on failure. The connection seems still perfectly valid for use after the rollback.

          Show
          Kathey Marsden added a comment - Here is Brett's patch with a regression test. I am still not clear on the reason that the application connection was being set to null. There are other instances in the same files where this same approach is used on failure. The connection seems still perfectly valid for use after the rollback.
          Hide
          Kathey Marsden added a comment -

          reattaching patch to make minor comment correction. I still would like someone more familiar with this area to look and see removing the nulling of the connection is going to cause problems in some scenarios and what those scenarios might be.

          Show
          Kathey Marsden added a comment - reattaching patch to make minor comment correction. I still would like someone more familiar with this area to look and see removing the nulling of the connection is going to cause problems in some scenarios and what those scenarios might be.
          Hide
          Mike Matrigali added a comment -

          i tried to review the change, but am not familar with the assumptions at this level. I agree with your concern that previous comment
          seems to be indicating it wants to disable the connection. The comment "disable use of the connection until it is cleaned up.", seems
          to indicate maybe it meant to reenable it after the work of:
          notifyAll();
          associationState = TRO_FAIL;
          if (SQLState.DEADLOCK.equals(se.getMessageId()))
          rollbackOnlyCode = XAException.XA_RBDEADLOCK;
          else if (se.isLockTimeout())
          rollbackOnlyCode = XAException.XA_RBTIMEOUT;
          else
          rollbackOnlyCode = XAException.XA_RBOTHER;

          I reviewed the other comments in the issue, and agree with Brett's analysis of calls to cleanupOnError, and the single thread nature.

          Does anyone think it worthwhile to null out the connection, perform the above code, and then reset the connection?

          Show
          Mike Matrigali added a comment - i tried to review the change, but am not familar with the assumptions at this level. I agree with your concern that previous comment seems to be indicating it wants to disable the connection. The comment "disable use of the connection until it is cleaned up.", seems to indicate maybe it meant to reenable it after the work of: notifyAll(); associationState = TRO_FAIL; if (SQLState.DEADLOCK.equals(se.getMessageId())) rollbackOnlyCode = XAException.XA_RBDEADLOCK; else if (se.isLockTimeout()) rollbackOnlyCode = XAException.XA_RBTIMEOUT; else rollbackOnlyCode = XAException.XA_RBOTHER; I reviewed the other comments in the issue, and agree with Brett's analysis of calls to cleanupOnError, and the single thread nature. Does anyone think it worthwhile to null out the connection, perform the above code, and then reset the connection?
          Hide
          Brett Bergquist added a comment -

          It would be great to have picked the mind of the author of the comment and code but it appeared this was this way since the beginning. I don't think it would hurt to do what you suggest but just to let you know, this code is not in production being exercised extensively (private build with this in it as I had a need to fix the problem now) with no issues found. Just some empirical testing results.

          Thanks for taking the time to look at it.

          Brett

          Show
          Brett Bergquist added a comment - It would be great to have picked the mind of the author of the comment and code but it appeared this was this way since the beginning. I don't think it would hurt to do what you suggest but just to let you know, this code is not in production being exercised extensively (private build with this in it as I had a need to fix the problem now) with no issues found. Just some empirical testing results. Thanks for taking the time to look at it. Brett
          Hide
          Kathey Marsden added a comment -

          Triage for 10.9.
          This issue is in progress and should be committed soon.

          I think I am going to just double check that subsequent operations on the connection start a new local transaction and that the old global transaction is really dead. The test confirms we cannot do an XA end but should probably also check the transaction table. If all that checks out I will check in the change as is.

          Thanks Mike and Brett for the input.

          Show
          Kathey Marsden added a comment - Triage for 10.9. This issue is in progress and should be committed soon. I think I am going to just double check that subsequent operations on the connection start a new local transaction and that the old global transaction is really dead. The test confirms we cannot do an XA end but should probably also check the transaction table. If all that checks out I will check in the change as is. Thanks Mike and Brett for the input.
          Hide
          Kathey Marsden added a comment -

          Attached is a new patch derby-5552_withexpanded_test_diff.txt which expands the test to check the state of the connections and the transaction table after the lock timeout.

          I think most things are working as expected.

          • The transaction (xid2) that was rolled back due the lock timeout is no longer in the transaction table after the lock time out.
          • New statements on the connection after the lock timeout start a new local transaction.
          • XAResource.end() on xid2 fails with an RB_TIMEOUT
          • Even though it is not in the transaction table and is holding no locks, xid2 has to be explicitly rolled back before it can be reused. This is the one I am not sure about. Should this be necessary?
          Show
          Kathey Marsden added a comment - Attached is a new patch derby-5552_withexpanded_test_diff.txt which expands the test to check the state of the connections and the transaction table after the lock timeout. I think most things are working as expected. The transaction (xid2) that was rolled back due the lock timeout is no longer in the transaction table after the lock time out. New statements on the connection after the lock timeout start a new local transaction. XAResource.end() on xid2 fails with an RB_TIMEOUT Even though it is not in the transaction table and is holding no locks, xid2 has to be explicitly rolled back before it can be reused. This is the one I am not sure about. Should this be necessary?
          Hide
          Kathey Marsden added a comment -

          Running derbyall I see one failure a test specifically for this case expecting no connection.
          $ cat xaSimpleNegative.diff

              • Start: xaSimpleNegative jdk1.7.0 derbyall:xa 2012-02-23 14:23:29 ***
                202 del
                < ERROR 08003: No current connection.
                202a202,204
                > A |B
                > ----------------------
                > ERROR 40XL1: A lock could not be obtained within the time requested
                Test Failed.
              • End: xaSimpleNegative jdk1.7.0 derbyall:xa 2012-02-23 14:23:45 ***

          Specifically the master xaSimpleNegative.out shows we were expecting the connection to go away after the lock timeout. It also shows that we expected to be able to still explicitly rollback the transaction. I think that is not something that would be persistent as the transaction is no longer in the transaction table.

          ij(XA)> – ERROR: deadlock, transaction trashed
          select * from APP.negative;
          A |B
          ----------------------
          ERROR 40XL1: A lock could not be obtained within the time requested
          ij(XA)> – ERROR: should have no connection underneath
          select * from APP.negative;
          ERROR 08003: No current connection.
          ij(XA)> – ERROR: should have no connection underneath and xid 2 is gone
          xa_end xa_suspend 2;
          IJ ERROR: XA_RBTIMEOUT
          ij(XA)> – ERROR: should have no connection underneath and xid 2 is gone
          xa_end xa_fail 2;
          IJ ERROR: XA_RBTIMEOUT
          ij(XA)> xa_rollback 2;
          ij(XA)> disconnect;

          So I am not sure exactly what to think. Maybe it would be worthwhile to see what other databases do.

          As an aside in my google research, I noticed I actually asked about this back in 2004!
          http://mail-archives.apache.org/mod_mbox/db-derby-dev/200411.mbox/%3C4186AA22.7000305@Sourcery.Org%3E

          Show
          Kathey Marsden added a comment - Running derbyall I see one failure a test specifically for this case expecting no connection. $ cat xaSimpleNegative.diff Start: xaSimpleNegative jdk1.7.0 derbyall:xa 2012-02-23 14:23:29 *** 202 del < ERROR 08003: No current connection. 202a202,204 > A |B > ---------------------- > ERROR 40XL1: A lock could not be obtained within the time requested Test Failed. End: xaSimpleNegative jdk1.7.0 derbyall:xa 2012-02-23 14:23:45 *** Specifically the master xaSimpleNegative.out shows we were expecting the connection to go away after the lock timeout. It also shows that we expected to be able to still explicitly rollback the transaction. I think that is not something that would be persistent as the transaction is no longer in the transaction table. ij(XA)> – ERROR: deadlock, transaction trashed select * from APP.negative; A |B ---------------------- ERROR 40XL1: A lock could not be obtained within the time requested ij(XA)> – ERROR: should have no connection underneath select * from APP.negative; ERROR 08003: No current connection. ij(XA)> – ERROR: should have no connection underneath and xid 2 is gone xa_end xa_suspend 2; IJ ERROR: XA_RBTIMEOUT ij(XA)> – ERROR: should have no connection underneath and xid 2 is gone xa_end xa_fail 2; IJ ERROR: XA_RBTIMEOUT ij(XA)> xa_rollback 2; ij(XA)> disconnect; So I am not sure exactly what to think. Maybe it would be worthwhile to see what other databases do. As an aside in my google research, I noticed I actually asked about this back in 2004! http://mail-archives.apache.org/mod_mbox/db-derby-dev/200411.mbox/%3C4186AA22.7000305@Sourcery.Org%3E
          Hide
          Kathey Marsden added a comment -

          Well, I tried this out with DB2 and things are interesting and different. The first thing of note is that the default for DB2 is no lock timeout. I had to go into Control Center and change the database application parameter LOCKTIMEOUT to be something other than -1.

          With DB2 after the lock timeout DB2 doesn't let you use the connection saying
          Exception trying to check conn2 after lock timeout but before explicit rollback
          com.ibm.db2.jcc.am.SqlException: [jcc][t4][10342][11669][4.12.55] Application must execute a rollback. The unit of w
          has already been rolled back in the database but other resource managers involved in this unit of work might not. To
          ure integrity of this application, all SQL requests will be rejected until the application issues a rollback. ERROR
          =-4497, SQLSTATE=null

          Then it won't actually let you do a rollback, but will let you do an end. If you try rollback it fails with XAER_PROTO but a rollback after end fails with XAER_NOTA if you try to rollback after end. Statements after the end fail with XA_RBTIMEOUT so I am not sure actually how one could use the connection again.

          Below is the output and attached is the program I was playing with in case anyone is interested, but I think the main thing I have learned is that since the spec doesn't really offer any guidance on this, we just need to do something reasonable and make sure that the global transaction doesn't get used again except to rollback, which I think the behavior with the patch does effectively as the activity that happens on the connection after the implicit rollback happens in a new local transaction.
          So I will go ahead and check in the latest patch with the expanded test.

          $ java -Duser=xxxxx -Dpassword=xxx ReproDerby5552DB2
          Got Expected Lock Timeout Exception DB2 SQL Error: SQLCODE=-911, SQLSTATE=40001, SQLERRMC=68, DRIVER=4.12.55
          Connection ok. got right value
          Exception trying to check conn2 after lock timeout but before explicit rollback
          com.ibm.db2.jcc.am.SqlException: [jcc][t4][10342][11669][4.12.55] Application must execute a rollback. The unit of work
          has already been rolled back in the database but other resource managers involved in this unit of work might not. To ens
          ure integrity of this application, all SQL requests will be rejected until the application issues a rollback. ERRORCODE
          =-4497, SQLSTATE=null
          at com.ibm.db2.jcc.am.hd.a(hd.java:660)
          at com.ibm.db2.jcc.am.hd.a(hd.java:60)
          at com.ibm.db2.jcc.am.hd.a(hd.java:75)
          at com.ibm.db2.jcc.am.o.f(o.java:537)
          at com.ibm.db2.jcc.am.o.a(o.java:495)
          at com.ibm.db2.jcc.am.Sqlca.getJDBCMessage(Sqlca.java:334)
          at com.ibm.db2.jcc.am.SqlExceptionContainer.getMessage(SqlExceptionContainer.java:78)
          at com.ibm.db2.jcc.am.SqlTransactionRollbackException.getMessage(SqlTransactionRollbackException.java:52)
          at ReproDerby5552DB2.main(ReproDerby5552DB2.java:61)
          Did not get exception on end as with Derby
          Got Exception checking conn2 after end
          com.ibm.db2.jcc.am.SqlException: [jcc][t4][2041][11392][4.12.55] Error executing XAResource.end(). Server returned XA_R
          BTIMEOUT. ERRORCODE=-4203, SQLSTATE=null
          at com.ibm.db2.jcc.am.hd.a(hd.java:660)
          at com.ibm.db2.jcc.am.hd.a(hd.java:60)
          at com.ibm.db2.jcc.am.hd.a(hd.java:94)
          at com.ibm.db2.jcc.t4.zb.a(zb.java:2755)
          at com.ibm.db2.jcc.t4.c.Xb(c.java:271)
          at com.ibm.db2.jcc.am.o.g(o.java:340)
          at com.ibm.db2.jcc.t4.a.g(a.java:631)
          at com.ibm.db2.jcc.am.o.a(o.java:214)
          at com.ibm.db2.jcc.am.mn.a(mn.java:3073)
          at com.ibm.db2.jcc.am.mn.a(mn.java:686)
          at com.ibm.db2.jcc.am.mn.executeQuery(mn.java:670)
          at ReproDerby5552DB2.checkConn(ReproDerby5552DB2.java:106)
          at ReproDerby5552DB2.main(ReproDerby5552DB2.java:86)
          Rolling back xid2
          Exception in thread "main" com.ibm.db2.jcc.am.XaException: [jcc][t4][2041][12326][4.12.55] Error executing XAResource.ro
          llback(). Server returned XAER_NOTA. ERRORCODE=-4203, SQLSTATE=null
          at com.ibm.db2.jcc.am.hd.c(hd.java:453)
          at com.ibm.db2.jcc.t4.zb.b(zb.java:2773)
          at com.ibm.db2.jcc.t4.ac.b(ac.java:1546)
          at com.ibm.db2.jcc.t4.ac.a(ac.java:1326)
          at com.ibm.db2.jcc.t4.ac.a(ac.java:1321)
          at com.ibm.db2.jcc.t4.ac.rollback(ac.java:1310)
          at ReproDerby5552DB2.main(ReproDerby5552DB2.java:94)

          kmarsden@IBM-JDPM42DBIO2 ~/repro/derby-5552
          $

          $ java ReproDerby5552DB2
          Got Expected Lock Timeout Exception DB2 SQL Error: SQLCODE=-911, SQLSTATE=40001, SQLERRMC=6
          Connection ok. got right value
          Exception trying to check conn2 after lock timeout but before explicit rollback
          com.ibm.db2.jcc.am.SqlException: [jcc][t4][10342][11669][4.12.55] Application must execute
          has already been rolled back in the database but other resource managers involved in this u
          ure integrity of this application, all SQL requests will be rejected until the application
          =-4497, SQLSTATE=null
          at com.ibm.db2.jcc.am.hd.a(hd.java:660)
          at com.ibm.db2.jcc.am.hd.a(hd.java:60)
          at com.ibm.db2.jcc.am.hd.a(hd.java:75)
          at com.ibm.db2.jcc.am.o.f(o.java:537)
          at com.ibm.db2.jcc.am.o.a(o.java:495)
          at com.ibm.db2.jcc.am.Sqlca.getJDBCMessage(Sqlca.java:334)
          at com.ibm.db2.jcc.am.SqlExceptionContainer.getMessage(SqlExceptionContainer.java:7
          at com.ibm.db2.jcc.am.SqlTransactionRollbackException.getMessage(SqlTransactionRoll
          at ReproDerby5552DB2.main(ReproDerby5552DB2.java:61)
          Did not get exception on end as with Derby
          Rolling back xid2
          Exception in thread "main" com.ibm.db2.jcc.am.XaException: [jcc][t4][2041][12326][4.12.55]
          llback(). Server returned XAER_NOTA. ERRORCODE=-4203, SQLSTATE=null
          at com.ibm.db2.jcc.am.hd.c(hd.java:453)
          at com.ibm.db2.jcc.t4.zb.b(zb.java:2773)
          at com.ibm.db2.jcc.t4.ac.b(ac.java:1546)
          at com.ibm.db2.jcc.t4.ac.a(ac.java:1326)
          at com.ibm.db2.jcc.t4.ac.a(ac.java:1321)
          at com.ibm.db2.jcc.t4.ac.rollback(ac.java:1310)
          at ReproDerby5552DB2.main(ReproDerby5552DB2.java:86)
          Caused by: com.ibm.db2.jcc.am.XaException: [jcc][t4][2041][12326][4.12.55] Error executing
          urned XA_RBTIMEOUT. ERRORCODE=-4203, SQLSTATE=null
          at com.ibm.db2.jcc.am.hd.c(hd.java:453)
          at com.ibm.db2.jcc.t4.zb.b(zb.java:2773)
          at com.ibm.db2.jcc.t4.ac.b(ac.java:1520)
          ... 4 more

          kmarsden@IBM-JDPM42DBIO2 ~/repro/derby-5552
          $

          Show
          Kathey Marsden added a comment - Well, I tried this out with DB2 and things are interesting and different. The first thing of note is that the default for DB2 is no lock timeout. I had to go into Control Center and change the database application parameter LOCKTIMEOUT to be something other than -1. With DB2 after the lock timeout DB2 doesn't let you use the connection saying Exception trying to check conn2 after lock timeout but before explicit rollback com.ibm.db2.jcc.am.SqlException: [jcc] [t4] [10342] [11669] [4.12.55] Application must execute a rollback. The unit of w has already been rolled back in the database but other resource managers involved in this unit of work might not. To ure integrity of this application, all SQL requests will be rejected until the application issues a rollback. ERROR =-4497, SQLSTATE=null Then it won't actually let you do a rollback, but will let you do an end. If you try rollback it fails with XAER_PROTO but a rollback after end fails with XAER_NOTA if you try to rollback after end. Statements after the end fail with XA_RBTIMEOUT so I am not sure actually how one could use the connection again. Below is the output and attached is the program I was playing with in case anyone is interested, but I think the main thing I have learned is that since the spec doesn't really offer any guidance on this, we just need to do something reasonable and make sure that the global transaction doesn't get used again except to rollback, which I think the behavior with the patch does effectively as the activity that happens on the connection after the implicit rollback happens in a new local transaction. So I will go ahead and check in the latest patch with the expanded test. $ java -Duser=xxxxx -Dpassword=xxx ReproDerby5552DB2 Got Expected Lock Timeout Exception DB2 SQL Error: SQLCODE=-911, SQLSTATE=40001, SQLERRMC=68, DRIVER=4.12.55 Connection ok. got right value Exception trying to check conn2 after lock timeout but before explicit rollback com.ibm.db2.jcc.am.SqlException: [jcc] [t4] [10342] [11669] [4.12.55] Application must execute a rollback. The unit of work has already been rolled back in the database but other resource managers involved in this unit of work might not. To ens ure integrity of this application, all SQL requests will be rejected until the application issues a rollback. ERRORCODE =-4497, SQLSTATE=null at com.ibm.db2.jcc.am.hd.a(hd.java:660) at com.ibm.db2.jcc.am.hd.a(hd.java:60) at com.ibm.db2.jcc.am.hd.a(hd.java:75) at com.ibm.db2.jcc.am.o.f(o.java:537) at com.ibm.db2.jcc.am.o.a(o.java:495) at com.ibm.db2.jcc.am.Sqlca.getJDBCMessage(Sqlca.java:334) at com.ibm.db2.jcc.am.SqlExceptionContainer.getMessage(SqlExceptionContainer.java:78) at com.ibm.db2.jcc.am.SqlTransactionRollbackException.getMessage(SqlTransactionRollbackException.java:52) at ReproDerby5552DB2.main(ReproDerby5552DB2.java:61) Did not get exception on end as with Derby Got Exception checking conn2 after end com.ibm.db2.jcc.am.SqlException: [jcc] [t4] [2041] [11392] [4.12.55] Error executing XAResource.end(). Server returned XA_R BTIMEOUT. ERRORCODE=-4203, SQLSTATE=null at com.ibm.db2.jcc.am.hd.a(hd.java:660) at com.ibm.db2.jcc.am.hd.a(hd.java:60) at com.ibm.db2.jcc.am.hd.a(hd.java:94) at com.ibm.db2.jcc.t4.zb.a(zb.java:2755) at com.ibm.db2.jcc.t4.c.Xb(c.java:271) at com.ibm.db2.jcc.am.o.g(o.java:340) at com.ibm.db2.jcc.t4.a.g(a.java:631) at com.ibm.db2.jcc.am.o.a(o.java:214) at com.ibm.db2.jcc.am.mn.a(mn.java:3073) at com.ibm.db2.jcc.am.mn.a(mn.java:686) at com.ibm.db2.jcc.am.mn.executeQuery(mn.java:670) at ReproDerby5552DB2.checkConn(ReproDerby5552DB2.java:106) at ReproDerby5552DB2.main(ReproDerby5552DB2.java:86) Rolling back xid2 Exception in thread "main" com.ibm.db2.jcc.am.XaException: [jcc] [t4] [2041] [12326] [4.12.55] Error executing XAResource.ro llback(). Server returned XAER_NOTA. ERRORCODE=-4203, SQLSTATE=null at com.ibm.db2.jcc.am.hd.c(hd.java:453) at com.ibm.db2.jcc.t4.zb.b(zb.java:2773) at com.ibm.db2.jcc.t4.ac.b(ac.java:1546) at com.ibm.db2.jcc.t4.ac.a(ac.java:1326) at com.ibm.db2.jcc.t4.ac.a(ac.java:1321) at com.ibm.db2.jcc.t4.ac.rollback(ac.java:1310) at ReproDerby5552DB2.main(ReproDerby5552DB2.java:94) kmarsden@IBM-JDPM42DBIO2 ~/repro/derby-5552 $ $ java ReproDerby5552DB2 Got Expected Lock Timeout Exception DB2 SQL Error: SQLCODE=-911, SQLSTATE=40001, SQLERRMC=6 Connection ok. got right value Exception trying to check conn2 after lock timeout but before explicit rollback com.ibm.db2.jcc.am.SqlException: [jcc] [t4] [10342] [11669] [4.12.55] Application must execute has already been rolled back in the database but other resource managers involved in this u ure integrity of this application, all SQL requests will be rejected until the application =-4497, SQLSTATE=null at com.ibm.db2.jcc.am.hd.a(hd.java:660) at com.ibm.db2.jcc.am.hd.a(hd.java:60) at com.ibm.db2.jcc.am.hd.a(hd.java:75) at com.ibm.db2.jcc.am.o.f(o.java:537) at com.ibm.db2.jcc.am.o.a(o.java:495) at com.ibm.db2.jcc.am.Sqlca.getJDBCMessage(Sqlca.java:334) at com.ibm.db2.jcc.am.SqlExceptionContainer.getMessage(SqlExceptionContainer.java:7 at com.ibm.db2.jcc.am.SqlTransactionRollbackException.getMessage(SqlTransactionRoll at ReproDerby5552DB2.main(ReproDerby5552DB2.java:61) Did not get exception on end as with Derby Rolling back xid2 Exception in thread "main" com.ibm.db2.jcc.am.XaException: [jcc] [t4] [2041] [12326] [4.12.55] llback(). Server returned XAER_NOTA. ERRORCODE=-4203, SQLSTATE=null at com.ibm.db2.jcc.am.hd.c(hd.java:453) at com.ibm.db2.jcc.t4.zb.b(zb.java:2773) at com.ibm.db2.jcc.t4.ac.b(ac.java:1546) at com.ibm.db2.jcc.t4.ac.a(ac.java:1326) at com.ibm.db2.jcc.t4.ac.a(ac.java:1321) at com.ibm.db2.jcc.t4.ac.rollback(ac.java:1310) at ReproDerby5552DB2.main(ReproDerby5552DB2.java:86) Caused by: com.ibm.db2.jcc.am.XaException: [jcc] [t4] [2041] [12326] [4.12.55] Error executing urned XA_RBTIMEOUT. ERRORCODE=-4203, SQLSTATE=null at com.ibm.db2.jcc.am.hd.c(hd.java:453) at com.ibm.db2.jcc.t4.zb.b(zb.java:2773) at com.ibm.db2.jcc.t4.ac.b(ac.java:1520) ... 4 more kmarsden@IBM-JDPM42DBIO2 ~/repro/derby-5552 $
          Hide
          Kathey Marsden added a comment -

          Change fix version to 10.9. Leaving open for backport.

          Show
          Kathey Marsden added a comment - Change fix version to 10.9. Leaving open for backport.
          Hide
          Kathey Marsden added a comment -

          This issue has always been here. Changing affects version accordingly.

          Show
          Kathey Marsden added a comment - This issue has always been here. Changing affects version accordingly.
          Hide
          Kathey Marsden added a comment -

          link to testing task

          Show
          Kathey Marsden added a comment - link to testing task
          Hide
          Kathey Marsden added a comment -

          I backported the fix and test to 10.5. Reassigning to Brett who provided the code patch.
          Filed DERBY-5633 for the testing.

          Show
          Kathey Marsden added a comment - I backported the fix and test to 10.5. Reassigning to Brett who provided the code patch. Filed DERBY-5633 for the testing.
          Hide
          Kathey Marsden added a comment -

          Resolving. I only went back as far is 10.5 but it certainly would be appropriate to backport further if needed.

          Thanks Brett for tthe fix!

          Show
          Kathey Marsden added a comment - Resolving. I only went back as far is 10.5 but it certainly would be appropriate to backport further if needed. Thanks Brett for tthe fix!
          Hide
          Knut Anders Hatlen added a comment -

          [bulk update: close all resolved issues that haven't had any activity the last year]

          Show
          Knut Anders Hatlen added a comment - [bulk update: close all resolved issues that haven't had any activity the last year]

            People

            • Assignee:
              Brett Wooldridge
              Reporter:
              Brett Bergquist
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development