Derby
  1. Derby
  2. DERBY-3980

Conflicting select then update with REPEATABLE_READ gives lock timeout instead of deadlock

    Details

    • Urgency:
      Normal
    • Issue & fix info:
      High Value Fix

      Description

      The attached program TryTimeout.java should detect a deadlock but instead throws a lock timeout exception. The program has two threads that attempt:

      threadConnection.setAutoCommit(false);
      /* set isolation level to repeatable read */
      threadConnection.setTransactionIsolation(Connection.TRANSACTION_REPEATABLE_READ);

      ResultSet rs = stmt.executeQuery("select * from t where i = 456");
      while (rs.next());
      stmt.executeUpdate("update t set i = 456 where i = 456");
      threadConnection.commit();

      This gives SQLState 40001 (deadlock) with DB2 but a lock timeout with Derby.

      1. derby.log
        19 kB
        Kathey Marsden
      2. derby.log.10_1
        19 kB
        Kathey Marsden
      3. derby-3980_javadoc_and_test_diff.txt
        8 kB
        Kathey Marsden
      4. derby-3980-1a.diff
        17 kB
        Knut Anders Hatlen
      5. javacore.20081209.092827.9800.txt
        177 kB
        Kathey Marsden
      6. LiveLockTest_diff.txt
        8 kB
        Kathey Marsden
      7. LiveLockTest_with_Deadlock_look_diff.txt
        2 kB
        Kathey Marsden
      8. LockTimeoutWithInserts.java
        8 kB
        Kathey Marsden
      9. TryTimeout.java
        4 kB
        Kathey Marsden
      10. TryTimeout2.java
        9 kB
        Kathey Marsden
      11. TryTimeout2.out.10_1.deadlock
        7 kB
        Kathey Marsden
      12. TryTimeout2.out.10_1.deadlock
        5 kB
        Kathey Marsden
      13. TryTimeout2.out.10_1.locktimeout
        19 kB
        Kathey Marsden
      14. TryTimeout2.out.10_1.locktimeout
        17 kB
        Kathey Marsden

        Issue Links

          Activity

          Hide
          Kathey Marsden added a comment -

          Attaching the 10.1 derby.log which shows two waiters instead of one.

          Also attaching the javacore from the trunk run as the threads are waiting to show where they are waiting.

          Show
          Kathey Marsden added a comment - Attaching the 10.1 derby.log which shows two waiters instead of one. Also attaching the javacore from the trunk run as the threads are waiting to show where they are waiting.
          Hide
          Mike Matrigali added a comment -

          Looking at the error log there are a few wierd things:
          1) I would expect to see 2 waiting locks.
          2) Given that it should be repeatable read, and there is no index I would expect the lock list to contain row locks for
          every row in the table. Maybe it would useful to dump the lock table after the select using the lock vti just to see if
          it matches. Maybe this will lead to whether the issue is the actual lock table, or just some issue with contruction the
          error message. (1, 9) is likely the row in question. The other rows are likely (1, 6), (1,7), and (1,8). The id's before
          6 are reserved for special use so real rows start at 6 and increase.
          3) Just to be complete it would be good to verify that both threads are in a wait state in the lock, I would just dump the
          stacks using signal or cntrl key depending on jvm while it is stalled. I looked at program and can't imagine anything else
          going on.
          4) Anyone who wants to look at this one, the code is pretty isolated. The interesting code is in
          opensource/java/engine/org/apache/derby/impl/services/locks/LockSet.java

          5) some debugging steps I would take would include:
          o verify if a deadlock search is being made or not?
          o run in sane mode and turn on the SanityManager.DEBUG_ON("DeadlockTrace") flag.
          o If a deadlock search is being made, use the sanity code to dump out the lock table when the deadlock search is done and
          misses the deadlock.
          o see if the bug still happens if there is only 1 row in the table.

          ERROR 40XL2: A lock could not be obtained within the time requested. The lockTable dump is:
          2008-12-09 00:50:40.640 GMT
          XID |TYPE |MODE|LOCKCOUNT|LOCKNAME |STATE|TABLETYPE / LOCKOBJ |INDEXNAME / CONTAINER_ID / (MODE for LATCH only) |TABLENAME / CONGLOM_ID |
          ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

              • The following row is the victim ***
                186 |ROW |U |0 |(1,9) |WAIT |T |NULL |T |
              • The above row is the victim ***
                184 |ROW |S |1 |(1,9) |GRANT|T |NULL |T |
                186 |ROW |S |1 |(1,9) |GRANT|T |NULL |T |
                184 |ROW |U |1 |(1,9) |GRANT|T |NULL |T |
                184 |TABLE |IS |1 |Tablelock |GRANT|T |NULL |T |
                186 |TABLE |IS |1 |Tablelock |GRANT|T |NULL |T |
                184 |TABLE |IX |2 |Tablelock |GRANT|T |NULL |T |
                186 |TABLE |IX |2 |Tablelock |GRANT|T |NULL |T |
                ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
          Show
          Mike Matrigali added a comment - Looking at the error log there are a few wierd things: 1) I would expect to see 2 waiting locks. 2) Given that it should be repeatable read, and there is no index I would expect the lock list to contain row locks for every row in the table. Maybe it would useful to dump the lock table after the select using the lock vti just to see if it matches. Maybe this will lead to whether the issue is the actual lock table, or just some issue with contruction the error message. (1, 9) is likely the row in question. The other rows are likely (1, 6), (1,7), and (1,8). The id's before 6 are reserved for special use so real rows start at 6 and increase. 3) Just to be complete it would be good to verify that both threads are in a wait state in the lock, I would just dump the stacks using signal or cntrl key depending on jvm while it is stalled. I looked at program and can't imagine anything else going on. 4) Anyone who wants to look at this one, the code is pretty isolated. The interesting code is in opensource/java/engine/org/apache/derby/impl/services/locks/LockSet.java 5) some debugging steps I would take would include: o verify if a deadlock search is being made or not? o run in sane mode and turn on the SanityManager.DEBUG_ON("DeadlockTrace") flag. o If a deadlock search is being made, use the sanity code to dump out the lock table when the deadlock search is done and misses the deadlock. o see if the bug still happens if there is only 1 row in the table. ERROR 40XL2: A lock could not be obtained within the time requested. The lockTable dump is: 2008-12-09 00:50:40.640 GMT XID |TYPE |MODE|LOCKCOUNT|LOCKNAME |STATE|TABLETYPE / LOCKOBJ |INDEXNAME / CONTAINER_ID / (MODE for LATCH only) |TABLENAME / CONGLOM_ID | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- The following row is the victim *** 186 |ROW |U |0 |(1,9) |WAIT |T |NULL |T | The above row is the victim *** 184 |ROW |S |1 |(1,9) |GRANT|T |NULL |T | 186 |ROW |S |1 |(1,9) |GRANT|T |NULL |T | 184 |ROW |U |1 |(1,9) |GRANT|T |NULL |T | 184 |TABLE |IS |1 |Tablelock |GRANT|T |NULL |T | 186 |TABLE |IS |1 |Tablelock |GRANT|T |NULL |T | 184 |TABLE |IX |2 |Tablelock |GRANT|T |NULL |T | 186 |TABLE |IX |2 |Tablelock |GRANT|T |NULL |T | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
          Hide
          Mike Matrigali added a comment -

          It may also be interesting to see if this is somehow related to U lock vs. X lock. A similar test to one posted but changing the update to a delete would be interesting.

          Show
          Mike Matrigali added a comment - It may also be interesting to see if this is somehow related to U lock vs. X lock. A similar test to one posted but changing the update to a delete would be interesting.
          Hide
          Kathey Marsden added a comment -

          I pursued some of Mike's debugging steps.
          o verify if a deadlock search is being made or not?
          The code enters Deadlock.look() so if that means it is doing a deadlock search then we are.
          o run in sane mode and turn on the SanityManager.DEBUG_ON("DeadlockTrace") flag.
          I set this but don't see any output difference. In my calling program I just called SanityManager.DEBUG_ON("DeadlockTrace").
          Am I doing something wrong?
          o If a deadlock search is being made, use the sanity code to dump out the lock table when the deadlock search is done and
          misses the deadlock.
          Haven't tried this yet.
          o see if the bug still happens if there is only 1 row in the table.
          It still happens with one row.
          o see if it still happens with a delete
          The same problem exists with delete.
          o verify two threads are waiting on lock with <ctrl> break.
          Here are the two traces with the program set to do a delete instead of update.
          4XESTACKTRACE at java/lang/Object.wait(Native Method)
          4XESTACKTRACE at java/lang/Object.wait(Object.java:231)
          4XESTACKTRACE at org/apache/derby/impl/services/daemon/BasicDaemon.rest(BasicDaemon.java:571)
          4XESTACKTRACE at org/apache/derby/impl/services/daemon/BasicDaemon.run(BasicDaemon.java:388)
          4XESTACKTRACE at java/lang/Thread.run(Thread.java:803)
          3XMTHREADINFO "Thread-5" (TID:0x4203EC00, sys_thread_t:0x42286868, state:CW, native ID:0x00001D94) prio=5
          4XESTACKTRACE at java/lang/Object.wait(Native Method)
          4XESTACKTRACE at java/lang/Object.wait(Object.java:231)
          4XESTACKTRACE at org/apache/derby/impl/services/locks/ActiveLock.waitForGrant(ActiveLock.java:118)
          4XESTACKTRACE at org/apache/derby/impl/services/locks/ConcurrentLockSet.lockObject(ConcurrentLockSet.java:463(Compiled Code))
          4XESTACKTRACE at org/apache/derby/impl/services/locks/AbstractPool.lockObject(AbstractPool.java:119)
          4XESTACKTRACE at org/apache/derby/impl/services/locks/ConcurrentPool.lockObject(ConcurrentPool.java:28)
          4XESTACKTRACE at org/apache/derby/impl/store/raw/xact/RowLocking3.lockRecordForRead(RowLocking3.java:187)
          4XESTACKTRACE at org/apache/derby/impl/store/access/conglomerate/OpenConglomerate.lockPositionForRead(OpenConglomerate.java:436)
          4XESTACKTRACE at org/apache/derby/impl/store/access/conglomerate/GenericScanController.fetchRows(GenericScanController.java:683)
          4XESTACKTRACE at org/apache/derby/impl/store/access/heap/HeapScan.fetchNext(HeapScan.java:240)
          4XESTACKTRACE at org/apache/derby/impl/sql/execute/TableScanResultSet.getNextRowCore(TableScanResultSet.java:680)
          4XESTACKTRACE at org/apache/derby/impl/sql/execute/ProjectRestrictResultSet.getNextRowCore(ProjectRestrictResultSet.java:255)
          4XESTACKTRACE at org/apache/derby/impl/sql/execute/DMLWriteResultSet.getNextRowCore(DMLWriteResultSet.java:127)
          4XESTACKTRACE at org/apache/derby/impl/sql/execute/DeleteResultSet.setup(DeleteResultSet.java:252)
          4XESTACKTRACE at org/apache/derby/impl/sql/execute/DeleteResultSet.open(DeleteResultSet.java:136)
          4XESTACKTRACE at org/apache/derby/impl/sql/GenericPreparedStatement.executeStmt(GenericPreparedStatement.java:416)
          4XESTACKTRACE at org/apache/derby/impl/sql/GenericPreparedStatement.execute(GenericPreparedStatement.java:297)
          4XESTACKTRACE at org/apache/derby/impl/jdbc/EmbedStatement.executeStatement(EmbedStatement.java:1235)
          4XESTACKTRACE at org/apache/derby/impl/jdbc/EmbedStatement.execute(EmbedStatement.java:625)
          4XESTACKTRACE at org/apache/derby/impl/jdbc/EmbedStatement.executeUpdate(EmbedStatement.java:175)
          4XESTACKTRACE at TryTimeout.threadWorker(TryTimeout.java:125)
          4XESTACKTRACE at TryTimeout.access$000(TryTimeout.java:5)
          4XESTACKTRACE at TryTimeout$1.run(TryTimeout.java:69)
          4XESTACKTRACE at java/lang/Thread.run(Thread.java:803)
          3XMTHREADINFO "Thread-6" (TID:0x4203F000, sys_thread_t:0x42286ABC, state:CW, native ID:0x00001C5C) prio=5
          4XESTACKTRACE at java/lang/Object.wait(Native Method)
          4XESTACKTRACE at java/lang/Object.wait(Object.java:231)
          4XESTACKTRACE at org/apache/derby/impl/services/locks/ActiveLock.waitForGrant(ActiveLock.java:118)
          4XESTACKTRACE at org/apache/derby/impl/services/locks/ConcurrentLockSet.lockObject(ConcurrentLockSet.java:463(Compiled Code))
          4XESTACKTRACE at org/apache/derby/impl/services/locks/AbstractPool.lockObject(AbstractPool.java:119)
          4XESTACKTRACE at org/apache/derby/impl/services/locks/ConcurrentPool.lockObject(ConcurrentPool.java:28)
          4XESTACKTRACE at org/apache/derby/impl/store/raw/xact/RowLocking3.lockRecordForWrite(RowLocking3.java:248)
          4XESTACKTRACE at org/apache/derby/impl/store/access/conglomerate/OpenConglomerate.lockPositionForWrite(OpenConglomerate.java:541)
          4XESTACKTRACE at org/apache/derby/impl/store/access/conglomerate/GenericConglomerateController.delete(GenericConglomerateController.java:179)
          4XESTACKTRACE at org/apache/derby/impl/sql/execute/RowChangerImpl.deleteRow(RowChangerImpl.java:478)
          4XESTACKTRACE at org/apache/derby/impl/sql/execute/DeleteResultSet.collectAffectedRows(DeleteResultSet.java:405)
          4XESTACKTRACE at org/apache/derby/impl/sql/execute/DeleteResultSet.open(DeleteResultSet.java:137)
          4XESTACKTRACE at org/apache/derby/impl/sql/GenericPreparedStatement.executeStmt(GenericPreparedStatement.java:416)
          4XESTACKTRACE at org/apache/derby/impl/sql/GenericPreparedStatement.execute(GenericPreparedStatement.java:297)
          4XESTACKTRACE at org/apache/derby/impl/jdbc/EmbedStatement.executeStatement(EmbedStatement.java:1235)
          4XESTACKTRACE at org/apache/derby/impl/jdbc/EmbedStatement.execute(EmbedStatement.java:625)
          4XESTACKTRACE at org/apache/derby/impl/jdbc/EmbedStatement.executeUpdate(EmbedStatement.java:175)
          4XESTACKTRACE at TryTimeout.threadWorker(TryTimeout.java:125)
          4XESTACKTRACE at TryTimeout.access$000(TryTimeout.java:5)
          4XESTACKTRACE at TryTimeout$1.run(TryTimeout.java:69)
          4XESTACKTRACE at java/lang/Thread.run(Thread.java:803)

          Show
          Kathey Marsden added a comment - I pursued some of Mike's debugging steps. o verify if a deadlock search is being made or not? The code enters Deadlock.look() so if that means it is doing a deadlock search then we are. o run in sane mode and turn on the SanityManager.DEBUG_ON("DeadlockTrace") flag. I set this but don't see any output difference. In my calling program I just called SanityManager.DEBUG_ON("DeadlockTrace"). Am I doing something wrong? o If a deadlock search is being made, use the sanity code to dump out the lock table when the deadlock search is done and misses the deadlock. Haven't tried this yet. o see if the bug still happens if there is only 1 row in the table. It still happens with one row. o see if it still happens with a delete The same problem exists with delete. o verify two threads are waiting on lock with <ctrl> break. Here are the two traces with the program set to do a delete instead of update. 4XESTACKTRACE at java/lang/Object.wait(Native Method) 4XESTACKTRACE at java/lang/Object.wait(Object.java:231) 4XESTACKTRACE at org/apache/derby/impl/services/daemon/BasicDaemon.rest(BasicDaemon.java:571) 4XESTACKTRACE at org/apache/derby/impl/services/daemon/BasicDaemon.run(BasicDaemon.java:388) 4XESTACKTRACE at java/lang/Thread.run(Thread.java:803) 3XMTHREADINFO "Thread-5" (TID:0x4203EC00, sys_thread_t:0x42286868, state:CW, native ID:0x00001D94) prio=5 4XESTACKTRACE at java/lang/Object.wait(Native Method) 4XESTACKTRACE at java/lang/Object.wait(Object.java:231) 4XESTACKTRACE at org/apache/derby/impl/services/locks/ActiveLock.waitForGrant(ActiveLock.java:118) 4XESTACKTRACE at org/apache/derby/impl/services/locks/ConcurrentLockSet.lockObject(ConcurrentLockSet.java:463(Compiled Code)) 4XESTACKTRACE at org/apache/derby/impl/services/locks/AbstractPool.lockObject(AbstractPool.java:119) 4XESTACKTRACE at org/apache/derby/impl/services/locks/ConcurrentPool.lockObject(ConcurrentPool.java:28) 4XESTACKTRACE at org/apache/derby/impl/store/raw/xact/RowLocking3.lockRecordForRead(RowLocking3.java:187) 4XESTACKTRACE at org/apache/derby/impl/store/access/conglomerate/OpenConglomerate.lockPositionForRead(OpenConglomerate.java:436) 4XESTACKTRACE at org/apache/derby/impl/store/access/conglomerate/GenericScanController.fetchRows(GenericScanController.java:683) 4XESTACKTRACE at org/apache/derby/impl/store/access/heap/HeapScan.fetchNext(HeapScan.java:240) 4XESTACKTRACE at org/apache/derby/impl/sql/execute/TableScanResultSet.getNextRowCore(TableScanResultSet.java:680) 4XESTACKTRACE at org/apache/derby/impl/sql/execute/ProjectRestrictResultSet.getNextRowCore(ProjectRestrictResultSet.java:255) 4XESTACKTRACE at org/apache/derby/impl/sql/execute/DMLWriteResultSet.getNextRowCore(DMLWriteResultSet.java:127) 4XESTACKTRACE at org/apache/derby/impl/sql/execute/DeleteResultSet.setup(DeleteResultSet.java:252) 4XESTACKTRACE at org/apache/derby/impl/sql/execute/DeleteResultSet.open(DeleteResultSet.java:136) 4XESTACKTRACE at org/apache/derby/impl/sql/GenericPreparedStatement.executeStmt(GenericPreparedStatement.java:416) 4XESTACKTRACE at org/apache/derby/impl/sql/GenericPreparedStatement.execute(GenericPreparedStatement.java:297) 4XESTACKTRACE at org/apache/derby/impl/jdbc/EmbedStatement.executeStatement(EmbedStatement.java:1235) 4XESTACKTRACE at org/apache/derby/impl/jdbc/EmbedStatement.execute(EmbedStatement.java:625) 4XESTACKTRACE at org/apache/derby/impl/jdbc/EmbedStatement.executeUpdate(EmbedStatement.java:175) 4XESTACKTRACE at TryTimeout.threadWorker(TryTimeout.java:125) 4XESTACKTRACE at TryTimeout.access$000(TryTimeout.java:5) 4XESTACKTRACE at TryTimeout$1.run(TryTimeout.java:69) 4XESTACKTRACE at java/lang/Thread.run(Thread.java:803) 3XMTHREADINFO "Thread-6" (TID:0x4203F000, sys_thread_t:0x42286ABC, state:CW, native ID:0x00001C5C) prio=5 4XESTACKTRACE at java/lang/Object.wait(Native Method) 4XESTACKTRACE at java/lang/Object.wait(Object.java:231) 4XESTACKTRACE at org/apache/derby/impl/services/locks/ActiveLock.waitForGrant(ActiveLock.java:118) 4XESTACKTRACE at org/apache/derby/impl/services/locks/ConcurrentLockSet.lockObject(ConcurrentLockSet.java:463(Compiled Code)) 4XESTACKTRACE at org/apache/derby/impl/services/locks/AbstractPool.lockObject(AbstractPool.java:119) 4XESTACKTRACE at org/apache/derby/impl/services/locks/ConcurrentPool.lockObject(ConcurrentPool.java:28) 4XESTACKTRACE at org/apache/derby/impl/store/raw/xact/RowLocking3.lockRecordForWrite(RowLocking3.java:248) 4XESTACKTRACE at org/apache/derby/impl/store/access/conglomerate/OpenConglomerate.lockPositionForWrite(OpenConglomerate.java:541) 4XESTACKTRACE at org/apache/derby/impl/store/access/conglomerate/GenericConglomerateController.delete(GenericConglomerateController.java:179) 4XESTACKTRACE at org/apache/derby/impl/sql/execute/RowChangerImpl.deleteRow(RowChangerImpl.java:478) 4XESTACKTRACE at org/apache/derby/impl/sql/execute/DeleteResultSet.collectAffectedRows(DeleteResultSet.java:405) 4XESTACKTRACE at org/apache/derby/impl/sql/execute/DeleteResultSet.open(DeleteResultSet.java:137) 4XESTACKTRACE at org/apache/derby/impl/sql/GenericPreparedStatement.executeStmt(GenericPreparedStatement.java:416) 4XESTACKTRACE at org/apache/derby/impl/sql/GenericPreparedStatement.execute(GenericPreparedStatement.java:297) 4XESTACKTRACE at org/apache/derby/impl/jdbc/EmbedStatement.executeStatement(EmbedStatement.java:1235) 4XESTACKTRACE at org/apache/derby/impl/jdbc/EmbedStatement.execute(EmbedStatement.java:625) 4XESTACKTRACE at org/apache/derby/impl/jdbc/EmbedStatement.executeUpdate(EmbedStatement.java:175) 4XESTACKTRACE at TryTimeout.threadWorker(TryTimeout.java:125) 4XESTACKTRACE at TryTimeout.access$000(TryTimeout.java:5) 4XESTACKTRACE at TryTimeout$1.run(TryTimeout.java:69) 4XESTACKTRACE at java/lang/Thread.run(Thread.java:803)
          Hide
          Kathey Marsden added a comment -

          I got the extra DeadlockTrace output on by putting this in my derby.properties.
          derby.debug.true=DeadlockTrace

          Show
          Kathey Marsden added a comment - I got the extra DeadlockTrace output on by putting this in my derby.properties. derby.debug.true=DeadlockTrace
          Hide
          Kathey Marsden added a comment -

          I am debugging this on 10.1 since at least there the Lock table showed two waiters on lock timeout.
          I changed the test case a bit to have one of the threads sleep for 10 seconds after the select and also to have the program dump the lock table every ten seconds. I found that after adding the sleep I usually get a deadlock instead of the timeout.
          Expected exception:A lock could not be obtained due to a deadlock, cycle of locks and waiters is:
          Lock : ROW, T, (1,7)
          Waiting XID :

          {254, U}

          , DB2ADMIN, delete from t where i = 456
          Granted XID :

          {256, S}
          Lock : ROW, T, (1,7)
          Waiting XID : {256, X} , DB2ADMIN, delete from t where i = 456
          Granted XID : {256, S}

          ,

          {254, S}

          but still I sometimes get the lock timeout. I also saw that at the time of deadlock checking, the lock tables for both the lock timeout run and the deadlock run show the same lock table which should deadlock.
          xid |username|trantype|type |cnt|mode|tabname |lockname |state|status
          ---------------------------------------------------------------------------------
          279 |UserTran|TABLE |1 |IS |T |Tablelock |GRANT|ACTIVE
          281 |UserTran|TABLE |1 |IS |T |Tablelock |GRANT|ACTIVE
          279 |UserTran|TABLE |2 |IX |T |Tablelock |GRANT|ACTIVE
          281 |UserTran|TABLE |2 |IX |T |Tablelock |GRANT|ACTIVE
          279 |UserTran|ROW |1 |S |T |(1,7) |GRANT|ACTIVE
          281 |UserTran|ROW |1 |S |T |(1,7) |GRANT|ACTIVE
          279 |UserTran|ROW |0 |U |T |(1,7) |WAIT|ACTIVE
          281 |UserTran|ROW |1 |U |T |(1,7) |GRANT|ACTIVE
          281 |UserTran|ROW |0 |X |T |(1,7) |WAIT|ACTIVE

          Mike mentioned to me that he thought it might be a problem with both threads doing the deadlock check at the same time, but it seems like with the sleep that couldn't be the problem. Let me know if you would like me to post the updated program.

          Show
          Kathey Marsden added a comment - I am debugging this on 10.1 since at least there the Lock table showed two waiters on lock timeout. I changed the test case a bit to have one of the threads sleep for 10 seconds after the select and also to have the program dump the lock table every ten seconds. I found that after adding the sleep I usually get a deadlock instead of the timeout. Expected exception:A lock could not be obtained due to a deadlock, cycle of locks and waiters is: Lock : ROW, T, (1,7) Waiting XID : {254, U} , DB2ADMIN, delete from t where i = 456 Granted XID : {256, S} Lock : ROW, T, (1,7) Waiting XID : {256, X} , DB2ADMIN, delete from t where i = 456 Granted XID : {256, S} , {254, S} but still I sometimes get the lock timeout. I also saw that at the time of deadlock checking, the lock tables for both the lock timeout run and the deadlock run show the same lock table which should deadlock. xid |username|trantype|type |cnt|mode|tabname |lockname |state|status --------------------------------------------------------------------------------- 279 |UserTran|TABLE |1 |IS |T |Tablelock |GRANT|ACTIVE 281 |UserTran|TABLE |1 |IS |T |Tablelock |GRANT|ACTIVE 279 |UserTran|TABLE |2 |IX |T |Tablelock |GRANT|ACTIVE 281 |UserTran|TABLE |2 |IX |T |Tablelock |GRANT|ACTIVE 279 |UserTran|ROW |1 |S |T |(1,7) |GRANT|ACTIVE 281 |UserTran|ROW |1 |S |T |(1,7) |GRANT|ACTIVE 279 |UserTran|ROW |0 |U |T |(1,7) |WAIT|ACTIVE 281 |UserTran|ROW |1 |U |T |(1,7) |GRANT|ACTIVE 281 |UserTran|ROW |0 |X |T |(1,7) |WAIT|ACTIVE Mike mentioned to me that he thought it might be a problem with both threads doing the deadlock check at the same time, but it seems like with the sleep that couldn't be the problem. Let me know if you would like me to post the updated program.
          Hide
          Kathey Marsden added a comment -

          Here is the updated program in case someone is interested, TryTimeout2.java . Changes from TryTimeout.java are
          1) Second thread sleeps for 10 second after select.
          2) Only has one row in the table.
          3) Does delete instead of update.
          4) Starts a thread that shows a lock table dump every 10 seconds.
          5) This one fails sometimes with deadlock and sometimes with lock timeout.

          Show
          Kathey Marsden added a comment - Here is the updated program in case someone is interested, TryTimeout2.java . Changes from TryTimeout.java are 1) Second thread sleeps for 10 second after select. 2) Only has one row in the table. 3) Does delete instead of update. 4) Starts a thread that shows a lock table dump every 10 seconds. 5) This one fails sometimes with deadlock and sometimes with lock timeout.
          Hide
          Knut Anders Hatlen added a comment -

          Just a guess here, since I haven't checked the details. In the inner
          for loop of Deadlock.look(), there's some code to prevent false
          positives for self-deadlocks:

          // We could be seeing a situation here like
          // Granted T1

          {S}, T2{S}

          // Waiting T1

          {X}

          - deadlock checking on this
          //
          // In this case it's not a deadlock, although it
          // depends on the locking policy of the Lockable. E.g.
          .
          .
          .
          if (lock.canSkip)

          { // not a deadlock ... chain.push(space); // set up as rollback() expects. rollback(chain); continue outer; }

          I'm wondering if this perhaps is too broad. Both the transactions that
          are waiting for the X lock already have a shared lock on the row, so
          they'll both have lock.canSkip==true. We may be giving up finding a
          deadlock along that path too early.

          Show
          Knut Anders Hatlen added a comment - Just a guess here, since I haven't checked the details. In the inner for loop of Deadlock.look(), there's some code to prevent false positives for self-deadlocks: // We could be seeing a situation here like // Granted T1 {S}, T2{S} // Waiting T1 {X} - deadlock checking on this // // In this case it's not a deadlock, although it // depends on the locking policy of the Lockable. E.g. . . . if (lock.canSkip) { // not a deadlock ... chain.push(space); // set up as rollback() expects. rollback(chain); continue outer; } I'm wondering if this perhaps is too broad. Both the transactions that are waiting for the X lock already have a shared lock on the row, so they'll both have lock.canSkip==true. We may be giving up finding a deadlock along that path too early.
          Hide
          Kathey Marsden added a comment -

          I did the runs with some additional println's.
          1) Upon entering Deadlock.look, I print:
          System.out.println(Thread.currentThread().getName() + " Deadlock look");
          System.out.println(set.toDebugString());
          2) Upon entering the block of code Knut mentioned I print
          if (lock.canSkip) {
          System.out.println("rollback(chain). Not a deadlock");

          Here is the output of the run with deadlock TryTimeout2.out.10_1.deadlock and the run with timeout TryTimeout2.out.10_1.locktimeout.

          In both cases it first does a Deadlock.look for Thread0 which does not detect any deadlocks.

          On the Deadlock.look for Thread1 we see it pass through the "Not a deadlock" line for the deadlock case, not the lock timeout case, which is the reverse of what I would expect if that code were the problem.

          What I do notice is that always when we deadlock the lock[0] and lock[1] of the Thread1 LockControl objects are in the reverse order from the timeout run. Perhaps there is some problem in the logic when they are ordered in this way.

          A few general things I don't understand. Sorry for my newbieness.

          1) Why did we not detect the deadlock when Thread0 did Deadlock.look()? The locks were the same at that point.

          2) Why in the timeout case do we do a second Deadlock.look on Thread0?

          3) Should the order of the LockControl objects matter?

          4) Could someone give me a high level description of the arguments to Deadlock.look()
          static Object[] look(SinglePool factory, LockSet set, LockControl control, ActiveLock startingLock, byte deadlockWake) {
          There is no javadoc. I would be happy to add some if someone could post the information here.

          Show
          Kathey Marsden added a comment - I did the runs with some additional println's. 1) Upon entering Deadlock.look, I print: System.out.println(Thread.currentThread().getName() + " Deadlock look"); System.out.println(set.toDebugString()); 2) Upon entering the block of code Knut mentioned I print if (lock.canSkip) { System.out.println("rollback(chain). Not a deadlock"); Here is the output of the run with deadlock TryTimeout2.out.10_1.deadlock and the run with timeout TryTimeout2.out.10_1.locktimeout. In both cases it first does a Deadlock.look for Thread0 which does not detect any deadlocks. On the Deadlock.look for Thread1 we see it pass through the "Not a deadlock" line for the deadlock case, not the lock timeout case, which is the reverse of what I would expect if that code were the problem. What I do notice is that always when we deadlock the lock [0] and lock [1] of the Thread1 LockControl objects are in the reverse order from the timeout run. Perhaps there is some problem in the logic when they are ordered in this way. A few general things I don't understand. Sorry for my newbieness. 1) Why did we not detect the deadlock when Thread0 did Deadlock.look()? The locks were the same at that point. 2) Why in the timeout case do we do a second Deadlock.look on Thread0? 3) Should the order of the LockControl objects matter? 4) Could someone give me a high level description of the arguments to Deadlock.look() static Object[] look(SinglePool factory, LockSet set, LockControl control, ActiveLock startingLock, byte deadlockWake) { There is no javadoc. I would be happy to add some if someone could post the information here.
          Hide
          Mike Matrigali added a comment -

          I don't have answers, but more questions. It looks like there are also timing bugs going on. It might help if the program dumped more timing information along with print outs.

          For the timeout case, ignoring that it should be a deadlock, it seems like we should see a timeout after 60 seconds for Thread0 and then a timeout for Thread1 after 70 seconds, but it looks like they both timeout at the same time.

          Show
          Mike Matrigali added a comment - I don't have answers, but more questions. It looks like there are also timing bugs going on. It might help if the program dumped more timing information along with print outs. For the timeout case, ignoring that it should be a deadlock, it seems like we should see a timeout after 60 seconds for Thread0 and then a timeout for Thread1 after 70 seconds, but it looks like they both timeout at the same time.
          Hide
          Mike Matrigali added a comment -

          how about this as start for the javadoc:
          /**

          • Do a deadlock search.
          • <p>
          • Walk through the graph of all locks and search for cycles among
          • the waiting lock requests which would indicate a deadlock. A simple
          • deadlock cycle is where the granted locks of waiting compatibility
          • space A is blocking compatibility space B and space B holds locks causing
          • space A to wait.
          • <p>
          • Would be nice to get a better high level description of deadlock
          • search.
          • <p>
            *
          • @return The identifier to be used to open the conglomerate later.
            *
          • @param factory The locking system factory
          • @param set The complete lock table. A lock table is a hash
          • table keyed by a Lockable and with a LockControl as
          • the data element.
          • @param control A LockControl contains a reference to the item being
          • locked and doubly linked lists for the granted locks
          • and the waiting locks. The passed in value is the
          • lock that the caller was waiting on when woken up
          • to do the deadlock check.
          • @param startingLock represents the specific waiting lock request that
          • the caller has been waiting on, before just being
          • woken up to do this search.
          • @deadlockWake Either Constants.WAITING_LOCK_IN_WAIT, or
          • Constants.WAITING_LOCK_DEADLOCK. I am not sure
          • about this one.
            *
          • @exception StandardException Standard exception policy.
            **/
          Show
          Mike Matrigali added a comment - how about this as start for the javadoc: /** Do a deadlock search. <p> Walk through the graph of all locks and search for cycles among the waiting lock requests which would indicate a deadlock. A simple deadlock cycle is where the granted locks of waiting compatibility space A is blocking compatibility space B and space B holds locks causing space A to wait. <p> Would be nice to get a better high level description of deadlock search. <p> * @return The identifier to be used to open the conglomerate later. * @param factory The locking system factory @param set The complete lock table. A lock table is a hash table keyed by a Lockable and with a LockControl as the data element. @param control A LockControl contains a reference to the item being locked and doubly linked lists for the granted locks and the waiting locks. The passed in value is the lock that the caller was waiting on when woken up to do the deadlock check. @param startingLock represents the specific waiting lock request that the caller has been waiting on, before just being woken up to do this search. @deadlockWake Either Constants.WAITING_LOCK_IN_WAIT, or Constants.WAITING_LOCK_DEADLOCK. I am not sure about this one. * @exception StandardException Standard exception policy. **/
          Hide
          Mike Matrigali added a comment -

          > 1) Why did we not detect the deadlock when Thread0 did Deadlock.look()? The locks were the same at that point.
          To me this looks like another instance of the bug. Either thread searching should find the deadlock once they have both gone
          into waiting, which I believe is the case here. I believe knut's comment on canSkip may be relevant. I still don't understand the
          code using skip.

          > 3) Should the order of the LockControl objects matter?
          If things are working right deadlock detection should find it no matter what, the order though might change which one is picked as victim.

          Show
          Mike Matrigali added a comment - > 1) Why did we not detect the deadlock when Thread0 did Deadlock.look()? The locks were the same at that point. To me this looks like another instance of the bug. Either thread searching should find the deadlock once they have both gone into waiting, which I believe is the case here. I believe knut's comment on canSkip may be relevant. I still don't understand the code using skip. > 3) Should the order of the LockControl objects matter? If things are working right deadlock detection should find it no matter what, the order though might change which one is picked as victim.
          Hide
          Kathey Marsden added a comment -

          Maybe the ordering was a red herring. Here is another run with timestamps and also with printing out index:" + index +" chain.size():" + chain.size() + " chain.indexOf(grants):" + chain.indexOf(grants) before we evaluate whether to skip.

          Here the order of the LockSet is the same for both runs.

          The timestamp for the exceptions in the output is off because the exceptions are accumulated and then print, but the derby.log does show the two lock timeout exceptions with almost the same timestamp so there does seem to be a bug there.
          I will try to make a simpler test case and file a bug for that.

          Show
          Kathey Marsden added a comment - Maybe the ordering was a red herring. Here is another run with timestamps and also with printing out index:" + index +" chain.size():" + chain.size() + " chain.indexOf(grants):" + chain.indexOf(grants) before we evaluate whether to skip. Here the order of the LockSet is the same for both runs. The timestamp for the exceptions in the output is off because the exceptions are accumulated and then print, but the derby.log does show the two lock timeout exceptions with almost the same timestamp so there does seem to be a bug there. I will try to make a simpler test case and file a bug for that.
          Hide
          Mike Matrigali added a comment -

          I've stared at the code below and am not clear what it is trying to do. Main questions are
          is it just trying to catch self waits?
          and why is canSkip useful to call, what is the code counting on canSkip to tell it?

          The case in question, posed as the below comment is:
          Granted T0(S), T1(S), T0(U)
          Waiting T0(X)
          Waiting T1(U)

          What is the expected value of canSkip for the 2 waiting locks. Looking at the code I think both are going to have canSkip
          set to true to indicate that it is ok to grant them in front of other waiting locks, because the space already holds a lock that
          is blocking the waiting locks - but only if they are compatible with all other space granted locks.

          if (index != -1) {

          // We could be seeing a situation here like
          // Granted T1

          {S}, T2{S}

          // Waiting T1

          {X}

          - deadlock checking on this
          //
          // In this case it's not a deadlock, although it
          // depends on the locking policy of the Lockable. E.g.
          // Granted T1(latch)
          // Waiting T1(latch)
          // is a deadlock.
          //

          if ((index == (chain.size() - 1)) ||
          ((index == (chain.size() - 2))
          && (index == (chain.indexOf(grants) - 1)))) {

          // potential self deadlock, but probably not!
          ActiveLock lock = (ActiveLock) waiters.get(space);

          if (lock.canSkip)

          { // not a deadlock ... chain.push(space); // set up as rollback() expects. rollback(chain); continue outer; }

          }

          return Deadlock.handle(factory, chain, index, waiters, deadl

          Show
          Mike Matrigali added a comment - I've stared at the code below and am not clear what it is trying to do. Main questions are is it just trying to catch self waits? and why is canSkip useful to call, what is the code counting on canSkip to tell it? The case in question, posed as the below comment is: Granted T0(S), T1(S), T0(U) Waiting T0(X) Waiting T1(U) What is the expected value of canSkip for the 2 waiting locks. Looking at the code I think both are going to have canSkip set to true to indicate that it is ok to grant them in front of other waiting locks, because the space already holds a lock that is blocking the waiting locks - but only if they are compatible with all other space granted locks. if (index != -1) { // We could be seeing a situation here like // Granted T1 {S}, T2{S} // Waiting T1 {X} - deadlock checking on this // // In this case it's not a deadlock, although it // depends on the locking policy of the Lockable. E.g. // Granted T1(latch) // Waiting T1(latch) // is a deadlock. // if ((index == (chain.size() - 1)) || ((index == (chain.size() - 2)) && (index == (chain.indexOf(grants) - 1)))) { // potential self deadlock, but probably not! ActiveLock lock = (ActiveLock) waiters.get(space); if (lock.canSkip) { // not a deadlock ... chain.push(space); // set up as rollback() expects. rollback(chain); continue outer; } } return Deadlock.handle(factory, chain, index, waiters, deadl
          Hide
          Kathey Marsden added a comment -

          Attaching a trunk patch derby-3980_javadoc_and_test_diff.txt to update the javadoc as Mike suggested and to add a test for this issue which of course currently fails. The test should be ready to add into the suite once the issue is fixed. Ran javadoc build and saw no warnings. I'll check this in later this afternoon.

          Show
          Kathey Marsden added a comment - Attaching a trunk patch derby-3980_javadoc_and_test_diff.txt to update the javadoc as Mike suggested and to add a test for this issue which of course currently fails. The test should be ready to add into the suite once the issue is fixed. Ran javadoc build and saw no warnings. I'll check this in later this afternoon.
          Hide
          Mike Matrigali added a comment -

          i am wondering if the problem is that lock manager not beeing able to tell the difference for the purpose of deadlock detection
          the case of a waitor on a waitor because of live lock, and a waitor on a waitor because both are blocked. lockSkip may come into
          it. In this case I think lockSkip is getting set for both waitors, which properly means it is ok for them to skip waitors as they already
          hold a lock blocking waitors, but they can't get granted now.

          granted: T0(S), T1(S), T0(U)
          waiting: T1(U), T0(X)

          I think at least in the lastest repro which does 2 deadlock scans something is going wrong during 1st search which misses
          the deadlock with respect to waiting on waitors case. The problem comes when
          we get to I think T0(X) which is marked waiting on a waitor. For this we get to the "waiting on a waitor" part of the deadlock search and
          it does not push anything on the stack. Seems like it should push whatever granted locks are incompatible with it. It just changes
          the current search compatibility space to be the waitor space, which may be fine if it is just waiting for waitors in front of it to get
          granted and there is not current incompatible lock.

          we need to make sure any fix does not start giving back false positives for livelock type waits. The test suite really has almost no
          testing for dealock/livelock cases, probably should add some as part of this fix. The problem in the past has been depending on
          timing it is sometimes hard to guarantee which waiting thread gets the deadlock in a system independent way.

          Show
          Mike Matrigali added a comment - i am wondering if the problem is that lock manager not beeing able to tell the difference for the purpose of deadlock detection the case of a waitor on a waitor because of live lock, and a waitor on a waitor because both are blocked. lockSkip may come into it. In this case I think lockSkip is getting set for both waitors, which properly means it is ok for them to skip waitors as they already hold a lock blocking waitors, but they can't get granted now. granted: T0(S), T1(S), T0(U) waiting: T1(U), T0(X) I think at least in the lastest repro which does 2 deadlock scans something is going wrong during 1st search which misses the deadlock with respect to waiting on waitors case. The problem comes when we get to I think T0(X) which is marked waiting on a waitor. For this we get to the "waiting on a waitor" part of the deadlock search and it does not push anything on the stack. Seems like it should push whatever granted locks are incompatible with it. It just changes the current search compatibility space to be the waitor space, which may be fine if it is just waiting for waitors in front of it to get granted and there is not current incompatible lock. we need to make sure any fix does not start giving back false positives for livelock type waits. The test suite really has almost no testing for dealock/livelock cases, probably should add some as part of this fix. The problem in the past has been depending on timing it is sometimes hard to guarantee which waiting thread gets the deadlock in a system independent way.
          Hide
          Kathey Marsden added a comment - - edited

          Thanks Mike for the tips and recommending tests. I am studying the code but must admit I am still a bit baffled by it all.
          I do think I can help by writing regression tests, so we can feel confident in the fix when we get one. I'll write one for the simple deadlock case that I know works and I'd like to write a test for the live lock scenario. Do I just need three threads, 2 doing a repeated staggered select and another trying to get in and do an update?

          Are there any other tests that would be good?

          Thanks

          Kathey

          Show
          Kathey Marsden added a comment - - edited Thanks Mike for the tips and recommending tests. I am studying the code but must admit I am still a bit baffled by it all. I do think I can help by writing regression tests, so we can feel confident in the fix when we get one. I'll write one for the simple deadlock case that I know works and I'd like to write a test for the live lock scenario. Do I just need three threads, 2 doing a repeated staggered select and another trying to get in and do an update? Are there any other tests that would be good? Thanks Kathey
          Hide
          Kathey Marsden added a comment -

          Attached is a test that test for live lock. I would appreciate it if someone could review to make sure I am testing the right thing. I have
          SelectThread1 - starts right away.
          for 10 tries

          { - select * from t; -wait 3 seconds - commit }

          SelectThread2
          Wait 1 second then the same as SelectThread1>

          UpdateThread
          Wait 2 seconds
          UPDATE T SET I = 456 where I = 456

          Deadlock timeout set to 5 seconds
          Lock timeout set to 10 seconds.

          Makes sure the update goes through without exception.

          Show
          Kathey Marsden added a comment - Attached is a test that test for live lock. I would appreciate it if someone could review to make sure I am testing the right thing. I have SelectThread1 - starts right away. for 10 tries { - select * from t; -wait 3 seconds - commit } SelectThread2 Wait 1 second then the same as SelectThread1> UpdateThread Wait 2 seconds UPDATE T SET I = 456 where I = 456 Deadlock timeout set to 5 seconds Lock timeout set to 10 seconds. Makes sure the update goes through without exception.
          Hide
          Kathey Marsden added a comment -

          I am attaching an old user case which should show a deadlock but gives lock timeouts. I think it is just the same as DERBY-3980 except with inserts instead of deletes or updates, so I don't think it warrants another test. Posting it here so it can be tried with the fix once there is one and perhaps someone else will see something worthwhile adding to the tests.

          Show
          Kathey Marsden added a comment - I am attaching an old user case which should show a deadlock but gives lock timeouts. I think it is just the same as DERBY-3980 except with inserts instead of deletes or updates, so I don't think it warrants another test. Posting it here so it can be tried with the fix once there is one and perhaps someone else will see something worthwhile adding to the tests.
          Hide
          Knut Anders Hatlen added a comment -

          The test looks OK to me. It tests what its class javadoc says (that the two readers don't starve the writer). But wasn't the point actually to test that the deadlock detection doesn't incorrectly report it as a deadlock? I'm not sure if deadlock detection will be performed in this case. If I understand the description of the test correctly, the update thread will be granted the exclusive lock after two seconds, but deadlock detection is not performed unless it needs to wait more than five seconds.

          Some minor nits:

          • This code in selectWorker() doesn't do anything:

          + if (delay > 0)
          + ;

          • The while loop in selectWorker() checks the exit condition twice, here

          + while (!isUpdateDone() && tries <= 10) {

          and here

          + if (tries == 10)
          + fail("Update did not occur after 10 selects");

          The first check has no effect currently, and it should probably be removed so that one later doesn't disable the last check by accident if one decides to change to some other number of iterations and forgets to update one of the places.

          Show
          Knut Anders Hatlen added a comment - The test looks OK to me. It tests what its class javadoc says (that the two readers don't starve the writer). But wasn't the point actually to test that the deadlock detection doesn't incorrectly report it as a deadlock? I'm not sure if deadlock detection will be performed in this case. If I understand the description of the test correctly, the update thread will be granted the exclusive lock after two seconds, but deadlock detection is not performed unless it needs to wait more than five seconds. Some minor nits: This code in selectWorker() doesn't do anything: + if (delay > 0) + ; The while loop in selectWorker() checks the exit condition twice, here + while (!isUpdateDone() && tries <= 10) { and here + if (tries == 10) + fail("Update did not occur after 10 selects"); The first check has no effect currently, and it should probably be removed so that one later doesn't disable the last check by accident if one decides to change to some other number of iterations and forgets to update one of the places.
          Hide
          Kathey Marsden added a comment -

          Thanks Knut for looking at the patch. I committed revision 727147 for the changes you recommended. The test for deadlock detection I checked in was Derby3980DeadlockTest. That test currently fails with a lock timeout, so is not part of the suite.

          I added LiveLockTest because Mike suggested that we add a test for the live lock scenario as part of this issue, so we make sure we don't regress. Perhaps it would have been clearer to open a separate issue. I hope I understood live lock correctly and that's what's being tested with LiveLockTest. If not, I'll try to make another shot at it when I get back from holiday in a few weeks time. Hopefully the mysteries of Deadlock.look() will be unlocked during that time and this issue can be resolved. If so the test Derby3980DeadlockTest should be enabled.

          Kathey

          Show
          Kathey Marsden added a comment - Thanks Knut for looking at the patch. I committed revision 727147 for the changes you recommended. The test for deadlock detection I checked in was Derby3980DeadlockTest. That test currently fails with a lock timeout, so is not part of the suite. I added LiveLockTest because Mike suggested that we add a test for the live lock scenario as part of this issue, so we make sure we don't regress. Perhaps it would have been clearer to open a separate issue. I hope I understood live lock correctly and that's what's being tested with LiveLockTest. If not, I'll try to make another shot at it when I get back from holiday in a few weeks time. Hopefully the mysteries of Deadlock.look() will be unlocked during that time and this issue can be resolved. If so the test Derby3980DeadlockTest should be enabled. Kathey
          Hide
          Knut Anders Hatlen added a comment -

          I understood Mike's comment as that we should make sure a livelock wasn't detected as a deadlock (false positive) after we make the changes, which means we would have to run deadlock detection on a possible livelock situation in order to test it. I have problems seeing how changes to the deadlock detection algorithm could cause a regression in a test that doesn't exercise that algorithm. But I agree that it's good to have this test too. Just shortening the deadlock timeout or increasing the time the select threads hold the locks would additionally make it test the deadlock detection algorithm, though.

          Show
          Knut Anders Hatlen added a comment - I understood Mike's comment as that we should make sure a livelock wasn't detected as a deadlock (false positive) after we make the changes, which means we would have to run deadlock detection on a possible livelock situation in order to test it. I have problems seeing how changes to the deadlock detection algorithm could cause a regression in a test that doesn't exercise that algorithm. But I agree that it's good to have this test too. Just shortening the deadlock timeout or increasing the time the select threads hold the locks would additionally make it test the deadlock detection algorithm, though.
          Hide
          Kathey Marsden added a comment -

          Here is the updated patch where I confirmed we enter Deadlock.look(). I made the sleep 25 seconds and used the default deadlock and wait timeouts. I tried just adjusting the deadlock timeout down to achieve the same result, but found we were never entering Deadlock.look(). There seems to be a problem with the timeout values at least when set low.

          Show
          Kathey Marsden added a comment - Here is the updated patch where I confirmed we enter Deadlock.look(). I made the sleep 25 seconds and used the default deadlock and wait timeouts. I tried just adjusting the deadlock timeout down to achieve the same result, but found we were never entering Deadlock.look(). There seems to be a problem with the timeout values at least when set low.
          Hide
          Knut Anders Hatlen added a comment -

          Do you see the problem with low timeout values if you change the nesting order of the CleanDatabaseTestSetup and the DatabasePropertyTestSetup? Many tests set the timeouts to even lower values (typically 2 seconds deadlock timeout and 4 seconds wait timeout).

          Show
          Knut Anders Hatlen added a comment - Do you see the problem with low timeout values if you change the nesting order of the CleanDatabaseTestSetup and the DatabasePropertyTestSetup? Many tests set the timeouts to even lower values (typically 2 seconds deadlock timeout and 4 seconds wait timeout).
          Hide
          Knut Anders Hatlen added a comment -

          I think I see now why we have a problem with a lower deadlock timeout. Since the updater is supposed to be granted the lock after two seconds, we need to set the deadlock timeout to one second in order to trigger the deadlock detection. However, just before the updater has waited for one second, the first select thread releases its lock and wakes up all waiters. This early wakeup makes the updater check if it can obtain the lock, which it can't, but it won't do a deadlock check. It then goes back to sleep for another second, but before it wakes up to perform deadlock detection, it is granted the lock. (It looks like we allow five early wakeups before we reduce the time to wait for a deadlock. So for each early wakeup, the deadlock timeout is effectively increased. This is probably a good way to do it, since early wakeups indicate that it's probably not a deadlock yet.)

          Easy workaround: Set the deadlock timeout to 1 second, and make the select threads hold the locks for 4 seconds instead of 3 seconds. Then the update thread will wait for two seconds before it gets the early wakeup (because the first select thread releases the lock), and by that time it will already have performed deadlock detection.

          Show
          Knut Anders Hatlen added a comment - I think I see now why we have a problem with a lower deadlock timeout. Since the updater is supposed to be granted the lock after two seconds, we need to set the deadlock timeout to one second in order to trigger the deadlock detection. However, just before the updater has waited for one second, the first select thread releases its lock and wakes up all waiters. This early wakeup makes the updater check if it can obtain the lock, which it can't, but it won't do a deadlock check. It then goes back to sleep for another second, but before it wakes up to perform deadlock detection, it is granted the lock. (It looks like we allow five early wakeups before we reduce the time to wait for a deadlock. So for each early wakeup, the deadlock timeout is effectively increased. This is probably a good way to do it, since early wakeups indicate that it's probably not a deadlock yet.) Easy workaround: Set the deadlock timeout to 1 second, and make the select threads hold the locks for 4 seconds instead of 3 seconds. Then the update thread will wait for two seconds before it gets the early wakeup (because the first select thread releases the lock), and by that time it will already have performed deadlock detection.
          Hide
          Kathey Marsden added a comment -

          Thanks Knut for figuring out how to set the sleep and deadlock timeout. I made the changes you suggested. Although we are now going through deadlock detection, I am still not entirely sure I am testing the scenario Mike was worried about. If you or Mike have any suggestions on more test scenarios, please let me know and I will add them. I think I will abandon staring at the deadlock detection code for a while. I hope someone that understands this code better than I will fix this bug.

          Show
          Kathey Marsden added a comment - Thanks Knut for figuring out how to set the sleep and deadlock timeout. I made the changes you suggested. Although we are now going through deadlock detection, I am still not entirely sure I am testing the scenario Mike was worried about. If you or Mike have any suggestions on more test scenarios, please let me know and I will add them. I think I will abandon staring at the deadlock detection code for a while. I hope someone that understands this code better than I will fix this bug.
          Hide
          Knut Anders Hatlen added a comment -

          Triaged for 10.5.2. Changed component to Services, since that's where the deadlock detection code is located.

          Show
          Knut Anders Hatlen added a comment - Triaged for 10.5.2. Changed component to Services, since that's where the deadlock detection code is located.
          Hide
          Knut Anders Hatlen added a comment -

          Attaching a patch for this issue. It's essentially the same patch as the 1b patch attached to DERBY-5073, only that the code has been re-indented to account for the extra nesting level added in that patch.

          The patch also adds a regression test with a test case both for this issue and for DERBY-5073. The test cases fail with a lock timeout without the fix and pass (get the expected deadlock exception) with the fix.

          I'm running the regression test suites on the patch now and intend to commit it to trunk if the tests pass.

          Show
          Knut Anders Hatlen added a comment - Attaching a patch for this issue. It's essentially the same patch as the 1b patch attached to DERBY-5073 , only that the code has been re-indented to account for the extra nesting level added in that patch. The patch also adds a regression test with a test case both for this issue and for DERBY-5073 . The test cases fail with a lock timeout without the fix and pass (get the expected deadlock exception) with the fix. I'm running the regression test suites on the patch now and intend to commit it to trunk if the tests pass.
          Hide
          Bryan Pendleton added a comment -

          Wow! Now that's a test case!

          Seriously, though, the new tests look excellent; they're commented very
          well, and the code is clearly written. There's no easy way to write deadlock
          test cases; your effort here should lead the way and make it easier for
          people to write additional deadlock test cases in the future. Thanks very
          much for taking the time to do so.

          It's too bad we can't set the deadlock timeout to less than a second. Nowadays,
          that's an eternity on a modern computer.

          Show
          Bryan Pendleton added a comment - Wow! Now that's a test case! Seriously, though, the new tests look excellent; they're commented very well, and the code is clearly written. There's no easy way to write deadlock test cases; your effort here should lead the way and make it easier for people to write additional deadlock test cases in the future. Thanks very much for taking the time to do so. It's too bad we can't set the deadlock timeout to less than a second. Nowadays, that's an eternity on a modern computer.
          Hide
          Knut Anders Hatlen added a comment -

          Thanks, Bryan. I take that as a +1, then.

          Committed revision 1081455.

          I agree with your comment about sub-second deadlock timeouts. Actually, one may even consider doing deadlock detection every time a lock cannot be granted immediately. I wouldn't think that it would be too expensive, though I've never measured the cost of the deadlock detection, and if you have to wait for a lock, you've lost performance-wise in any case, so one might just as well do the extra check. But that's another JIRA issue, I guess...

          Show
          Knut Anders Hatlen added a comment - Thanks, Bryan. I take that as a +1, then. Committed revision 1081455. I agree with your comment about sub-second deadlock timeouts. Actually, one may even consider doing deadlock detection every time a lock cannot be granted immediately. I wouldn't think that it would be too expensive, though I've never measured the cost of the deadlock detection, and if you have to wait for a lock, you've lost performance-wise in any case, so one might just as well do the extra check. But that's another JIRA issue, I guess...
          Hide
          Kathey Marsden added a comment -

          Closing this issue. Thank you Knut for the great fix and figuring out the tricky deadlock code.

          Show
          Kathey Marsden added a comment - Closing this issue. Thank you Knut for the great fix and figuring out the tricky deadlock code.

            People

            • Assignee:
              Knut Anders Hatlen
              Reporter:
              Kathey Marsden
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development