Derby
  1. Derby
  2. DERBY-5073

Derby deadlocks without recourse on simultaneous correlated subqueries

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Duplicate
    • Affects Version/s: 10.0.2.1, 10.1.2.1, 10.2.2.0, 10.3.3.0, 10.4.2.0, 10.5.3.0, 10.6.2.1, 10.7.1.1, 10.8.1.2
    • Fix Version/s: None
    • Component/s: Services
    • Labels:
      None
    • Issue & fix info:
      Repro attached
    • Bug behavior facts:
      Performance, Seen in production

      Description

      When the following two queries are run against tables that contain the necessary fields, using multiple threads, Derby deadlocks and none of the queries ever returns. Derby apparently detects no deadlock condition, either.

      SELECT t0.* FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE t1.parentidhash IN AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid) AND t0.jobid=?
      SELECT t0.* FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE t1.parentidhash IN AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid AND t1.newField=?) AND t0.jobid=?

      This code comes from Apache ManifoldCF, and has occurred when there are five or more threads trying to execute these two queries at the same time. Originally we found this on 10.5.3.0. It was hoped that 10.7.1.1 would fix the problem, but it hasn't.

      1. Derby5073.java
        3 kB
        Knut Anders Hatlen
      2. derby-5073-1a.diff
        0.5 kB
        Knut Anders Hatlen
      3. derby-5073-1b.diff
        2 kB
        Knut Anders Hatlen
      4. comments.diff
        13 kB
        Knut Anders Hatlen

        Issue Links

          Activity

          Hide
          Karl Wright added a comment -

          A thread dump on 10.5.3.0 yielded multiple threads waiting with stack frames looking like this:

          "Thread-1100" daemon prio=6 tid=0x03be9800 nid=0xd70 in Object.wait() [0x05a5e000]
          java.lang.Thread.State: WAITING (on object monitor)
          at java.lang.Object.wait(Native Method)
          at java.lang.Object.wait(Object.java:485)
          at org.apache.derby.impl.services.locks.ActiveLock.waitForGrant(Unknown Source)

          • locked <0x285dc400> (a org.apache.derby.impl.services.locks.ActiveLock)
            at org.apache.derby.impl.services.locks.ConcurrentLockSet.lockObject(Unknown Source)
            at org.apache.derby.impl.services.locks.ConcurrentLockSet.zeroDurationLockObject(Unknown Source)
            at org.apache.derby.impl.services.locks.AbstractPool.zeroDurationlockObject(Unknown Source)
            at org.apache.derby.impl.services.locks.ConcurrentPool.zeroDurationlockObject(Unknown Source)
            at org.apache.derby.impl.store.raw.xact.RowLocking2nohold.lockRecordForRead(Unknown Source)
            at org.apache.derby.impl.store.access.heap.HeapController.lockRow(Unknown Source)
            at org.apache.derby.impl.store.access.heap.HeapController.lockRow(Unknown Source)
            at org.apache.derby.impl.store.access.btree.index.B2IRowLocking3.lockRowOnPage(Unknown Source)
            at org.apache.derby.impl.store.access.btree.index.B2IRowLocking3._lockScanRow(Unknown Source)
            at org.apache.derby.impl.store.access.btree.index.B2IRowLockingRR.lockScanRow(Unknown Source)
            at org.apache.derby.impl.store.access.btree.BTreeForwardScan.fetchRows(Unknown Source)
            at org.apache.derby.impl.store.access.btree.BTreeScan.fetchNextGroup(Unknown Source)
            at org.apache.derby.impl.sql.execute.BulkTableScanResultSet.reloadArray(Unknown Source)
            at org.apache.derby.impl.sql.execute.BulkTableScanResultSet.getNextRowCore(Unknown Source)
            at org.apache.derby.impl.sql.execute.IndexRowToBaseRowResultSet.getNextRowCore(Unknown Source)
            at org.apache.derby.impl.sql.execute.NestedLoopJoinResultSet.getNextRowCore(Unknown Source)
            at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.getNextRowCore(Unknown Source)
            at org.apache.derby.impl.sql.execute.BasicNoPutResultSetImpl.getNextRow(Unknown Source)
            at org.apache.derby.impl.jdbc.EmbedResultSet.movePosition(Unknown Source)
          • locked <0x2858b388> (a org.apache.derby.impl.jdbc.EmbedConnection40)
            at org.apache.derby.impl.jdbc.EmbedResultSet.next(Unknown Source)
          Show
          Karl Wright added a comment - A thread dump on 10.5.3.0 yielded multiple threads waiting with stack frames looking like this: "Thread-1100" daemon prio=6 tid=0x03be9800 nid=0xd70 in Object.wait() [0x05a5e000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:485) at org.apache.derby.impl.services.locks.ActiveLock.waitForGrant(Unknown Source) locked <0x285dc400> (a org.apache.derby.impl.services.locks.ActiveLock) at org.apache.derby.impl.services.locks.ConcurrentLockSet.lockObject(Unknown Source) at org.apache.derby.impl.services.locks.ConcurrentLockSet.zeroDurationLockObject(Unknown Source) at org.apache.derby.impl.services.locks.AbstractPool.zeroDurationlockObject(Unknown Source) at org.apache.derby.impl.services.locks.ConcurrentPool.zeroDurationlockObject(Unknown Source) at org.apache.derby.impl.store.raw.xact.RowLocking2nohold.lockRecordForRead(Unknown Source) at org.apache.derby.impl.store.access.heap.HeapController.lockRow(Unknown Source) at org.apache.derby.impl.store.access.heap.HeapController.lockRow(Unknown Source) at org.apache.derby.impl.store.access.btree.index.B2IRowLocking3.lockRowOnPage(Unknown Source) at org.apache.derby.impl.store.access.btree.index.B2IRowLocking3._lockScanRow(Unknown Source) at org.apache.derby.impl.store.access.btree.index.B2IRowLockingRR.lockScanRow(Unknown Source) at org.apache.derby.impl.store.access.btree.BTreeForwardScan.fetchRows(Unknown Source) at org.apache.derby.impl.store.access.btree.BTreeScan.fetchNextGroup(Unknown Source) at org.apache.derby.impl.sql.execute.BulkTableScanResultSet.reloadArray(Unknown Source) at org.apache.derby.impl.sql.execute.BulkTableScanResultSet.getNextRowCore(Unknown Source) at org.apache.derby.impl.sql.execute.IndexRowToBaseRowResultSet.getNextRowCore(Unknown Source) at org.apache.derby.impl.sql.execute.NestedLoopJoinResultSet.getNextRowCore(Unknown Source) at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.getNextRowCore(Unknown Source) at org.apache.derby.impl.sql.execute.BasicNoPutResultSetImpl.getNextRow(Unknown Source) at org.apache.derby.impl.jdbc.EmbedResultSet.movePosition(Unknown Source) locked <0x2858b388> (a org.apache.derby.impl.jdbc.EmbedConnection40) at org.apache.derby.impl.jdbc.EmbedResultSet.next(Unknown Source)
          Hide
          Knut Anders Hatlen added a comment -

          Has the application set derby.locks.waitTimeout? The Object.wait() call in the stack trace should have a timeout, unless derby.locks.waitTimeout is negative, so I don't understand how it could be hanging there forever.

          Are there any threads updating the database in parallel with the threads that perform these queries, or are the SELECT threads the only threads working on the database?

          Show
          Knut Anders Hatlen added a comment - Has the application set derby.locks.waitTimeout? The Object.wait() call in the stack trace should have a timeout, unless derby.locks.waitTimeout is negative, so I don't understand how it could be hanging there forever. Are there any threads updating the database in parallel with the threads that perform these queries, or are the SELECT threads the only threads working on the database?
          Hide
          Karl Wright added a comment -

          (1) Yes, the wait timeout is set to infinity.
          (2) Database updating is happening at the same time by other threads, but I could not find any thread that was blocking in an update when this block occurred. However, I should note that these queries are likely to be taking place within a transaction, and it is possible that writes occurred in the same transaction already, but I cannot confirm that from the thread dump.

          Show
          Karl Wright added a comment - (1) Yes, the wait timeout is set to infinity. (2) Database updating is happening at the same time by other threads, but I could not find any thread that was blocking in an update when this block occurred. However, I should note that these queries are likely to be taking place within a transaction, and it is possible that writes occurred in the same transaction already, but I cannot confirm that from the thread dump.
          Hide
          Knut Anders Hatlen added a comment -

          Is derby.locks.deadlockTimeout also set to infinity? If so, that would explain why the deadlock isn't reported, if a deadlock is the underlying problem here.

          Note that a thread doesn't need to be blocked in an update in order to hold an exclusive lock. It's enough that it has performed an update without committing or aborting the transaction after on. You can connect to the database and execute the statement "select * from syscs_diag.lock_table" to find out if some transaction is holding an exclusive lock.

          Show
          Knut Anders Hatlen added a comment - Is derby.locks.deadlockTimeout also set to infinity? If so, that would explain why the deadlock isn't reported, if a deadlock is the underlying problem here. Note that a thread doesn't need to be blocked in an update in order to hold an exclusive lock. It's enough that it has performed an update without committing or aborting the transaction after on. You can connect to the database and execute the statement "select * from syscs_diag.lock_table" to find out if some transaction is holding an exclusive lock.
          Hide
          Karl Wright added a comment -

          The deadlock timeout is NOT set to infinity, but to 60 seconds. It would be perfectly OK if Derby returned the appropriate SQL exception indicating deadlock under this circumstance.

          The suggestion to connect to the database is great but unfortunately this is using the embedded JDBC driver, so trying to connect while it's hung probably won't work.

          Show
          Karl Wright added a comment - The deadlock timeout is NOT set to infinity, but to 60 seconds. It would be perfectly OK if Derby returned the appropriate SQL exception indicating deadlock under this circumstance. The suggestion to connect to the database is great but unfortunately this is using the embedded JDBC driver, so trying to connect while it's hung probably won't work.
          Hide
          Karl Wright added a comment -

          Also, just to make it very clear, I am already certain that a transaction IS holding an exclusive lock. The issue is that deadlock is never detected by Derby, and it must be detected. If it is not, there is no possibility of recovery from this situation, other than relying on the lock timeout, which then limits the longest running transaction I can have.

          Show
          Karl Wright added a comment - Also, just to make it very clear, I am already certain that a transaction IS holding an exclusive lock. The issue is that deadlock is never detected by Derby, and it must be detected. If it is not, there is no possibility of recovery from this situation, other than relying on the lock timeout, which then limits the longest running transaction I can have.
          Hide
          Knut Anders Hatlen added a comment -

          Note that Derby will only detect a deadlock if the thread holding the exclusive lock is also waiting for a lock itself. That is, there has to be a cycle in the wait graph. It's not enough that a transaction is simply holding an exclusive lock for a very long time without committing. I'm not sure if that's what's happening here, but that situation is supposed to be handled by the lock timeout.

          One way to connect to the embedded database while the system is hung, is to set the system property derby.drda.startNetworkServer=true - http://db.apache.org/derby/docs/10.7/adminguide/radminconfigstartnetworkserver.html - and connect to the database with ij using the client driver.

          Show
          Knut Anders Hatlen added a comment - Note that Derby will only detect a deadlock if the thread holding the exclusive lock is also waiting for a lock itself. That is, there has to be a cycle in the wait graph. It's not enough that a transaction is simply holding an exclusive lock for a very long time without committing. I'm not sure if that's what's happening here, but that situation is supposed to be handled by the lock timeout. One way to connect to the embedded database while the system is hung, is to set the system property derby.drda.startNetworkServer=true - http://db.apache.org/derby/docs/10.7/adminguide/radminconfigstartnetworkserver.html - and connect to the database with ij using the client driver.
          Hide
          Karl Wright added a comment -

          Note that Derby will only detect a deadlock if the thread holding the exclusive lock is also waiting for a lock itself.

          Of course. I see no reason why that should not be the case here. There are no threads with a transaction open that are waiting on anything else.

          Show
          Karl Wright added a comment - Note that Derby will only detect a deadlock if the thread holding the exclusive lock is also waiting for a lock itself. Of course. I see no reason why that should not be the case here. There are no threads with a transaction open that are waiting on anything else.
          Hide
          Karl Wright added a comment -

          The locks table at the time of a hang looks like this:

          ij> select * from syscs_diag.lock_table;

          XID |TYPE |MODE|TABLENAME
          303942 |ROW |X |JOBQUEUE

          303956 |ROW |S |JOBQUEUE

          303959 |ROW |S |JOBQUEUE

          303962 |ROW |S |JOBQUEUE

          304018 |ROW |X |JOBQUEUE

          303974 |ROW |X |JOBQUEUE

          303987 |TABLE|IX |JOBS

          303962 |ROW |X |JOBQUEUE

          303965 |ROW |S |JOBQUEUE

          303968 |ROW |S |JOBQUEUE

          303971 |ROW |S |JOBQUEUE

          303974 |ROW |S |JOBQUEUE

          303977 |ROW |S |JOBQUEUE

          303987 |ROW |S |JOBQUEUE

          304018 |ROW |S |JOBQUEUE

          303942 |TABLE|IS |CARRYDOWN

          303956 |TABLE|IS |CARRYDOWN

          303959 |TABLE|IS |CARRYDOWN

          303962 |TABLE|IS |CARRYDOWN

          303965 |TABLE|IS |CARRYDOWN

          303968 |TABLE|IS |CARRYDOWN

          303971 |TABLE|IS |CARRYDOWN

          303974 |TABLE|IS |CARRYDOWN

          303977 |TABLE|IS |CARRYDOWN

          304018 |TABLE|IS |CARRYDOWN

          303942 |TABLE|IX |PREREQEVENTS

          303956 |TABLE|IX |PREREQEVENTS

          303959 |TABLE|IX |PREREQEVENTS

          303962 |TABLE|IX |PREREQEVENTS

          303965 |TABLE|IX |PREREQEVENTS

          303968 |TABLE|IX |PREREQEVENTS

          303971 |TABLE|IX |PREREQEVENTS

          303974 |TABLE|IX |PREREQEVENTS

          303977 |TABLE|IX |PREREQEVENTS

          304018 |TABLE|IX |PREREQEVENTS

          303956 |ROW |X |JOBQUEUE

          303977 |ROW |X |JOBQUEUE

          303968 |ROW |X |JOBQUEUE

          303965 |ROW |X |JOBQUEUE

          303959 |ROW |X |JOBQUEUE

          303971 |ROW |X |JOBQUEUE

          303942 |ROW |S |JOBQUEUE

          303942 |TABLE|IX |JOBQUEUE

          303942 |TABLE|IS |JOBQUEUE

          303956 |TABLE|IX |JOBQUEUE

          303956 |TABLE|IS |JOBQUEUE

          303959 |TABLE|IX |JOBQUEUE

          303959 |TABLE|IS |JOBQUEUE

          303962 |TABLE|IX |JOBQUEUE

          303962 |TABLE|IS |JOBQUEUE

          303965 |TABLE|IX |JOBQUEUE

          303965 |TABLE|IS |JOBQUEUE

          303968 |TABLE|IX |JOBQUEUE

          303968 |TABLE|IS |JOBQUEUE

          303971 |TABLE|IX |JOBQUEUE

          303971 |TABLE|IS |JOBQUEUE

          303974 |TABLE|IX |JOBQUEUE

          303974 |TABLE|IS |JOBQUEUE

          303977 |TABLE|IX |JOBQUEUE

          303977 |TABLE|IS |JOBQUEUE

          303987 |TABLE|IS |JOBQUEUE

          304018 |TABLE|IX |JOBQUEUE

          304018 |TABLE|IS |JOBQUEUE

          63 rows selected

          Since the process is still active, other threads that the ones that get hung may well be active. I have confirmed that all blocked threads are blocked due to database, and for no other reason.

          Show
          Karl Wright added a comment - The locks table at the time of a hang looks like this: ij> select * from syscs_diag.lock_table; XID |TYPE |MODE|TABLENAME 303942 |ROW |X |JOBQUEUE 303956 |ROW |S |JOBQUEUE 303959 |ROW |S |JOBQUEUE 303962 |ROW |S |JOBQUEUE 304018 |ROW |X |JOBQUEUE 303974 |ROW |X |JOBQUEUE 303987 |TABLE|IX |JOBS 303962 |ROW |X |JOBQUEUE 303965 |ROW |S |JOBQUEUE 303968 |ROW |S |JOBQUEUE 303971 |ROW |S |JOBQUEUE 303974 |ROW |S |JOBQUEUE 303977 |ROW |S |JOBQUEUE 303987 |ROW |S |JOBQUEUE 304018 |ROW |S |JOBQUEUE 303942 |TABLE|IS |CARRYDOWN 303956 |TABLE|IS |CARRYDOWN 303959 |TABLE|IS |CARRYDOWN 303962 |TABLE|IS |CARRYDOWN 303965 |TABLE|IS |CARRYDOWN 303968 |TABLE|IS |CARRYDOWN 303971 |TABLE|IS |CARRYDOWN 303974 |TABLE|IS |CARRYDOWN 303977 |TABLE|IS |CARRYDOWN 304018 |TABLE|IS |CARRYDOWN 303942 |TABLE|IX |PREREQEVENTS 303956 |TABLE|IX |PREREQEVENTS 303959 |TABLE|IX |PREREQEVENTS 303962 |TABLE|IX |PREREQEVENTS 303965 |TABLE|IX |PREREQEVENTS 303968 |TABLE|IX |PREREQEVENTS 303971 |TABLE|IX |PREREQEVENTS 303974 |TABLE|IX |PREREQEVENTS 303977 |TABLE|IX |PREREQEVENTS 304018 |TABLE|IX |PREREQEVENTS 303956 |ROW |X |JOBQUEUE 303977 |ROW |X |JOBQUEUE 303968 |ROW |X |JOBQUEUE 303965 |ROW |X |JOBQUEUE 303959 |ROW |X |JOBQUEUE 303971 |ROW |X |JOBQUEUE 303942 |ROW |S |JOBQUEUE 303942 |TABLE|IX |JOBQUEUE 303942 |TABLE|IS |JOBQUEUE 303956 |TABLE|IX |JOBQUEUE 303956 |TABLE|IS |JOBQUEUE 303959 |TABLE|IX |JOBQUEUE 303959 |TABLE|IS |JOBQUEUE 303962 |TABLE|IX |JOBQUEUE 303962 |TABLE|IS |JOBQUEUE 303965 |TABLE|IX |JOBQUEUE 303965 |TABLE|IS |JOBQUEUE 303968 |TABLE|IX |JOBQUEUE 303968 |TABLE|IS |JOBQUEUE 303971 |TABLE|IX |JOBQUEUE 303971 |TABLE|IS |JOBQUEUE 303974 |TABLE|IX |JOBQUEUE 303974 |TABLE|IS |JOBQUEUE 303977 |TABLE|IX |JOBQUEUE 303977 |TABLE|IS |JOBQUEUE 303987 |TABLE|IS |JOBQUEUE 304018 |TABLE|IX |JOBQUEUE 304018 |TABLE|IS |JOBQUEUE 63 rows selected Since the process is still active, other threads that the ones that get hung may well be active. I have confirmed that all blocked threads are blocked due to database, and for no other reason.
          Hide
          Karl Wright added a comment -

          Is there anything else that would be helpful to diagnose why Derby does not detect a deadlock here? I am not certain how to read this output, nor am I sure it will help because it doesn't tell us what lock the code is TRYING to grab.

          Show
          Karl Wright added a comment - Is there anything else that would be helpful to diagnose why Derby does not detect a deadlock here? I am not certain how to read this output, nor am I sure it will help because it doesn't tell us what lock the code is TRYING to grab.
          Hide
          Knut Anders Hatlen added a comment -

          Hi Karl,

          It might be useful to include the STATE column from the diagnostic table, so that we see which locks are granted and which are being waited for. Then we may be able to reconstruct the wait graph and see if it contains any cycles that the deadlock detection algorithm has missed.

          Show
          Knut Anders Hatlen added a comment - Hi Karl, It might be useful to include the STATE column from the diagnostic table, so that we see which locks are granted and which are being waited for. Then we may be able to reconstruct the wait graph and see if it contains any cycles that the deadlock detection algorithm has missed.
          Hide
          Karl Wright added a comment -

          I'm sorry - which diagnostic table has a STATE column you wish to see? I've printed all columns from the Locks diagnostic table, as you can see, and there is no STATE column.

          Show
          Karl Wright added a comment - I'm sorry - which diagnostic table has a STATE column you wish to see? I've printed all columns from the Locks diagnostic table, as you can see, and there is no STATE column.
          Hide
          Knut Anders Hatlen added a comment -

          That's odd... When I execute "select * from syscs_diag.lock_table" in ij, I see the following columns: XID, TYPE, MODE, TABLENAME, LOCKNAME, STATE, TABLETYPE, LOCKCOUNT, INDEXNAME.

          Show
          Knut Anders Hatlen added a comment - That's odd... When I execute "select * from syscs_diag.lock_table" in ij, I see the following columns: XID, TYPE, MODE, TABLENAME, LOCKNAME, STATE, TABLETYPE, LOCKCOUNT, INDEXNAME.
          Hide
          Karl Wright added a comment -

          It looks to me like this is getting truncated. I'm confirming; the guy with this problem who I am in communication with may have inadvertantly done something funny with cut-and-paste. When I do the same thing here, I get the columns you describe. Stay tuned.

          Show
          Karl Wright added a comment - It looks to me like this is getting truncated. I'm confirming; the guy with this problem who I am in communication with may have inadvertantly done something funny with cut-and-paste. When I do the same thing here, I get the columns you describe. Stay tuned.
          Hide
          Karl Wright added a comment -

          Ok, this is better:

          ij> select * from syscs_diag.lock_table;
          XID |TYPE |MODE|TABLENAME |LOCKNAME |STATE|TABLETYPE|LOCK&|INDEXNAME

          ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
          ---------------------------
          319444 |ROW |X |JOBQUEUE |(3,279) |GRANT|T |11 |NULL

          319447 |ROW |S |JOBQUEUE |(3,279) |WAIT |T |0 |NULL

          319450 |ROW |S |JOBQUEUE |(3,279) |WAIT |T |0 |NULL

          319453 |ROW |S |JOBQUEUE |(3,279) |WAIT |T |0 |NULL

          319459 |ROW |S |JOBQUEUE |(3,279) |WAIT |T |0 |NULL

          319456 |ROW |S |JOBQUEUE |(3,279) |WAIT |T |0 |NULL

          319462 |ROW |S |JOBQUEUE |(3,279) |WAIT |T |0 |NULL

          319553 |ROW |S |JOBQUEUE |(3,279) |WAIT |T |0 |NULL

          319435 |ROW |X |JOBQUEUE |(3,306) |GRANT|T |11 |NULL

          319447 |ROW |X |JOBQUEUE |(3,304) |GRANT|T |11 |NULL

          319435 |TABLE|IS |CARRYDOWN |Tablelock |GRANT|T |1 |NULL

          319440 |TABLE|IS |CARRYDOWN |Tablelock |GRANT|T |1 |NULL

          319444 |TABLE|IS |CARRYDOWN |Tablelock |GRANT|T |1 |NULL

          319447 |TABLE|IS |CARRYDOWN |Tablelock |GRANT|T |1 |NULL

          319450 |TABLE|IS |CARRYDOWN |Tablelock |GRANT|T |1 |NULL

          319453 |TABLE|IS |CARRYDOWN |Tablelock |GRANT|T |1 |NULL

          319459 |TABLE|IS |CARRYDOWN |Tablelock |GRANT|T |1 |NULL

          319456 |TABLE|IS |CARRYDOWN |Tablelock |GRANT|T |1 |NULL

          319462 |TABLE|IS |CARRYDOWN |Tablelock |GRANT|T |1 |NULL

          319553 |TABLE|IS |CARRYDOWN |Tablelock |GRANT|T |1 |NULL

          319435 |TABLE|IX |PREREQEVENTS |Tablelock |GRANT|T |2 |NULL

          319440 |TABLE|IX |PREREQEVENTS |Tablelock |GRANT|T |2 |NULL

          319444 |TABLE|IX |PREREQEVENTS |Tablelock |GRANT|T |2 |NULL

          319447 |TABLE|IX |PREREQEVENTS |Tablelock |GRANT|T |2 |NULL

          319450 |TABLE|IX |PREREQEVENTS |Tablelock |GRANT|T |2 |NULL

          319453 |TABLE|IX |PREREQEVENTS |Tablelock |GRANT|T |2 |NULL

          319456 |TABLE|IX |PREREQEVENTS |Tablelock |GRANT|T |2 |NULL

          319459 |TABLE|IX |PREREQEVENTS |Tablelock |GRANT|T |2 |NULL

          319462 |TABLE|IX |PREREQEVENTS |Tablelock |GRANT|T |2 |NULL

          319553 |TABLE|IX |PREREQEVENTS |Tablelock |GRANT|T |2 |NULL

          319459 |ROW |X |JOBQUEUE |(3,308) |GRANT|T |11 |NULL

          323231 |ROW |S |JOBQUEUE |(3,308) |WAIT |T |0 |NULL

          319456 |ROW |X |JOBQUEUE |(3,307) |GRANT|T |11 |NULL

          319440 |ROW |X |JOBQUEUE |(3,295) |GRANT|T |11 |NULL

          319435 |ROW |S |JOBQUEUE |(3,295) |WAIT |T |0 |NULL

          319444 |ROW |S |JOBQUEUE |(3,295) |WAIT |T |0 |NULL

          319450 |ROW |X |JOBQUEUE |(3,297) |GRANT|T |11 |NULL

          319553 |ROW |X |JOBQUEUE |(3,300) |GRANT|T |11 |NULL

          319462 |ROW |X |JOBQUEUE |(3,301) |GRANT|T |11 |NULL

          319440 |ROW |S |JOBQUEUE |(3,301) |WAIT |T |0 |NULL

          319453 |ROW |X |JOBQUEUE |(3,302) |GRANT|T |11 |NULL

          319440 |TABLE|IX |JOBQUEUE |Tablelock |GRANT|T |11 |NULL

          319435 |TABLE|IX |JOBQUEUE |Tablelock |GRANT|T |11 |NULL

          319435 |TABLE|IS |JOBQUEUE |Tablelock |GRANT|T |2 |NULL

          319440 |TABLE|IS |JOBQUEUE |Tablelock |GRANT|T |2 |NULL

          319444 |TABLE|IX |JOBQUEUE |Tablelock |GRANT|T |11 |NULL

          319444 |TABLE|IS |JOBQUEUE |Tablelock |GRANT|T |2 |NULL

          319447 |TABLE|IX |JOBQUEUE |Tablelock |GRANT|T |11 |NULL

          319447 |TABLE|IS |JOBQUEUE |Tablelock |GRANT|T |2 |NULL

          319450 |TABLE|IX |JOBQUEUE |Tablelock |GRANT|T |11 |NULL

          319450 |TABLE|IS |JOBQUEUE |Tablelock |GRANT|T |2 |NULL

          319453 |TABLE|IX |JOBQUEUE |Tablelock |GRANT|T |11 |NULL

          319453 |TABLE|IS |JOBQUEUE |Tablelock |GRANT|T |2 |NULL

          319456 |TABLE|IX |JOBQUEUE |Tablelock |GRANT|T |11 |NULL

          319459 |TABLE|IX |JOBQUEUE |Tablelock |GRANT|T |11 |NULL

          319459 |TABLE|IS |JOBQUEUE |Tablelock |GRANT|T |2 |NULL

          319456 |TABLE|IS |JOBQUEUE |Tablelock |GRANT|T |2 |NULL

          319462 |TABLE|IX |JOBQUEUE |Tablelock |GRANT|T |11 |NULL

          319462 |TABLE|IS |JOBQUEUE |Tablelock |GRANT|T |2 |NULL

          319553 |TABLE|IX |JOBQUEUE |Tablelock |GRANT|T |11 |NULL

          319553 |TABLE|IS |JOBQUEUE |Tablelock |GRANT|T |2 |NULL

          323231 |TABLE|IS |JOBQUEUE |Tablelock |GRANT|T |1 |NULL

          62 rows selected
          ij>

          Show
          Karl Wright added a comment - Ok, this is better: ij> select * from syscs_diag.lock_table; XID |TYPE |MODE|TABLENAME |LOCKNAME |STATE|TABLETYPE|LOCK&|INDEXNAME ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ --------------------------- 319444 |ROW |X |JOBQUEUE |(3,279) |GRANT|T |11 |NULL 319447 |ROW |S |JOBQUEUE |(3,279) |WAIT |T |0 |NULL 319450 |ROW |S |JOBQUEUE |(3,279) |WAIT |T |0 |NULL 319453 |ROW |S |JOBQUEUE |(3,279) |WAIT |T |0 |NULL 319459 |ROW |S |JOBQUEUE |(3,279) |WAIT |T |0 |NULL 319456 |ROW |S |JOBQUEUE |(3,279) |WAIT |T |0 |NULL 319462 |ROW |S |JOBQUEUE |(3,279) |WAIT |T |0 |NULL 319553 |ROW |S |JOBQUEUE |(3,279) |WAIT |T |0 |NULL 319435 |ROW |X |JOBQUEUE |(3,306) |GRANT|T |11 |NULL 319447 |ROW |X |JOBQUEUE |(3,304) |GRANT|T |11 |NULL 319435 |TABLE|IS |CARRYDOWN |Tablelock |GRANT|T |1 |NULL 319440 |TABLE|IS |CARRYDOWN |Tablelock |GRANT|T |1 |NULL 319444 |TABLE|IS |CARRYDOWN |Tablelock |GRANT|T |1 |NULL 319447 |TABLE|IS |CARRYDOWN |Tablelock |GRANT|T |1 |NULL 319450 |TABLE|IS |CARRYDOWN |Tablelock |GRANT|T |1 |NULL 319453 |TABLE|IS |CARRYDOWN |Tablelock |GRANT|T |1 |NULL 319459 |TABLE|IS |CARRYDOWN |Tablelock |GRANT|T |1 |NULL 319456 |TABLE|IS |CARRYDOWN |Tablelock |GRANT|T |1 |NULL 319462 |TABLE|IS |CARRYDOWN |Tablelock |GRANT|T |1 |NULL 319553 |TABLE|IS |CARRYDOWN |Tablelock |GRANT|T |1 |NULL 319435 |TABLE|IX |PREREQEVENTS |Tablelock |GRANT|T |2 |NULL 319440 |TABLE|IX |PREREQEVENTS |Tablelock |GRANT|T |2 |NULL 319444 |TABLE|IX |PREREQEVENTS |Tablelock |GRANT|T |2 |NULL 319447 |TABLE|IX |PREREQEVENTS |Tablelock |GRANT|T |2 |NULL 319450 |TABLE|IX |PREREQEVENTS |Tablelock |GRANT|T |2 |NULL 319453 |TABLE|IX |PREREQEVENTS |Tablelock |GRANT|T |2 |NULL 319456 |TABLE|IX |PREREQEVENTS |Tablelock |GRANT|T |2 |NULL 319459 |TABLE|IX |PREREQEVENTS |Tablelock |GRANT|T |2 |NULL 319462 |TABLE|IX |PREREQEVENTS |Tablelock |GRANT|T |2 |NULL 319553 |TABLE|IX |PREREQEVENTS |Tablelock |GRANT|T |2 |NULL 319459 |ROW |X |JOBQUEUE |(3,308) |GRANT|T |11 |NULL 323231 |ROW |S |JOBQUEUE |(3,308) |WAIT |T |0 |NULL 319456 |ROW |X |JOBQUEUE |(3,307) |GRANT|T |11 |NULL 319440 |ROW |X |JOBQUEUE |(3,295) |GRANT|T |11 |NULL 319435 |ROW |S |JOBQUEUE |(3,295) |WAIT |T |0 |NULL 319444 |ROW |S |JOBQUEUE |(3,295) |WAIT |T |0 |NULL 319450 |ROW |X |JOBQUEUE |(3,297) |GRANT|T |11 |NULL 319553 |ROW |X |JOBQUEUE |(3,300) |GRANT|T |11 |NULL 319462 |ROW |X |JOBQUEUE |(3,301) |GRANT|T |11 |NULL 319440 |ROW |S |JOBQUEUE |(3,301) |WAIT |T |0 |NULL 319453 |ROW |X |JOBQUEUE |(3,302) |GRANT|T |11 |NULL 319440 |TABLE|IX |JOBQUEUE |Tablelock |GRANT|T |11 |NULL 319435 |TABLE|IX |JOBQUEUE |Tablelock |GRANT|T |11 |NULL 319435 |TABLE|IS |JOBQUEUE |Tablelock |GRANT|T |2 |NULL 319440 |TABLE|IS |JOBQUEUE |Tablelock |GRANT|T |2 |NULL 319444 |TABLE|IX |JOBQUEUE |Tablelock |GRANT|T |11 |NULL 319444 |TABLE|IS |JOBQUEUE |Tablelock |GRANT|T |2 |NULL 319447 |TABLE|IX |JOBQUEUE |Tablelock |GRANT|T |11 |NULL 319447 |TABLE|IS |JOBQUEUE |Tablelock |GRANT|T |2 |NULL 319450 |TABLE|IX |JOBQUEUE |Tablelock |GRANT|T |11 |NULL 319450 |TABLE|IS |JOBQUEUE |Tablelock |GRANT|T |2 |NULL 319453 |TABLE|IX |JOBQUEUE |Tablelock |GRANT|T |11 |NULL 319453 |TABLE|IS |JOBQUEUE |Tablelock |GRANT|T |2 |NULL 319456 |TABLE|IX |JOBQUEUE |Tablelock |GRANT|T |11 |NULL 319459 |TABLE|IX |JOBQUEUE |Tablelock |GRANT|T |11 |NULL 319459 |TABLE|IS |JOBQUEUE |Tablelock |GRANT|T |2 |NULL 319456 |TABLE|IS |JOBQUEUE |Tablelock |GRANT|T |2 |NULL 319462 |TABLE|IX |JOBQUEUE |Tablelock |GRANT|T |11 |NULL 319462 |TABLE|IS |JOBQUEUE |Tablelock |GRANT|T |2 |NULL 319553 |TABLE|IX |JOBQUEUE |Tablelock |GRANT|T |11 |NULL 319553 |TABLE|IS |JOBQUEUE |Tablelock |GRANT|T |2 |NULL 323231 |TABLE|IS |JOBQUEUE |Tablelock |GRANT|T |1 |NULL 62 rows selected ij>
          Hide
          Bryan Pendleton added a comment -

          It looks like there is indeed a cycle:

          • 319462 is waiting on 319444
          • 319444 is waiting on 319440
          • 319440 is waiting on 319462
          Show
          Bryan Pendleton added a comment - It looks like there is indeed a cycle: 319462 is waiting on 319444 319444 is waiting on 319440 319440 is waiting on 319462
          Hide
          Karl Wright added a comment -

          So, am I wrong to expect that Derby should detect this deadlock, and throw the appropriate SQLException?

          Show
          Karl Wright added a comment - So, am I wrong to expect that Derby should detect this deadlock, and throw the appropriate SQLException?
          Hide
          Knut Anders Hatlen added a comment -

          Hi Karl,

          Yes, that looks like a cycle, and Derby should have been able to detect it and throw an SQLException.

          Show
          Knut Anders Hatlen added a comment - Hi Karl, Yes, that looks like a cycle, and Derby should have been able to detect it and throw an SQLException.
          Hide
          Knut Anders Hatlen added a comment -

          I think I'm able to reproduce the problem. At least, I've been able to construct a deadlock that looks similar to the one in the lock dump Karl provided. See the attached Java program (Derby5073.java). There's a cycle involving three transactions, as well as some other transactions that wait on the same locks without actually being part of the cycle themselves.

          When disabling waitTimeout, the program runs forever, whereas it should have been terminated with a deadlock error. Even when it's stuck, it keeps one CPU fully occupied. Perhaps an infinite loop in the deadlock detection?

          Show
          Knut Anders Hatlen added a comment - I think I'm able to reproduce the problem. At least, I've been able to construct a deadlock that looks similar to the one in the lock dump Karl provided. See the attached Java program (Derby5073.java). There's a cycle involving three transactions, as well as some other transactions that wait on the same locks without actually being part of the cycle themselves. When disabling waitTimeout, the program runs forever, whereas it should have been terminated with a deadlock error. Even when it's stuck, it keeps one CPU fully occupied. Perhaps an infinite loop in the deadlock detection?
          Hide
          Knut Anders Hatlen added a comment -

          Seen all the way back to 10.0.

          Show
          Knut Anders Hatlen added a comment - Seen all the way back to 10.0.
          Hide
          Knut Anders Hatlen added a comment -

          I took a look at this in the debugger. The deadlock is indeed detected, and a victim is chosen and woken up. The victim rechecks that there still is a deadlock, so that it doesn't have to give up if it's no longer part of a deadlock. When rechecking, it finds that there is a deadlock, and picks another victim, whereas it goes back to sleep waiting for the lock again itself. The new victim will do the same, and pick yet another victim. And so it goes on.

          There is code in place to prevent a victim from picking another victim when rechecking the deadlock, but for some reason that code doesn't work in this scenario.

          Show
          Knut Anders Hatlen added a comment - I took a look at this in the debugger. The deadlock is indeed detected, and a victim is chosen and woken up. The victim rechecks that there still is a deadlock, so that it doesn't have to give up if it's no longer part of a deadlock. When rechecking, it finds that there is a deadlock, and picks another victim, whereas it goes back to sleep waiting for the lock again itself. The new victim will do the same, and pick yet another victim. And so it goes on. There is code in place to prevent a victim from picking another victim when rechecking the deadlock, but for some reason that code doesn't work in this scenario.
          Hide
          Bryan Pendleton added a comment -

          For those of us vicariously following along, where is this code?

          Show
          Bryan Pendleton added a comment - For those of us vicariously following along, where is this code?
          Hide
          Knut Anders Hatlen added a comment -

          The code is in Deadlock.handle():

          // See if the checker is in the deadlock and we
          // already picked as a victim
          if ((checker.equals(space)) && (deadlockWake == Constants.WAITING_LOCK_DEADLOCK))

          { victim = checker; break; }

          It never kicks in, and instead it goes further down in the method and wakes another victim:

          ActiveLock victimLock = (ActiveLock) waiters.get(victim);

          victimLock.wakeUp(Constants.WAITING_LOCK_DEADLOCK);

          The new victim wakes up from it's waiting state in ActiveLock.waitForGrant()/ConcurrentLockSet.lockObject(), calls checkDeadlock() and ends up Deadlock.handle() again.

          I think the problem may be caused by the following piece of code in Deadlock.look():

          } else

          { // simply waiting on another waiter space = waitingLock.getCompatabilitySpace(); }

          As far as I can see, this code doesn't make any sense. space will already have the same value as waitingLock.getCompatabilitySpace(), so the operation is actually a no-op. (waitingLock is obtained by calling waiters.get(space), and the waiters Map is built up by (waitingLock.getCompatabilitySpace(), waitingLock) value pairs, see LockControl.addWaiters().) Furthermore, this leads to "space" being considered twice in a row by the deadlock detection, so that it thinks that the transaction owning that compatibility space is waiting for one of its own locks. It therefore detects the deadlock prematurely, and before it has seen all transactions involved in it, and incorrectly concludes that the original victim wasn't involved.

          By changing that last piece of code from a no-op to actually moving one step ahead in the wait graph, the repro does fail with a deadlock error. That is, change the assignment to:

          space = ((ActiveLock) waitOn).getCompatabilitySpace();

          I tried running the regression tests with that change, and they all passed. I do find the deadlock detection code a bit hard to follow, so I'm not totally convinced this is the right change.

          Show
          Knut Anders Hatlen added a comment - The code is in Deadlock.handle(): // See if the checker is in the deadlock and we // already picked as a victim if ((checker.equals(space)) && (deadlockWake == Constants.WAITING_LOCK_DEADLOCK)) { victim = checker; break; } It never kicks in, and instead it goes further down in the method and wakes another victim: ActiveLock victimLock = (ActiveLock) waiters.get(victim); victimLock.wakeUp(Constants.WAITING_LOCK_DEADLOCK); The new victim wakes up from it's waiting state in ActiveLock.waitForGrant()/ConcurrentLockSet.lockObject(), calls checkDeadlock() and ends up Deadlock.handle() again. I think the problem may be caused by the following piece of code in Deadlock.look(): } else { // simply waiting on another waiter space = waitingLock.getCompatabilitySpace(); } As far as I can see, this code doesn't make any sense. space will already have the same value as waitingLock.getCompatabilitySpace(), so the operation is actually a no-op. (waitingLock is obtained by calling waiters.get(space), and the waiters Map is built up by (waitingLock.getCompatabilitySpace(), waitingLock) value pairs, see LockControl.addWaiters().) Furthermore, this leads to "space" being considered twice in a row by the deadlock detection, so that it thinks that the transaction owning that compatibility space is waiting for one of its own locks. It therefore detects the deadlock prematurely, and before it has seen all transactions involved in it, and incorrectly concludes that the original victim wasn't involved. By changing that last piece of code from a no-op to actually moving one step ahead in the wait graph, the repro does fail with a deadlock error. That is, change the assignment to: space = ((ActiveLock) waitOn).getCompatabilitySpace(); I tried running the regression tests with that change, and they all passed. I do find the deadlock detection code a bit hard to follow, so I'm not totally convinced this is the right change.
          Hide
          Knut Anders Hatlen added a comment -

          Here's the patch I tested.

          Show
          Knut Anders Hatlen added a comment - Here's the patch I tested.
          Hide
          Mike Matrigali added a comment -

          I have stared at this code before and also find it hard to understand. There are hardly any tests in the code for deadlocks so passing does not mean much.
          A long time ago I looked at DERBY-3980 and did not get very far, I wonder if it related. your comments there and here seem similar.

          Show
          Mike Matrigali added a comment - I have stared at this code before and also find it hard to understand. There are hardly any tests in the code for deadlocks so passing does not mean much. A long time ago I looked at DERBY-3980 and did not get very far, I wonder if it related. your comments there and here seem similar.
          Hide
          Knut Anders Hatlen added a comment -

          With the patch applied, the repro attached to DERBY-3980 indeed fails with the expected deadlock exception, so those two issues may be related.

          Show
          Knut Anders Hatlen added a comment - With the patch applied, the repro attached to DERBY-3980 indeed fails with the expected deadlock exception, so those two issues may be related.
          Hide
          Knut Anders Hatlen added a comment -

          When running the repro attached here, I do get a deadlock with the patch. However, typically three or four of the six threads get a deadlock exception, whereas it should be enough with one victim. Not sure if that's something caused by my change, or if it's a general problem with our algorithm that it just picks one random transaction that happens to be waiting for a lock in the cycle, even if the transaction isn't necessarily part of the cycle itself.

          Show
          Knut Anders Hatlen added a comment - When running the repro attached here, I do get a deadlock with the patch. However, typically three or four of the six threads get a deadlock exception, whereas it should be enough with one victim. Not sure if that's something caused by my change, or if it's a general problem with our algorithm that it just picks one random transaction that happens to be waiting for a lock in the cycle, even if the transaction isn't necessarily part of the cycle itself.
          Hide
          Knut Anders Hatlen added a comment -

          To illustrate the point in my previous comment, here's the first deadlock error when running the attached repro:

          A lock could not be obtained due to a deadlock, cycle of locks and waiters is:
          Lock : ROW, T, (1,7)
          Waiting XID :

          {182, S}

          , APP, select x from t where x = 1
          Granted XID :

          {162, X}


          Lock : ROW, T, (1,8)
          Waiting XID :

          {162, S}

          , APP, select x from t where x = 2
          Granted XID :

          {166, X}


          Lock : ROW, T, (1,9)
          Waiting XID :

          {166, S}

          , APP, select x from t where x = 3
          Granted XID :

          {170, X}


          Lock : ROW, T, (1,7)
          Waiting XID :

          {170, S}

          , APP, select x from t where x = 1
          . The selected victim is XID : 182.

          No one is waiting for tx 182, so aborting it doesn't really resolve the deadlock, and therefore other transactions will have to be aborted later. Also note that the data for row (1,7) has been split in two in the output.

          Show
          Knut Anders Hatlen added a comment - To illustrate the point in my previous comment, here's the first deadlock error when running the attached repro: A lock could not be obtained due to a deadlock, cycle of locks and waiters is: Lock : ROW, T, (1,7) Waiting XID : {182, S} , APP, select x from t where x = 1 Granted XID : {162, X} Lock : ROW, T, (1,8) Waiting XID : {162, S} , APP, select x from t where x = 2 Granted XID : {166, X} Lock : ROW, T, (1,9) Waiting XID : {166, S} , APP, select x from t where x = 3 Granted XID : {170, X} Lock : ROW, T, (1,7) Waiting XID : {170, S} , APP, select x from t where x = 1 . The selected victim is XID : 182. No one is waiting for tx 182, so aborting it doesn't really resolve the deadlock, and therefore other transactions will have to be aborted later. Also note that the data for row (1,7) has been split in two in the output.
          Hide
          Knut Anders Hatlen added a comment -

          It looks to me as if the problem is that the deadlock detection gets confused by the threads that wait for shared locks on the same rows as the transactions involved in the deadlock are waiting for shared locks on.

          Those transactions are not really involved in the deadlock. But because they started waiting for the locks earlier than the threads actually involved in the deadlock, they are first in the queue of waiters. The deadlock detection algorithm incorrectly assumes that these transactions are blocking for the transactions involved in the deadlock.

          The assumption is incorrect because they are all waiting for shared locks, so once the first waiter in the queue gets it, the second waiter will also get it. This means that the second waiter isn't waiting for the first waiter here, but they're both waiting for some other lock that is blocking them both.

          The attached modified patch (1b) changes the algorithm so that it skips past chains of waiters that have compatible lock requests. That made the repro fail with exactly one deadlock exception.

          Show
          Knut Anders Hatlen added a comment - It looks to me as if the problem is that the deadlock detection gets confused by the threads that wait for shared locks on the same rows as the transactions involved in the deadlock are waiting for shared locks on. Those transactions are not really involved in the deadlock. But because they started waiting for the locks earlier than the threads actually involved in the deadlock, they are first in the queue of waiters. The deadlock detection algorithm incorrectly assumes that these transactions are blocking for the transactions involved in the deadlock. The assumption is incorrect because they are all waiting for shared locks, so once the first waiter in the queue gets it, the second waiter will also get it. This means that the second waiter isn't waiting for the first waiter here, but they're both waiting for some other lock that is blocking them both. The attached modified patch (1b) changes the algorithm so that it skips past chains of waiters that have compatible lock requests. That made the repro fail with exactly one deadlock exception.
          Hide
          Knut Anders Hatlen added a comment -

          The deadlock reported with the 1b patch looks like this:

          A lock could not be obtained due to a deadlock, cycle of locks and waiters is:
          Lock : ROW, T, (1,8)
          Waiting XID :

          {162, S}

          , APP, select x from t where x = 2
          Granted XID :

          {166, X}


          Lock : ROW, T, (1,9)
          Waiting XID :

          {166, S}

          , APP, select x from t where x = 3
          Granted XID :

          {170, X}


          Lock : ROW, T, (1,7)
          Waiting XID :

          {170, S}

          , APP, select x from t where x = 1
          Granted XID :

          {162, X}


          . The selected victim is XID : 162.

          This looks more reasonable than what the 1a patch reported, I think.

          Show
          Knut Anders Hatlen added a comment - The deadlock reported with the 1b patch looks like this: A lock could not be obtained due to a deadlock, cycle of locks and waiters is: Lock : ROW, T, (1,8) Waiting XID : {162, S} , APP, select x from t where x = 2 Granted XID : {166, X} Lock : ROW, T, (1,9) Waiting XID : {166, S} , APP, select x from t where x = 3 Granted XID : {170, X} Lock : ROW, T, (1,7) Waiting XID : {170, S} , APP, select x from t where x = 1 Granted XID : {162, X} . The selected victim is XID : 162. This looks more reasonable than what the 1a patch reported, I think.
          Hide
          Bryan Pendleton added a comment -

          Your analysis makes a lot of sense to me. Thank you for taking the time to
          walk us through the code and your thought process, it has been very illuminating.

          I think you have identified a real problem and your approach is good.

          Show
          Bryan Pendleton added a comment - Your analysis makes a lot of sense to me. Thank you for taking the time to walk us through the code and your thought process, it has been very illuminating. I think you have identified a real problem and your approach is good.
          Hide
          Kathey Marsden added a comment -

          Thank you Knut for working on this issue. I stared at the deadlock code a very long time n the context of DERBY-3980 before giving up and so am thrilled to see you are able to get your head around it. I want to ask a favor while you are in there. If there is any insight in the form of comments that you can add to the deadlock code, I think that would be extremely educational and helpful to anyone that has to visit the deadlock code again.

          Show
          Kathey Marsden added a comment - Thank you Knut for working on this issue. I stared at the deadlock code a very long time n the context of DERBY-3980 before giving up and so am thrilled to see you are able to get your head around it. I want to ask a favor while you are in there. If there is any insight in the form of comments that you can add to the deadlock code, I think that would be extremely educational and helpful to anyone that has to visit the deadlock code again.
          Hide
          Knut Anders Hatlen added a comment -

          Thanks for the feedback, Bryan and Kathey. I'll polish the patch and incorporate the repros attached to this issue and DERBY-3980 into a JUnit test that can run as part of the regression test suite.

          Since we don't know for sure that the repro I attached reproduces the exact same problem as originally reported, we should probably check in the fix under DERBY-3980 and mark that issue as resolved. When Karl gets a chance to verify that this also fixes the problem he saw, we can close this issue as a duplicate of DERBY-3980.

          And, yes, I'll annotate the deadlock detection code with comments reflecting my understanding of what it does and how it should work. Good idea. Thanks!

          Show
          Knut Anders Hatlen added a comment - Thanks for the feedback, Bryan and Kathey. I'll polish the patch and incorporate the repros attached to this issue and DERBY-3980 into a JUnit test that can run as part of the regression test suite. Since we don't know for sure that the repro I attached reproduces the exact same problem as originally reported, we should probably check in the fix under DERBY-3980 and mark that issue as resolved. When Karl gets a chance to verify that this also fixes the problem he saw, we can close this issue as a duplicate of DERBY-3980 . And, yes, I'll annotate the deadlock detection code with comments reflecting my understanding of what it does and how it should work. Good idea. Thanks!
          Hide
          Karl Wright added a comment -

          I'll keep my eyes open for the final patch, and be ready to try it out.

          Show
          Karl Wright added a comment - I'll keep my eyes open for the final patch, and be ready to try it out.
          Hide
          Knut Anders Hatlen added a comment -

          Hi Karl,

          I've posted a patch over at DERBY-3980 (derby-3980-1a.diff) and also committed it to the subversion repository. It would be great if you could test if the fix also solves the problem you're seeing.

          Thanks.

          Show
          Knut Anders Hatlen added a comment - Hi Karl, I've posted a patch over at DERBY-3980 (derby-3980-1a.diff) and also committed it to the subversion repository. It would be great if you could test if the fix also solves the problem you're seeing. Thanks.
          Hide
          Karl Wright added a comment -

          I built Derby trunk and checked it in, and notified the field that a fix may possibly be had for the problem they've been seeing. I expect they will respond in about 12 hours, and then I will let you know.

          Show
          Karl Wright added a comment - I built Derby trunk and checked it in, and notified the field that a fix may possibly be had for the problem they've been seeing. I expect they will respond in about 12 hours, and then I will let you know.
          Hide
          Karl Wright added a comment -

          The tests at the client site have passed, so the fix is working.

          Show
          Karl Wright added a comment - The tests at the client site have passed, so the fix is working.
          Hide
          Karl Wright added a comment -

          Hmm. I merged the fix into 10.7.1.1, and ran a test, and got a deadlock hang again. Strange.
          I'm going to have to try the same case with a straight trunk build just to be sure I didn't mess anything up.

          Show
          Karl Wright added a comment - Hmm. I merged the fix into 10.7.1.1, and ran a test, and got a deadlock hang again. Strange. I'm going to have to try the same case with a straight trunk build just to be sure I didn't mess anything up.
          Hide
          Karl Wright added a comment -

          A trunk build does the same thing on my test case. I got a thread dump when I thought it had locked itself up. There were a whole pile of threads all waiting for grants, and then there was this thread:

          Thread-24341" daemon prio=6 tid=0x05934400 nid=0x1cd4 runnable [0x0816e000]
          java.lang.Thread.State: RUNNABLE
          at java.util.Hashtable.get(Hashtable.java:333)

          • locked <0x29280528> (a java.util.Hashtable)
            at org.apache.derby.shared.common.sanity.SanityManager.DEBUG_ON(SanityManager.java:229)
            at org.apache.derby.impl.services.locks.AbstractPool.lockObject(AbstractPool.java:101)
            at org.apache.derby.impl.store.raw.xact.RowLocking2.lockRecordForRead(RowLocking2.java:165)
            at org.apache.derby.impl.store.access.heap.HeapController.lockRow(HeapController.java:520)
            at org.apache.derby.impl.store.access.heap.HeapController.lockRow(HeapController.java:638)
            at org.apache.derby.impl.store.access.btree.index.B2IRowLocking3.lockRowOnPage(B2IRowLocking3.java:280)
            at org.apache.derby.impl.store.access.btree.index.B2IRowLocking3._lockScanRow(B2IRowLocking3.java:599)
            at org.apache.derby.impl.store.access.btree.index.B2IRowLockingRR.lockScanRow(B2IRowLockingRR.java:105)
            at org.apache.derby.impl.store.access.btree.BTreeForwardScan.fetchRows(BTreeForwardScan.java:305)
            at org.apache.derby.impl.store.access.btree.BTreeScan.fetchNext(BTreeScan.java:1599)
            at org.apache.derby.impl.sql.execute.TableScanResultSet.getNextRowCore(TableScanResultSet.java:577)
            at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.getNextRowCore(ProjectRestrictResultSet.java:261)
            at org.apache.derby.impl.sql.execute.JoinResultSet.reopenCore(JoinResultSet.java:181)
            at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.reopenCore(ProjectRestrictResultSet.java:218)
            at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.reopenCore(ProjectRestrictResultSet.java:218)
            at org.apache.derby.impl.sql.execute.AnyResultSet.reopenCore(AnyResultSet.java:139)
            at org.apache.derby.impl.sql.execute.AnyResultSet.openCore(AnyResultSet.java:103)
            at org.apache.derby.exe.ac74f89535x012exc0d8x72ccx000000d1a4e011.g0(Unknown Source)
            at org.apache.derby.exe.ac74f89535x012exc0d8x72ccx000000d1a4e011.e1(Unknown Source)
            at org.apache.derby.impl.services.reflect.DirectCall.invoke(ReflectGeneratedClass.java:141)
            at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.getNextRowCore(ProjectRestrictResultSet.java:273)
            at org.apache.derby.impl.sql.execute.NormalizeResultSet.getNextRowCore(NormalizeResultSet.java:185)
            at org.apache.derby.impl.sql.execute.DMLWriteResultSet.getNextRowCore(DMLWriteResultSet.java:127)
            at org.apache.derby.impl.sql.execute.UpdateResultSet.collectAffectedRows(UpdateResultSet.java:440)
            at org.apache.derby.impl.sql.execute.UpdateResultSet.open(UpdateResultSet.java:264)
            at org.apache.derby.impl.sql.GenericPreparedStatement.executeStmt(GenericPreparedStatement.java:436)
            at org.apache.derby.impl.sql.GenericPreparedStatement.execute(GenericPreparedStatement.java:317)
            at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(EmbedStatement.java:1242)
          • locked <0x29b139f0> (a org.apache.derby.impl.jdbc.EmbedConnection40)
            at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeStatement(EmbedPreparedStatement.java:1686)
            at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeUpdate(EmbedPreparedStatement.java:308)
            at org.apache.manifoldcf.core.database.Database.execute(Database.java:606)
            at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:421)

          A thread dump a couple of minutes later was identical, except for that one thread, which now looked like this:

          "Thread-24341" daemon prio=6 tid=0x05934400 nid=0x1cd4 runnable [0x0816e000]
          java.lang.Thread.State: RUNNABLE
          at java.lang.Object.hashCode(Native Method)
          at org.apache.derby.impl.services.locks.Lock.hashCode(Lock.java:145)
          at java.util.HashMap.removeEntryForKey(HashMap.java:548)
          at java.util.HashMap.remove(HashMap.java:538)
          at org.apache.derby.impl.services.locks.ConcurrentLockSet.unlockReference(ConcurrentLockSet.java:799)
          at org.apache.derby.impl.services.locks.LockSpace.unlockReference(LockSpace.java:276)

          • locked <0x29b14878> (a org.apache.derby.impl.services.locks.LockSpace)
            at org.apache.derby.impl.services.locks.AbstractPool.unlock(AbstractPool.java:181)
            at org.apache.derby.impl.store.raw.xact.RowLocking2.unlockRecordAfterRead(RowLocking2.java:185)
            at org.apache.derby.impl.store.access.heap.HeapController.unlockRowAfterRead(HeapController.java:668)
            at org.apache.derby.impl.store.access.btree.index.B2IRowLocking2.unlockScanRecordAfterRead(B2IRowLocking2.java:96)
            at org.apache.derby.impl.store.access.btree.BTreeForwardScan.fetchRows(BTreeForwardScan.java:221)
            at org.apache.derby.impl.store.access.btree.BTreeScan.fetchNext(BTreeScan.java:1599)
            at org.apache.derby.impl.sql.execute.TableScanResultSet.getNextRowCore(TableScanResultSet.java:577)
            at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.getNextRowCore(ProjectRestrictResultSet.java:261)
            at org.apache.derby.impl.sql.execute.JoinResultSet.reopenCore(JoinResultSet.java:181)
            at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.reopenCore(ProjectRestrictResultSet.java:218)
            at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.reopenCore(ProjectRestrictResultSet.java:218)
            at org.apache.derby.impl.sql.execute.AnyResultSet.reopenCore(AnyResultSet.java:139)
            at org.apache.derby.impl.sql.execute.AnyResultSet.openCore(AnyResultSet.java:103)
            at org.apache.derby.exe.ac74f89535x012exc0d8x72ccx000000d1a4e011.g0(Unknown Source)
            at org.apache.derby.exe.ac74f89535x012exc0d8x72ccx000000d1a4e011.e1(Unknown Source)
            at org.apache.derby.impl.services.reflect.DirectCall.invoke(ReflectGeneratedClass.java:141)
            at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.getNextRowCore(ProjectRestrictResultSet.java:273)
            at org.apache.derby.impl.sql.execute.NormalizeResultSet.getNextRowCore(NormalizeResultSet.java:185)
            at org.apache.derby.impl.sql.execute.DMLWriteResultSet.getNextRowCore(DMLWriteResultSet.java:127)
            at org.apache.derby.impl.sql.execute.UpdateResultSet.collectAffectedRows(UpdateResultSet.java:440)
            at org.apache.derby.impl.sql.execute.UpdateResultSet.open(UpdateResultSet.java:264)
            at org.apache.derby.impl.sql.GenericPreparedStatement.executeStmt(GenericPreparedStatement.java:436)
            at org.apache.derby.impl.sql.GenericPreparedStatement.execute(GenericPreparedStatement.java:317)
            at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(EmbedStatement.java:1242)
          • locked <0x29b139f0> (a org.apache.derby.impl.jdbc.EmbedConnection40)
            at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeStatement(EmbedPreparedStatement.java:1686)
            at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeUpdate(EmbedPreparedStatement.java:308)
            at org.apache.manifoldcf.core.database.Database.execute(Database.java:606)
            at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:421)

          This could just be a very-long-running query, blocking lots of other stuff. I'll have to run with the standard tricks to see what the locks wind up looking like. Stay tuned.

          Show
          Karl Wright added a comment - A trunk build does the same thing on my test case. I got a thread dump when I thought it had locked itself up. There were a whole pile of threads all waiting for grants, and then there was this thread: Thread-24341" daemon prio=6 tid=0x05934400 nid=0x1cd4 runnable [0x0816e000] java.lang.Thread.State: RUNNABLE at java.util.Hashtable.get(Hashtable.java:333) locked <0x29280528> (a java.util.Hashtable) at org.apache.derby.shared.common.sanity.SanityManager.DEBUG_ON(SanityManager.java:229) at org.apache.derby.impl.services.locks.AbstractPool.lockObject(AbstractPool.java:101) at org.apache.derby.impl.store.raw.xact.RowLocking2.lockRecordForRead(RowLocking2.java:165) at org.apache.derby.impl.store.access.heap.HeapController.lockRow(HeapController.java:520) at org.apache.derby.impl.store.access.heap.HeapController.lockRow(HeapController.java:638) at org.apache.derby.impl.store.access.btree.index.B2IRowLocking3.lockRowOnPage(B2IRowLocking3.java:280) at org.apache.derby.impl.store.access.btree.index.B2IRowLocking3._lockScanRow(B2IRowLocking3.java:599) at org.apache.derby.impl.store.access.btree.index.B2IRowLockingRR.lockScanRow(B2IRowLockingRR.java:105) at org.apache.derby.impl.store.access.btree.BTreeForwardScan.fetchRows(BTreeForwardScan.java:305) at org.apache.derby.impl.store.access.btree.BTreeScan.fetchNext(BTreeScan.java:1599) at org.apache.derby.impl.sql.execute.TableScanResultSet.getNextRowCore(TableScanResultSet.java:577) at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.getNextRowCore(ProjectRestrictResultSet.java:261) at org.apache.derby.impl.sql.execute.JoinResultSet.reopenCore(JoinResultSet.java:181) at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.reopenCore(ProjectRestrictResultSet.java:218) at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.reopenCore(ProjectRestrictResultSet.java:218) at org.apache.derby.impl.sql.execute.AnyResultSet.reopenCore(AnyResultSet.java:139) at org.apache.derby.impl.sql.execute.AnyResultSet.openCore(AnyResultSet.java:103) at org.apache.derby.exe.ac74f89535x012exc0d8x72ccx000000d1a4e011.g0(Unknown Source) at org.apache.derby.exe.ac74f89535x012exc0d8x72ccx000000d1a4e011.e1(Unknown Source) at org.apache.derby.impl.services.reflect.DirectCall.invoke(ReflectGeneratedClass.java:141) at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.getNextRowCore(ProjectRestrictResultSet.java:273) at org.apache.derby.impl.sql.execute.NormalizeResultSet.getNextRowCore(NormalizeResultSet.java:185) at org.apache.derby.impl.sql.execute.DMLWriteResultSet.getNextRowCore(DMLWriteResultSet.java:127) at org.apache.derby.impl.sql.execute.UpdateResultSet.collectAffectedRows(UpdateResultSet.java:440) at org.apache.derby.impl.sql.execute.UpdateResultSet.open(UpdateResultSet.java:264) at org.apache.derby.impl.sql.GenericPreparedStatement.executeStmt(GenericPreparedStatement.java:436) at org.apache.derby.impl.sql.GenericPreparedStatement.execute(GenericPreparedStatement.java:317) at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(EmbedStatement.java:1242) locked <0x29b139f0> (a org.apache.derby.impl.jdbc.EmbedConnection40) at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeStatement(EmbedPreparedStatement.java:1686) at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeUpdate(EmbedPreparedStatement.java:308) at org.apache.manifoldcf.core.database.Database.execute(Database.java:606) at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:421) A thread dump a couple of minutes later was identical, except for that one thread, which now looked like this: "Thread-24341" daemon prio=6 tid=0x05934400 nid=0x1cd4 runnable [0x0816e000] java.lang.Thread.State: RUNNABLE at java.lang.Object.hashCode(Native Method) at org.apache.derby.impl.services.locks.Lock.hashCode(Lock.java:145) at java.util.HashMap.removeEntryForKey(HashMap.java:548) at java.util.HashMap.remove(HashMap.java:538) at org.apache.derby.impl.services.locks.ConcurrentLockSet.unlockReference(ConcurrentLockSet.java:799) at org.apache.derby.impl.services.locks.LockSpace.unlockReference(LockSpace.java:276) locked <0x29b14878> (a org.apache.derby.impl.services.locks.LockSpace) at org.apache.derby.impl.services.locks.AbstractPool.unlock(AbstractPool.java:181) at org.apache.derby.impl.store.raw.xact.RowLocking2.unlockRecordAfterRead(RowLocking2.java:185) at org.apache.derby.impl.store.access.heap.HeapController.unlockRowAfterRead(HeapController.java:668) at org.apache.derby.impl.store.access.btree.index.B2IRowLocking2.unlockScanRecordAfterRead(B2IRowLocking2.java:96) at org.apache.derby.impl.store.access.btree.BTreeForwardScan.fetchRows(BTreeForwardScan.java:221) at org.apache.derby.impl.store.access.btree.BTreeScan.fetchNext(BTreeScan.java:1599) at org.apache.derby.impl.sql.execute.TableScanResultSet.getNextRowCore(TableScanResultSet.java:577) at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.getNextRowCore(ProjectRestrictResultSet.java:261) at org.apache.derby.impl.sql.execute.JoinResultSet.reopenCore(JoinResultSet.java:181) at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.reopenCore(ProjectRestrictResultSet.java:218) at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.reopenCore(ProjectRestrictResultSet.java:218) at org.apache.derby.impl.sql.execute.AnyResultSet.reopenCore(AnyResultSet.java:139) at org.apache.derby.impl.sql.execute.AnyResultSet.openCore(AnyResultSet.java:103) at org.apache.derby.exe.ac74f89535x012exc0d8x72ccx000000d1a4e011.g0(Unknown Source) at org.apache.derby.exe.ac74f89535x012exc0d8x72ccx000000d1a4e011.e1(Unknown Source) at org.apache.derby.impl.services.reflect.DirectCall.invoke(ReflectGeneratedClass.java:141) at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.getNextRowCore(ProjectRestrictResultSet.java:273) at org.apache.derby.impl.sql.execute.NormalizeResultSet.getNextRowCore(NormalizeResultSet.java:185) at org.apache.derby.impl.sql.execute.DMLWriteResultSet.getNextRowCore(DMLWriteResultSet.java:127) at org.apache.derby.impl.sql.execute.UpdateResultSet.collectAffectedRows(UpdateResultSet.java:440) at org.apache.derby.impl.sql.execute.UpdateResultSet.open(UpdateResultSet.java:264) at org.apache.derby.impl.sql.GenericPreparedStatement.executeStmt(GenericPreparedStatement.java:436) at org.apache.derby.impl.sql.GenericPreparedStatement.execute(GenericPreparedStatement.java:317) at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(EmbedStatement.java:1242) locked <0x29b139f0> (a org.apache.derby.impl.jdbc.EmbedConnection40) at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeStatement(EmbedPreparedStatement.java:1686) at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeUpdate(EmbedPreparedStatement.java:308) at org.apache.manifoldcf.core.database.Database.execute(Database.java:606) at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:421) This could just be a very-long-running query, blocking lots of other stuff. I'll have to run with the standard tricks to see what the locks wind up looking like. Stay tuned.
          Hide
          Karl Wright added a comment -

          Doesn't act like a very long running query; just doesn't make any progress at all, but nevertheless still uses a CPU.

          Here's the locks table now, apologies for the formatting:

          ij> select * from syscs_diag.lock_table;
          XID |TYPE |MODE|TABLENAME

          LOCK
          NAME
          STATE TABLETYPE LOCK& INDEXNAME

          --------------------------------------------------------------------------------
          --------------------------------------------------------------------------------
          --------------------------------------------------------------------------------
          --------------------------------------------------------------------------------
          -------
          43048 |ROW |S |INTRINSICLINK

          (409
          ,16)
          GRANT T 1 NULL

          43619 |ROW |X |JOBQUEUE

          (65,
          48)
          GRANT T 1 NULL

          43704 |ROW |S |JOBQUEUE

          (65,
          48)
          WAIT T 0 NULL

          43619 |ROW |X |PREREQEVENTS

          (288
          ,60)
          GRANT T 1 NULL

          43048 |ROW |S |INTRINSICLINK

          (409
          ,24)
          GRANT T 1 NULL

          43619 |ROW |X |PREREQEVENTS

          (288
          ,61)
          GRANT T 1 NULL

          43048 |ROW |S |INTRINSICLINK

          (409
          ,23)
          GRANT T 1 NULL

          43619 |ROW |X |PREREQEVENTS

          (288
          ,58)
          GRANT T 1 NULL

          43629 |ROW |X |JOBQUEUE

          (36,
          73)
          GRANT T 11 NULL

          43619 |ROW |X |PREREQEVENTS

          (288
          ,59)
          GRANT T 1 NULL

          43619 |ROW |X |PREREQEVENTS

          (288
          ,62)
          GRANT T 1 NULL

          43619 |ROW |X |PREREQEVENTS

          (288
          ,63)
          GRANT T 1 NULL

          43619 |ROW |X |JOBQUEUE

          (65,
          60)
          GRANT T 1 NULL

          43048 |TABLE|IS |INTRINSICLINK

          Tabl
          elock
          GRANT T 1 NULL

          43619 |TABLE|IX |INTRINSICLINK

          Tabl
          elock
          GRANT T 3 NULL

          43619 |ROW |X |PREREQEVENTS

          (288
          ,53)
          GRANT T 1 NULL

          43048 |ROW |S |INTRINSICLINK

          (409
          ,32)
          GRANT T 1 NULL

          43619 |ROW |X |JOBQUEUE

          (65,
          59)
          GRANT T 1 NULL

          43619 |ROW |X |PREREQEVENTS

          (288
          ,52)
          GRANT T 1 NULL

          43619 |ROW |X |PREREQEVENTS

          (288
          ,51)
          GRANT T 1 NULL

          43619 |ROW |X |JOBQUEUE

          (65,
          58)
          GRANT T 1 NULL

          43619 |ROW |X |PREREQEVENTS

          (288
          ,50)
          GRANT T 1 NULL

          43619 |ROW |X |JOBQUEUE

          (65,
          57)
          GRANT T 1 NULL

          43619 |ROW |X |PREREQEVENTS

          (288
          ,57)
          GRANT T 1 NULL

          43619 |ROW |X |PREREQEVENTS

          (288
          ,56)
          GRANT T 1 NULL

          43619 |ROW |X |PREREQEVENTS

          (288
          ,55)
          GRANT T 1 NULL

          43619 |ROW |X |PREREQEVENTS

          (288
          ,54)
          GRANT T 1 NULL

          43619 |ROW |X |JOBQUEUE

          (65,
          61)
          GRANT T 1 NULL

          43619 |ROW |X |JOBQUEUE

          (65,
          52)
          GRANT T 1 NULL

          43619 |ROW |X |JOBQUEUE

          (65,
          51)
          GRANT T 1 NULL

          43619 |ROW |X |JOBQUEUE

          (65,
          50)
          GRANT T 1 NULL

          43619 |ROW |X |JOBQUEUE

          (65,
          49)
          GRANT T 1 NULL

          43619 |ROW |X |PREREQEVENTS

          (288
          ,49)
          GRANT T 1 NULL

          43619 |ROW |X |JOBQUEUE

          (65,
          56)
          GRANT T 1 NULL

          43619 |ROW |X |PREREQEVENTS

          (288
          ,48)
          GRANT T 1 NULL

          43619 |ROW |X |JOBQUEUE

          (65,
          55)
          GRANT T 1 NULL

          43619 |ROW |X |JOBQUEUE

          (65,
          54)
          GRANT T 1 NULL

          43619 |ROW |X |JOBQUEUE

          (65,
          53)
          GRANT T 1 NULL

          43629 |ROW |X |PREREQEVENTS

          (160
          ,11)
          GRANT T 2 NULL

          43619 |ROW |X |PREREQEVENTS

          (289
          ,9)
          GRANT T 1 NULL

          43619 |ROW |X |PREREQEVENTS

          (289
          ,10)
          GRANT T 1 NULL

          43048 |TABLE|IX |HOPCOUNT

          Tabl
          elock
          GRANT T 2 NULL

          43619 |ROW |X |PREREQEVENTS

          (289
          ,11)
          GRANT T 1 NULL

          43629 |ROW |X |PREREQEVENTS

          (160
          ,12)
          GRANT T 2 NULL

          43619 |ROW |X |PREREQEVENTS

          (289
          ,12)
          GRANT T 1 NULL

          43619 |ROW |X |PREREQEVENTS

          (289
          ,13)
          GRANT T 1 NULL

          43619 |ROW |X |PREREQEVENTS

          (289
          ,14)
          GRANT T 1 NULL

          43619 |ROW |X |PREREQEVENTS

          (289
          ,15)
          GRANT T 1 NULL

          43619 |ROW |X |PREREQEVENTS

          (289
          ,16)
          GRANT T 1 NULL

          43619 |ROW |X |PREREQEVENTS

          (289
          ,6)
          GRANT T 1 NULL

          43619 |ROW |X |PREREQEVENTS

          (289
          ,7)
          GRANT T 1 NULL

          43619 |ROW |X |PREREQEVENTS

          (289
          ,8)
          GRANT T 1 NULL

          43048 |ROW |S |INTRINSICLINK

          (410
          ,21)
          GRANT T 1 NULL

          43048 |ROW |X |JOBQUEUE

          (46,
          42)
          GRANT T 11 NULL

          43127 |ROW |S |JOBQUEUE

          (46,
          42)
          WAIT T 0 NULL

          43258 |ROW |S |JOBQUEUE

          (46,
          42)
          WAIT T 0 NULL

          43268 |ROW |S |JOBQUEUE

          (46,
          42)
          WAIT T 0 NULL

          43283 |ROW |S |JOBQUEUE

          (46,
          42)
          WAIT T 0 NULL

          43284 |ROW |S |JOBQUEUE

          (46,
          42)
          WAIT T 0 NULL

          43290 |ROW |S |JOBQUEUE

          (46,
          42)
          WAIT T 0 NULL

          43542 |ROW |S |JOBQUEUE

          (46,
          42)
          WAIT T 0 NULL

          43629 |ROW |S |JOBQUEUE

          (46,
          42)
          WAIT T 0 NULL

          43048 |ROW |X |PREREQEVENTS

          (200
          ,70)
          GRANT T 2 NULL

          43048 |ROW |X |PREREQEVENTS

          (200
          ,71)
          GRANT T 2 NULL

          43048 |TABLE|IX |CARRYDOWN

          Tabl
          elock
          GRANT T 4 NULL

          43127 |TABLE|IS |CARRYDOWN

          Tabl
          elock
          GRANT T 2 NULL

          43258 |TABLE|IS |CARRYDOWN

          Tabl
          elock
          GRANT T 2 NULL

          43268 |TABLE|IS |CARRYDOWN

          Tabl
          elock
          GRANT T 2 NULL

          43283 |TABLE|IS |CARRYDOWN

          Tabl
          elock
          GRANT T 2 NULL

          43284 |TABLE|IS |CARRYDOWN

          Tabl
          elock
          GRANT T 2 NULL

          43290 |TABLE|IS |CARRYDOWN

          Tabl
          elock
          GRANT T 2 NULL

          43542 |TABLE|IS |CARRYDOWN

          Tabl
          elock
          GRANT T 2 NULL

          43629 |TABLE|IS |CARRYDOWN

          Tabl
          elock
          GRANT T 1 NULL

          43048 |TABLE|IX |PREREQEVENTS

          Tabl
          elock
          GRANT T 2 NULL

          43619 |TABLE|IX |PREREQEVENTS

          Tabl
          elock
          GRANT T 56 NULL

          43629 |TABLE|IX |PREREQEVENTS

          Tabl
          elock
          GRANT T 2 NULL

          43048 |ROW |S |INTRINSICLINK

          (411
          ,24)
          GRANT T 1 NULL

          43619 |ROW |X |INTRINSICLINK

          (411
          ,24)
          WAIT T 0 NULL

          43048 |ROW |U |HOPCOUNT

          (52,
          16)
          GRANT T 1 NULL

          43619 |ROW |X |PREREQEVENTS

          (289
          ,17)
          GRANT T 1 NULL

          43048 |ROW |S |INTRINSICLINK

          (408
          ,7)
          GRANT T 1 NULL

          43048 |ROW |S |INTRINSICLINK

          (411
          ,15)
          GRANT T 1 NULL

          43048 |TABLE|IX |JOBQUEUE

          Tabl
          elock
          GRANT T 11 NULL

          43127 |TABLE|IS |JOBQUEUE

          Tabl
          elock
          GRANT T 2 NULL

          43258 |TABLE|IS |JOBQUEUE

          Tabl
          elock
          GRANT T 2 NULL

          43268 |TABLE|IS |JOBQUEUE

          Tabl
          elock
          GRANT T 2 NULL

          43283 |TABLE|IS |JOBQUEUE

          Tabl
          elock
          GRANT T 2 NULL

          43284 |TABLE|IS |JOBQUEUE

          Tabl
          elock
          GRANT T 2 NULL

          43290 |TABLE|IS |JOBQUEUE

          Tabl
          elock
          GRANT T 2 NULL

          43542 |TABLE|IS |JOBQUEUE

          Tabl
          elock
          GRANT T 2 NULL

          43619 |TABLE|IX |JOBQUEUE

          Tabl
          elock
          GRANT T 174 NULL

          43629 |TABLE|IX |JOBQUEUE

          Tabl
          elock
          GRANT T 11 NULL

          43629 |TABLE|IS |JOBQUEUE

          Tabl
          elock
          GRANT T 2 NULL

          43704 |TABLE|IS |JOBQUEUE

          Tabl
          elock
          GRANT T 1 NULL

          94 rows selected

          Show
          Karl Wright added a comment - Doesn't act like a very long running query; just doesn't make any progress at all, but nevertheless still uses a CPU. Here's the locks table now, apologies for the formatting: ij> select * from syscs_diag.lock_table; XID |TYPE |MODE|TABLENAME LOCK NAME STATE TABLETYPE LOCK& INDEXNAME -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- ------- 43048 |ROW |S |INTRINSICLINK (409 ,16) GRANT T 1 NULL 43619 |ROW |X |JOBQUEUE (65, 48) GRANT T 1 NULL 43704 |ROW |S |JOBQUEUE (65, 48) WAIT T 0 NULL 43619 |ROW |X |PREREQEVENTS (288 ,60) GRANT T 1 NULL 43048 |ROW |S |INTRINSICLINK (409 ,24) GRANT T 1 NULL 43619 |ROW |X |PREREQEVENTS (288 ,61) GRANT T 1 NULL 43048 |ROW |S |INTRINSICLINK (409 ,23) GRANT T 1 NULL 43619 |ROW |X |PREREQEVENTS (288 ,58) GRANT T 1 NULL 43629 |ROW |X |JOBQUEUE (36, 73) GRANT T 11 NULL 43619 |ROW |X |PREREQEVENTS (288 ,59) GRANT T 1 NULL 43619 |ROW |X |PREREQEVENTS (288 ,62) GRANT T 1 NULL 43619 |ROW |X |PREREQEVENTS (288 ,63) GRANT T 1 NULL 43619 |ROW |X |JOBQUEUE (65, 60) GRANT T 1 NULL 43048 |TABLE|IS |INTRINSICLINK Tabl elock GRANT T 1 NULL 43619 |TABLE|IX |INTRINSICLINK Tabl elock GRANT T 3 NULL 43619 |ROW |X |PREREQEVENTS (288 ,53) GRANT T 1 NULL 43048 |ROW |S |INTRINSICLINK (409 ,32) GRANT T 1 NULL 43619 |ROW |X |JOBQUEUE (65, 59) GRANT T 1 NULL 43619 |ROW |X |PREREQEVENTS (288 ,52) GRANT T 1 NULL 43619 |ROW |X |PREREQEVENTS (288 ,51) GRANT T 1 NULL 43619 |ROW |X |JOBQUEUE (65, 58) GRANT T 1 NULL 43619 |ROW |X |PREREQEVENTS (288 ,50) GRANT T 1 NULL 43619 |ROW |X |JOBQUEUE (65, 57) GRANT T 1 NULL 43619 |ROW |X |PREREQEVENTS (288 ,57) GRANT T 1 NULL 43619 |ROW |X |PREREQEVENTS (288 ,56) GRANT T 1 NULL 43619 |ROW |X |PREREQEVENTS (288 ,55) GRANT T 1 NULL 43619 |ROW |X |PREREQEVENTS (288 ,54) GRANT T 1 NULL 43619 |ROW |X |JOBQUEUE (65, 61) GRANT T 1 NULL 43619 |ROW |X |JOBQUEUE (65, 52) GRANT T 1 NULL 43619 |ROW |X |JOBQUEUE (65, 51) GRANT T 1 NULL 43619 |ROW |X |JOBQUEUE (65, 50) GRANT T 1 NULL 43619 |ROW |X |JOBQUEUE (65, 49) GRANT T 1 NULL 43619 |ROW |X |PREREQEVENTS (288 ,49) GRANT T 1 NULL 43619 |ROW |X |JOBQUEUE (65, 56) GRANT T 1 NULL 43619 |ROW |X |PREREQEVENTS (288 ,48) GRANT T 1 NULL 43619 |ROW |X |JOBQUEUE (65, 55) GRANT T 1 NULL 43619 |ROW |X |JOBQUEUE (65, 54) GRANT T 1 NULL 43619 |ROW |X |JOBQUEUE (65, 53) GRANT T 1 NULL 43629 |ROW |X |PREREQEVENTS (160 ,11) GRANT T 2 NULL 43619 |ROW |X |PREREQEVENTS (289 ,9) GRANT T 1 NULL 43619 |ROW |X |PREREQEVENTS (289 ,10) GRANT T 1 NULL 43048 |TABLE|IX |HOPCOUNT Tabl elock GRANT T 2 NULL 43619 |ROW |X |PREREQEVENTS (289 ,11) GRANT T 1 NULL 43629 |ROW |X |PREREQEVENTS (160 ,12) GRANT T 2 NULL 43619 |ROW |X |PREREQEVENTS (289 ,12) GRANT T 1 NULL 43619 |ROW |X |PREREQEVENTS (289 ,13) GRANT T 1 NULL 43619 |ROW |X |PREREQEVENTS (289 ,14) GRANT T 1 NULL 43619 |ROW |X |PREREQEVENTS (289 ,15) GRANT T 1 NULL 43619 |ROW |X |PREREQEVENTS (289 ,16) GRANT T 1 NULL 43619 |ROW |X |PREREQEVENTS (289 ,6) GRANT T 1 NULL 43619 |ROW |X |PREREQEVENTS (289 ,7) GRANT T 1 NULL 43619 |ROW |X |PREREQEVENTS (289 ,8) GRANT T 1 NULL 43048 |ROW |S |INTRINSICLINK (410 ,21) GRANT T 1 NULL 43048 |ROW |X |JOBQUEUE (46, 42) GRANT T 11 NULL 43127 |ROW |S |JOBQUEUE (46, 42) WAIT T 0 NULL 43258 |ROW |S |JOBQUEUE (46, 42) WAIT T 0 NULL 43268 |ROW |S |JOBQUEUE (46, 42) WAIT T 0 NULL 43283 |ROW |S |JOBQUEUE (46, 42) WAIT T 0 NULL 43284 |ROW |S |JOBQUEUE (46, 42) WAIT T 0 NULL 43290 |ROW |S |JOBQUEUE (46, 42) WAIT T 0 NULL 43542 |ROW |S |JOBQUEUE (46, 42) WAIT T 0 NULL 43629 |ROW |S |JOBQUEUE (46, 42) WAIT T 0 NULL 43048 |ROW |X |PREREQEVENTS (200 ,70) GRANT T 2 NULL 43048 |ROW |X |PREREQEVENTS (200 ,71) GRANT T 2 NULL 43048 |TABLE|IX |CARRYDOWN Tabl elock GRANT T 4 NULL 43127 |TABLE|IS |CARRYDOWN Tabl elock GRANT T 2 NULL 43258 |TABLE|IS |CARRYDOWN Tabl elock GRANT T 2 NULL 43268 |TABLE|IS |CARRYDOWN Tabl elock GRANT T 2 NULL 43283 |TABLE|IS |CARRYDOWN Tabl elock GRANT T 2 NULL 43284 |TABLE|IS |CARRYDOWN Tabl elock GRANT T 2 NULL 43290 |TABLE|IS |CARRYDOWN Tabl elock GRANT T 2 NULL 43542 |TABLE|IS |CARRYDOWN Tabl elock GRANT T 2 NULL 43629 |TABLE|IS |CARRYDOWN Tabl elock GRANT T 1 NULL 43048 |TABLE|IX |PREREQEVENTS Tabl elock GRANT T 2 NULL 43619 |TABLE|IX |PREREQEVENTS Tabl elock GRANT T 56 NULL 43629 |TABLE|IX |PREREQEVENTS Tabl elock GRANT T 2 NULL 43048 |ROW |S |INTRINSICLINK (411 ,24) GRANT T 1 NULL 43619 |ROW |X |INTRINSICLINK (411 ,24) WAIT T 0 NULL 43048 |ROW |U |HOPCOUNT (52, 16) GRANT T 1 NULL 43619 |ROW |X |PREREQEVENTS (289 ,17) GRANT T 1 NULL 43048 |ROW |S |INTRINSICLINK (408 ,7) GRANT T 1 NULL 43048 |ROW |S |INTRINSICLINK (411 ,15) GRANT T 1 NULL 43048 |TABLE|IX |JOBQUEUE Tabl elock GRANT T 11 NULL 43127 |TABLE|IS |JOBQUEUE Tabl elock GRANT T 2 NULL 43258 |TABLE|IS |JOBQUEUE Tabl elock GRANT T 2 NULL 43268 |TABLE|IS |JOBQUEUE Tabl elock GRANT T 2 NULL 43283 |TABLE|IS |JOBQUEUE Tabl elock GRANT T 2 NULL 43284 |TABLE|IS |JOBQUEUE Tabl elock GRANT T 2 NULL 43290 |TABLE|IS |JOBQUEUE Tabl elock GRANT T 2 NULL 43542 |TABLE|IS |JOBQUEUE Tabl elock GRANT T 2 NULL 43619 |TABLE|IX |JOBQUEUE Tabl elock GRANT T 174 NULL 43629 |TABLE|IX |JOBQUEUE Tabl elock GRANT T 11 NULL 43629 |TABLE|IS |JOBQUEUE Tabl elock GRANT T 2 NULL 43704 |TABLE|IS |JOBQUEUE Tabl elock GRANT T 1 NULL 94 rows selected
          Hide
          Karl Wright added a comment -

          More research confirms absolutely no progress of any kind over tens of minutes. The ManifoldCF reports queries running longer than 60 seconds and there have been no such reports, which would have been expected if we were running progressively slower queries. The number of rows in the involved tables is not that high either ( about 64000 in the biggest, 12000 in the second biggest).
          The data is hard to explain WITHOUT deadlock because of the persistence of threads waiting on the database; there is no churn here. And yet if I read the lock dump there's no obvious deadlock, although I am not sure I'm reading it right. Anybody see anything that might be of use?

          By the way, this is a trunk build now.

          Show
          Karl Wright added a comment - More research confirms absolutely no progress of any kind over tens of minutes. The ManifoldCF reports queries running longer than 60 seconds and there have been no such reports, which would have been expected if we were running progressively slower queries. The number of rows in the involved tables is not that high either ( about 64000 in the biggest, 12000 in the second biggest). The data is hard to explain WITHOUT deadlock because of the persistence of threads waiting on the database; there is no churn here. And yet if I read the lock dump there's no obvious deadlock, although I am not sure I'm reading it right. Anybody see anything that might be of use? By the way, this is a trunk build now.
          Hide
          Karl Wright added a comment -

          Hey, it looks like it unwedged itself after some 30 minutes. ManifoldCF then complained about a lot of long-running queries. Some were long running because they were blocked, obviously, but at least one of them is the cause.

          So this is cleared up; it's not a deadlock apparently, just a bad plan.

          Found a query that took more than a minute (1533032 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE ((t0.jobid=? AND t0.childidhash=?)) AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.linktype=t0.linktype AND t1.jobid=t0.jobid AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash))]
          Parameter 0: 'D'
          Parameter 1: '-1'
          Parameter 2: '1300315252437'
          Parameter 3: 'F1C0F5CC2ED7A5D0B4AB58A8F202F025AA3BD133'
          Found a query that took more than a minute (1521069 ms): [UPDATE intrinsiclink SET isnew=? WHERE jobid=? AND linktype=? AND parentidhash=? AND childidhash=?]
          Parameter 0: 'E'
          Parameter 1: '1300315252437'
          Parameter 2: 'link'
          Parameter 3: '5AD2BBC595BA81D08B9C0FBF7C435692E255154B'
          Parameter 4: 'F75CCB0A5B55E281A506712D331CDD44F0D10F04'
          Found a query that took more than a minute (1531593 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid AND t1.isnew=?) AND t0.jobid=?]
          Parameter 0: 'F13204EB47B99ACE783AC62341763D4DEA64DAD3'
          Parameter 1: 'B'
          Parameter 2: '1300315252437'
          Found a query that took more than a minute (1535056 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid AND t1.isnew=?) AND t0.jobid=?]
          Parameter 0: 'C4FA1B2511EB16B47891A614290D7EFDC542EFB3'
          Parameter 1: 'B'
          Parameter 2: '1300315252437'
          Found a query that took more than a minute (1529958 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid AND t1.isnew=?) AND t0.jobid=?]
          Parameter 0: '01A734A295128E67F823F4371F084312EF9DBDFD'
          Parameter 1: 'B'
          Parameter 2: '1300315252437'
          Found a query that took more than a minute (1529928 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid AND t1.isnew=?) AND t0.jobid=?]
          Parameter 0: 'F27D1DB1C2E2927F7DD5BD7B8B613008CF0FEA52'
          Parameter 1: 'B'
          Parameter 2: '1300315252437'
          Found a query that took more than a minute (1530203 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid AND t1.isnew=?) AND t0.jobid=?]
          Parameter 0: 'F20EB42B4BDDBD5FE94C20873B5F06DC285F04D0'
          Parameter 1: 'B'
          Parameter 2: '1300315252437'
          Found a query that took more than a minute (1514744 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE t1.parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid) AND t0.jobid=?]
          Parameter 0: 'A36704A5DEB0591D5044DA8C0AF92997B8DCDA12'
          Parameter 1: '1300315252437'
          Found a query that took more than a minute (1526832 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid AND t1.isnew=?) AND t0.jobid=?]
          Found a query that took more than a minute (1524210 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid AND t1.isnew=?) AND t0.jobid=?]
          Parameter 0: 'F8848399A2AE8C7C613E55AAC80054ED6D4C4996'
          Parameter 1: 'B'
          Parameter 2: '1300315252437'
          Parameter 0: '037C78E60C534C68924C36E80D6A469B7371987C'
          Parameter 1: 'B'
          Parameter 2: '1300315252437'
          Found a query that took more than a minute (1627441 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE ((t0.jobid=? AND t0.childidhash=?)) AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.linktype=t0.linktype AND t1.jobid=t0.jobid AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND t1.isnew=?))]
          Found a query that took more than a minute (1627444 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE ((t0.jobid=? AND t0.childidhash=?)) AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.linktype=t0.linktype AND t1.jobid=t0.jobid AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash))]
          Parameter 0: 'D'
          Parameter 0: 'D'
          Parameter 1: '-1'
          Parameter 1: '-1'
          Parameter 2: '1300315252437'
          Parameter 2: '1300315252437'
          Parameter 3: '01A734A295128E67F823F4371F084312EF9DBDFD'
          Parameter 3: 'A36704A5DEB0591D5044DA8C0AF92997B8DCDA12'
          Parameter 4: 'B'
          Found a query that took more than a minute (1627445 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE ((t0.jobid=? AND t0.childidhash=?)) AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.linktype=t0.linktype AND t1.jobid=t0.jobid AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND t1.isnew=?))]
          Parameter 0: 'D'
          Parameter 1: '-1'
          Parameter 2: '1300315252437'
          Parameter 3: 'F8848399A2AE8C7C613E55AAC80054ED6D4C4996'
          Parameter 4: 'B'
          Found a query that took more than a minute (1627427 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE ((t0.jobid=? AND t0.childidhash=?)) AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.linktype=t0.linktype AND t1.jobid=t0.jobid AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND t1.isnew=?))]
          Parameter 0: 'D'
          Parameter 1: '-1'
          Parameter 2: '1300315252437'
          Parameter 3: 'F27D1DB1C2E2927F7DD5BD7B8B613008CF0FEA52'
          Parameter 4: 'B'
          Found a query that took more than a minute (1627444 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE ((t0.jobid=? AND t0.childidhash=?)) AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.linktype=t0.linktype AND t1.jobid=t0.jobid AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND t1.isnew=?))]
          Parameter 0: 'D'
          Parameter 1: '-1'
          Parameter 2: '1300315252437'
          Parameter 3: 'F13204EB47B99ACE783AC62341763D4DEA64DAD3'
          Parameter 4: 'B'
          Found a query that took more than a minute (1627445 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE ((t0.jobid=? AND t0.childidhash=?)) AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.linktype=t0.linktype AND t1.jobid=t0.jobid AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND t1.isnew=?))]
          Parameter 0: 'D'
          Parameter 1: '-1'
          Parameter 2: '1300315252437'
          Parameter 3: 'C4FA1B2511EB16B47891A614290D7EFDC542EFB3'
          Parameter 4: 'B'
          Found a query that took more than a minute (1627429 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE ((t0.jobid=? AND t0.childidhash=?)) AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.linktype=t0.linktype AND t1.jobid=t0.jobid AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND t1.isnew=?))]
          Parameter 0: 'D'
          Parameter 1: '-1'
          Parameter 2: '1300315252437'
          Parameter 3: '037C78E60C534C68924C36E80D6A469B7371987C'
          Parameter 4: 'B'
          Found a query that took more than a minute (3095263 ms): [SELECT jobid,CAST(COUNT(dochash) AS BIGINT) AS doccount FROM jobqueue t1 GROUP BY jobid]
          Found a query that took more than a minute (1622113 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid AND t1.isnew=?) AND t0.jobid=?]
          Parameter 0: '035F0D4F860F38F039F3C0D4D642D0CA2F83D349'
          Parameter 1: 'B'
          Parameter 2: '1300315252437'
          Found a query that took more than a minute (1627657 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid AND t1.isnew=?) AND t0.jobid=?]
          Parameter 0: 'F75CCB0A5B55E281A506712D331CDD44F0D10F04'
          Parameter 1: 'B'
          Parameter 2: '1300315252437'
          Found a query that took more than a minute (1622790 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid AND t1.isnew=?) AND t0.jobid=?]
          Parameter 0: '01595CFF50BEE46E068F5AEBB539483F73C672AC'
          Parameter 1: 'B'
          Parameter 2: '1300315252437'

          Show
          Karl Wright added a comment - Hey, it looks like it unwedged itself after some 30 minutes. ManifoldCF then complained about a lot of long-running queries. Some were long running because they were blocked, obviously, but at least one of them is the cause. So this is cleared up; it's not a deadlock apparently, just a bad plan. Found a query that took more than a minute (1533032 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE ((t0.jobid=? AND t0.childidhash=?)) AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.linktype=t0.linktype AND t1.jobid=t0.jobid AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash))] Parameter 0: 'D' Parameter 1: '-1' Parameter 2: '1300315252437' Parameter 3: 'F1C0F5CC2ED7A5D0B4AB58A8F202F025AA3BD133' Found a query that took more than a minute (1521069 ms): [UPDATE intrinsiclink SET isnew=? WHERE jobid=? AND linktype=? AND parentidhash=? AND childidhash=?] Parameter 0: 'E' Parameter 1: '1300315252437' Parameter 2: 'link' Parameter 3: '5AD2BBC595BA81D08B9C0FBF7C435692E255154B' Parameter 4: 'F75CCB0A5B55E281A506712D331CDD44F0D10F04' Found a query that took more than a minute (1531593 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid AND t1.isnew=?) AND t0.jobid=?] Parameter 0: 'F13204EB47B99ACE783AC62341763D4DEA64DAD3' Parameter 1: 'B' Parameter 2: '1300315252437' Found a query that took more than a minute (1535056 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid AND t1.isnew=?) AND t0.jobid=?] Parameter 0: 'C4FA1B2511EB16B47891A614290D7EFDC542EFB3' Parameter 1: 'B' Parameter 2: '1300315252437' Found a query that took more than a minute (1529958 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid AND t1.isnew=?) AND t0.jobid=?] Parameter 0: '01A734A295128E67F823F4371F084312EF9DBDFD' Parameter 1: 'B' Parameter 2: '1300315252437' Found a query that took more than a minute (1529928 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid AND t1.isnew=?) AND t0.jobid=?] Parameter 0: 'F27D1DB1C2E2927F7DD5BD7B8B613008CF0FEA52' Parameter 1: 'B' Parameter 2: '1300315252437' Found a query that took more than a minute (1530203 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid AND t1.isnew=?) AND t0.jobid=?] Parameter 0: 'F20EB42B4BDDBD5FE94C20873B5F06DC285F04D0' Parameter 1: 'B' Parameter 2: '1300315252437' Found a query that took more than a minute (1514744 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE t1.parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid) AND t0.jobid=?] Parameter 0: 'A36704A5DEB0591D5044DA8C0AF92997B8DCDA12' Parameter 1: '1300315252437' Found a query that took more than a minute (1526832 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid AND t1.isnew=?) AND t0.jobid=?] Found a query that took more than a minute (1524210 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid AND t1.isnew=?) AND t0.jobid=?] Parameter 0: 'F8848399A2AE8C7C613E55AAC80054ED6D4C4996' Parameter 1: 'B' Parameter 2: '1300315252437' Parameter 0: '037C78E60C534C68924C36E80D6A469B7371987C' Parameter 1: 'B' Parameter 2: '1300315252437' Found a query that took more than a minute (1627441 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE ((t0.jobid=? AND t0.childidhash=?)) AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.linktype=t0.linktype AND t1.jobid=t0.jobid AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND t1.isnew=?))] Found a query that took more than a minute (1627444 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE ((t0.jobid=? AND t0.childidhash=?)) AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.linktype=t0.linktype AND t1.jobid=t0.jobid AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash))] Parameter 0: 'D' Parameter 0: 'D' Parameter 1: '-1' Parameter 1: '-1' Parameter 2: '1300315252437' Parameter 2: '1300315252437' Parameter 3: '01A734A295128E67F823F4371F084312EF9DBDFD' Parameter 3: 'A36704A5DEB0591D5044DA8C0AF92997B8DCDA12' Parameter 4: 'B' Found a query that took more than a minute (1627445 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE ((t0.jobid=? AND t0.childidhash=?)) AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.linktype=t0.linktype AND t1.jobid=t0.jobid AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND t1.isnew=?))] Parameter 0: 'D' Parameter 1: '-1' Parameter 2: '1300315252437' Parameter 3: 'F8848399A2AE8C7C613E55AAC80054ED6D4C4996' Parameter 4: 'B' Found a query that took more than a minute (1627427 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE ((t0.jobid=? AND t0.childidhash=?)) AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.linktype=t0.linktype AND t1.jobid=t0.jobid AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND t1.isnew=?))] Parameter 0: 'D' Parameter 1: '-1' Parameter 2: '1300315252437' Parameter 3: 'F27D1DB1C2E2927F7DD5BD7B8B613008CF0FEA52' Parameter 4: 'B' Found a query that took more than a minute (1627444 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE ((t0.jobid=? AND t0.childidhash=?)) AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.linktype=t0.linktype AND t1.jobid=t0.jobid AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND t1.isnew=?))] Parameter 0: 'D' Parameter 1: '-1' Parameter 2: '1300315252437' Parameter 3: 'F13204EB47B99ACE783AC62341763D4DEA64DAD3' Parameter 4: 'B' Found a query that took more than a minute (1627445 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE ((t0.jobid=? AND t0.childidhash=?)) AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.linktype=t0.linktype AND t1.jobid=t0.jobid AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND t1.isnew=?))] Parameter 0: 'D' Parameter 1: '-1' Parameter 2: '1300315252437' Parameter 3: 'C4FA1B2511EB16B47891A614290D7EFDC542EFB3' Parameter 4: 'B' Found a query that took more than a minute (1627429 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE ((t0.jobid=? AND t0.childidhash=?)) AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.linktype=t0.linktype AND t1.jobid=t0.jobid AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND t1.isnew=?))] Parameter 0: 'D' Parameter 1: '-1' Parameter 2: '1300315252437' Parameter 3: '037C78E60C534C68924C36E80D6A469B7371987C' Parameter 4: 'B' Found a query that took more than a minute (3095263 ms): [SELECT jobid,CAST(COUNT(dochash) AS BIGINT) AS doccount FROM jobqueue t1 GROUP BY jobid] Found a query that took more than a minute (1622113 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid AND t1.isnew=?) AND t0.jobid=?] Parameter 0: '035F0D4F860F38F039F3C0D4D642D0CA2F83D349' Parameter 1: 'B' Parameter 2: '1300315252437' Found a query that took more than a minute (1627657 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid AND t1.isnew=?) AND t0.jobid=?] Parameter 0: 'F75CCB0A5B55E281A506712D331CDD44F0D10F04' Parameter 1: 'B' Parameter 2: '1300315252437' Found a query that took more than a minute (1622790 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid AND t1.isnew=?) AND t0.jobid=?] Parameter 0: '01595CFF50BEE46E068F5AEBB539483F73C672AC' Parameter 1: 'B' Parameter 2: '1300315252437'
          Hide
          Karl Wright added a comment -

          Turning on query plan output had some interesting results. The following queries are doing full table scans. In all cases, I cannot see why the planner does not use an existing index that was created specifically to support the query. The other interesting thing to note is that NONE of these queries appeared in my list of "long running queries" above, so they were not involved in the 30-minute stall recorded there. But maybe the underlying cause is similar?

          SELECT parentidhash FROM intrinsiclink WHERE (jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?)
          (identical index exists: intrinsiclink: (jobid,linktype,parentidhash,childidhash) )

          SELECT id,distance,linktype FROM hopcount WHERE (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?)
          (identical index exists: hopcount: (jobid,linktype,parentidhash) )

          SELECT id FROM jobqueue WHERE (jobid=? AND status=?) OR (jobid=? AND status=?) OR (jobid=? AND status=?) OR (jobid=? AND status=?) OR (jobid=? AND status=?) OR (jobid=? AND status=?)
          (identical index exists: jobqueue: (jobid,status) )

          SELECT parentidhash,linktype,distance FROM hopcount WHERE (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?)
          (identical index exists: hopcount: (jobid,linktype,parentidhash) )

          FWIW, all of these queries use the corresponding index when run under PostgreSQL.
          I'm happy to create a new ticket for this problem, if appropriate. Comments?

          Show
          Karl Wright added a comment - Turning on query plan output had some interesting results. The following queries are doing full table scans. In all cases, I cannot see why the planner does not use an existing index that was created specifically to support the query. The other interesting thing to note is that NONE of these queries appeared in my list of "long running queries" above, so they were not involved in the 30-minute stall recorded there. But maybe the underlying cause is similar? SELECT parentidhash FROM intrinsiclink WHERE (jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) (identical index exists: intrinsiclink: (jobid,linktype,parentidhash,childidhash) ) SELECT id,distance,linktype FROM hopcount WHERE (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) (identical index exists: hopcount: (jobid,linktype,parentidhash) ) SELECT id FROM jobqueue WHERE (jobid=? AND status=?) OR (jobid=? AND status=?) OR (jobid=? AND status=?) OR (jobid=? AND status=?) OR (jobid=? AND status=?) OR (jobid=? AND status=?) (identical index exists: jobqueue: (jobid,status) ) SELECT parentidhash,linktype,distance FROM hopcount WHERE (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) (identical index exists: hopcount: (jobid,linktype,parentidhash) ) FWIW, all of these queries use the corresponding index when run under PostgreSQL. I'm happy to create a new ticket for this problem, if appropriate. Comments?
          Hide
          Knut Anders Hatlen added a comment -

          Thanks for verifying that the infinite waiting has been fixed, Karl. I think it would be better to file a separate issue for investigating the bad plan. I'm not sure we have any good way of using indexes for the above query currently, though. Except doing a range scan of the index using the smallest and largest value of (jobid,linktype,parentidhash) in the supplied list of predicates, but that may very well degenerate to a full scan in many cases, depending on the actual values. We do have something called multi-probe scans that might be useful for such queries, but it is my understanding that they are only used for single-column keys.

          One quick check that might indicate whether the problem is the optimizer or we're just lacking the proper access methods to use the indexes efficiently, would be to add an optimizer override to force the use of the index - http://db.apache.org/derby/docs/10.7/tuning/ctundepthoptover.html.

          You may also want to make sure that the optimizer has up to date statistics for the indexes by calling SYSCS_UTIL.SYSCS_UPDATE_STATISTICS on the tables involved in the query - http://db.apache.org/derby/docs/10.7/ref/rrefupdatestatsproc.html.

          Show
          Knut Anders Hatlen added a comment - Thanks for verifying that the infinite waiting has been fixed, Karl. I think it would be better to file a separate issue for investigating the bad plan. I'm not sure we have any good way of using indexes for the above query currently, though. Except doing a range scan of the index using the smallest and largest value of (jobid,linktype,parentidhash) in the supplied list of predicates, but that may very well degenerate to a full scan in many cases, depending on the actual values. We do have something called multi-probe scans that might be useful for such queries, but it is my understanding that they are only used for single-column keys. One quick check that might indicate whether the problem is the optimizer or we're just lacking the proper access methods to use the indexes efficiently, would be to add an optimizer override to force the use of the index - http://db.apache.org/derby/docs/10.7/tuning/ctundepthoptover.html . You may also want to make sure that the optimizer has up to date statistics for the indexes by calling SYSCS_UTIL.SYSCS_UPDATE_STATISTICS on the tables involved in the query - http://db.apache.org/derby/docs/10.7/ref/rrefupdatestatsproc.html .
          Hide
          Karl Wright added a comment -

          The update statistics macro I cannot get to run. For all schemas I give it, except SYS, it tells me the schema is not found, even though I have explicitly created it. It doesn't accept APP either. For SYS, it complains that ALTER TABLE is not allowed.

          In ij, "SHOW SCHEMAS" does not list all the schemas I've created either:

          TABLE_SCHEM
          ------------------------------
          APP
          NULLID
          SQLJ
          SYS
          SYSCAT
          SYSCS_DIAG
          SYSCS_UTIL
          SYSFUN
          SYSIBM
          SYSPROC
          SYSSTAT

          11 rows selected

          .... so I am not sure what is going on with this.

          Show
          Karl Wright added a comment - The update statistics macro I cannot get to run. For all schemas I give it, except SYS, it tells me the schema is not found, even though I have explicitly created it. It doesn't accept APP either. For SYS, it complains that ALTER TABLE is not allowed. In ij, "SHOW SCHEMAS" does not list all the schemas I've created either: TABLE_SCHEM ------------------------------ APP NULLID SQLJ SYS SYSCAT SYSCS_DIAG SYSCS_UTIL SYSFUN SYSIBM SYSPROC SYSSTAT 11 rows selected .... so I am not sure what is going on with this.
          Hide
          Karl Wright added a comment -

          More research shows that the problem is that you cannot calculate statistics for tables that are part of the app or default schema. Tables part of any other schema are OK.

          Show
          Karl Wright added a comment - More research shows that the problem is that you cannot calculate statistics for tables that are part of the app or default schema. Tables part of any other schema are OK.
          Hide
          Karl Wright added a comment -

          Never mind; it cares about case for the table name. Converting to upper case fixed the problem.

          Show
          Karl Wright added a comment - Never mind; it cares about case for the table name. Converting to upper case fixed the problem.
          Hide
          Karl Wright added a comment -

          Created a new ticket for the optimizer problem, DERBY-5142. As far as I'm concerned, this current ticket is resolved.

          Show
          Karl Wright added a comment - Created a new ticket for the optimizer problem, DERBY-5142 . As far as I'm concerned, this current ticket is resolved.
          Hide
          Knut Anders Hatlen added a comment -

          I've added some more comments to the Deadlock class (see the patch comments.diff):

          • a high level description of the algorithm in the class javadoc
          • more javadoc for the methods
          • expanded on the comments inside the actual detection algorithm in Deadlock.look()

          Committed revision 1084561.

          Show
          Knut Anders Hatlen added a comment - I've added some more comments to the Deadlock class (see the patch comments.diff): a high level description of the algorithm in the class javadoc more javadoc for the methods expanded on the comments inside the actual detection algorithm in Deadlock.look() Committed revision 1084561.
          Hide
          Knut Anders Hatlen added a comment -

          Thanks for all the help with tracking this down, Karl, and for verifying the the fix! I'm resolving the issue as a duplicate of DERBY-3980 since the two issues turned out to be caused by the same underlying bug(s) in the deadlock detection algorithm.

          Show
          Knut Anders Hatlen added a comment - Thanks for all the help with tracking this down, Karl, and for verifying the the fix! I'm resolving the issue as a duplicate of DERBY-3980 since the two issues turned out to be caused by the same underlying bug(s) in the deadlock detection algorithm.
          Hide
          Knut Anders Hatlen added a comment -

          [bulk update] Close all resolved issues that haven't been updated for more than one year.

          Show
          Knut Anders Hatlen added a comment - [bulk update] Close all resolved issues that haven't been updated for more than one year.

            People

            • Assignee:
              Knut Anders Hatlen
              Reporter:
              Karl Wright
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development