Derby
  1. Derby
  2. DERBY-2877

Print the entire lock list when a deadlock occurs and deadlock tracing is on

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 10.2.2.0
    • Fix Version/s: None
    • Component/s: Services
    • Labels:
      None

      Description

      When a deadlock occurs, derby includes the cycle of locks which caused the deadlock in the SQLException message. This is also printed to derby.log if the properties derby.locks.deadlockTrace and derby.locks.monitor are set to true, or if debug code is being used (e.g. jars from lib-debug distributions). It will be easier to debug deadlocks if the entire lock table is printed to derby.log as well (alternatively to both derby.log and the exception message) in these cases. An example of such a lock table is available at http://wiki.apache.org/db-derby/LockDebugging.

      For example, in a long-running test I have observed deadlocks with lock cycle messages such as:

      Lock : ROW, DELETED, (2,1)
      Waiting XID :

      {6241401573, S}

      , U1, DELETE FROM "U1"."DELETED" WHERE CURRENT OF "SQL_CURLH000C9"
      Granted XID :

      {6241401662, S}


      Lock : ROW, DELETED, (3,3523)
      Waiting XID :

      {6241401662, U}

      , U1, SELECT ITEMID FROM DELETED
      Granted XID :

      {6241401573, U}


      . The selected victim is XID : 6241401573.

      It is not clear from this output why XID 6241401573 is waiting for a shared lock (S) on row (2,1), as an S lock is compatible with other S locks [1].

      Having a snapshot of the contents of the lock table at the time of the deadlock would probably help a great deal in the debugging process.

      [1]: Lock compatibility: http://db.apache.org/derby/docs/dev/devguide/rdevconcepts2462.html

        Issue Links

          Activity

          Hide
          Daniel John Debrunner added a comment -

          The entire contents of the lock table could also hinder debugging by providing too much information, making it hard to find the actual problem. Imagine an active system with 100 users and two of them deadlock, now one would have to sift through 98 unrelated transactions.

          Maybe the approach could be to figure out what would make the deadlock message more useful while displaying only the relevant information.
          E.g. in the example deadlock you have above, possibly there is a bug in what information is being displayed, maybe fixing the bug would be a better approach than drowning the log with too much information?

          Show
          Daniel John Debrunner added a comment - The entire contents of the lock table could also hinder debugging by providing too much information, making it hard to find the actual problem. Imagine an active system with 100 users and two of them deadlock, now one would have to sift through 98 unrelated transactions. Maybe the approach could be to figure out what would make the deadlock message more useful while displaying only the relevant information. E.g. in the example deadlock you have above, possibly there is a bug in what information is being displayed, maybe fixing the bug would be a better approach than drowning the log with too much information?
          Hide
          John H. Embretsen added a comment -

          Good point, it would perhaps be too much information in a system with a very large number of users/threads/connections. Still, this information is already printed with lock wait timeouts, which I guess occur more frequently. So I'd still like to have the possibility to get the lock list at deadlocks too. Perhaps a new property would be in order for this?

          I considered the possibility that there is a bug in the code making the deadlock message, but I have no idea what it could be without seeing the entire lock table at the appropriate time. Other hints are appreciated. So far I have not been able to reproduce the deadlock outside the long-running test, in which it seems to occur for the first time after having run for about a week.

          Show
          John H. Embretsen added a comment - Good point, it would perhaps be too much information in a system with a very large number of users/threads/connections. Still, this information is already printed with lock wait timeouts, which I guess occur more frequently. So I'd still like to have the possibility to get the lock list at deadlocks too. Perhaps a new property would be in order for this? I considered the possibility that there is a bug in the code making the deadlock message, but I have no idea what it could be without seeing the entire lock table at the appropriate time. Other hints are appreciated. So far I have not been able to reproduce the deadlock outside the long-running test, in which it seems to occur for the first time after having run for about a week.
          Hide
          Knut Anders Hatlen added a comment -

          I looks to me that the deadlock in John's example is caused by an index split deadlock like the one reported in DERBY-2991, only that it involves more transaction and therefore is detected as a deadlock, not as a timeout. There is probably an index split waiting for an X lock on (2,1), and that's what preventing the delete transaction from obtaining a shared lock. Perhaps we would get enough debugging information if we printed all waiters for the locks in the cycle.

          Show
          Knut Anders Hatlen added a comment - I looks to me that the deadlock in John's example is caused by an index split deadlock like the one reported in DERBY-2991 , only that it involves more transaction and therefore is detected as a deadlock, not as a timeout. There is probably an index split waiting for an X lock on (2,1), and that's what preventing the delete transaction from obtaining a shared lock. Perhaps we would get enough debugging information if we printed all waiters for the locks in the cycle.
          Hide
          Knut Anders Hatlen added a comment -

          If we'd had a way to reproduce this deadlock, it would have been interesting to see if the fix for DERBY-3980 and DERBY-5073 changed how it was reported. One of the problems fixed there was that a deadlock could sometimes be detected prematurely, before the the full cycle had been seen. The observed consequence of that problem was that the transactions involved in the deadlock never agreed on which transaction to abort, because none of the transactions thought they were involved in the deadlock themselves. I'm wondering, though, if it could also result in incomplete cycles being reported, like the one in this bug report. But I haven't been able to reproduce such a deadlock to see for myself.

          The deadlock in the bug report is caused by the index split problems fixed in DERBY-2991, so the exact same deadlock won't happen again on trunk or any recent releases. Since we don't have a repro for a deadlock where printing the entire lock table would be useful in diagnosing the deadlock, and since we've fixed problems in the deadlock detection algorithm that may have changed how these deadlocks are reported, would it make sense to resolve this issue as "Cannot Reproduce"? If a deadlock with an incomplete cycle in the error message ever occurs again, we can reopen this issue or file a new bug report.

          Show
          Knut Anders Hatlen added a comment - If we'd had a way to reproduce this deadlock, it would have been interesting to see if the fix for DERBY-3980 and DERBY-5073 changed how it was reported. One of the problems fixed there was that a deadlock could sometimes be detected prematurely, before the the full cycle had been seen. The observed consequence of that problem was that the transactions involved in the deadlock never agreed on which transaction to abort, because none of the transactions thought they were involved in the deadlock themselves. I'm wondering, though, if it could also result in incomplete cycles being reported, like the one in this bug report. But I haven't been able to reproduce such a deadlock to see for myself. The deadlock in the bug report is caused by the index split problems fixed in DERBY-2991 , so the exact same deadlock won't happen again on trunk or any recent releases. Since we don't have a repro for a deadlock where printing the entire lock table would be useful in diagnosing the deadlock, and since we've fixed problems in the deadlock detection algorithm that may have changed how these deadlocks are reported, would it make sense to resolve this issue as "Cannot Reproduce"? If a deadlock with an incomplete cycle in the error message ever occurs again, we can reopen this issue or file a new bug report.
          Hide
          Kathey Marsden added a comment -

          I had always seen this issue as a diagnostic improvement to help users track down deadlocks in their applications rather than a diagnostic to track down Derby bugs. I thought about closing this after Lily added extended diagnostics with DERBY-4856 where derby.stream.error.extendedDiagSeverityLevel can be modified to do a thread dump on deadlock, but I decided that DERBY-2877 would still be useful because it would restrict the thread dump print to just the threads involved in the deadlock. But maybe I am misunderstanding this issue.

          Show
          Kathey Marsden added a comment - I had always seen this issue as a diagnostic improvement to help users track down deadlocks in their applications rather than a diagnostic to track down Derby bugs. I thought about closing this after Lily added extended diagnostics with DERBY-4856 where derby.stream.error.extendedDiagSeverityLevel can be modified to do a thread dump on deadlock, but I decided that DERBY-2877 would still be useful because it would restrict the thread dump print to just the threads involved in the deadlock. But maybe I am misunderstanding this issue.
          Hide
          Knut Anders Hatlen added a comment -

          Hi Kathey,

          I think this issue was logged to get a dump of the lock table, not a thread dump.

          My understanding was that the need for this was triggered by a bug in Derby's deadlock detection/reporting that caused some deadlocks to be reported without enough information to tell what had deadlocked, like the deadlock in the issue description. The printed deadlock cycle clearly isn't a cycle, so either the full cycle wasn't printed, or the deadlock error was a false positive.

          If the underlying bug is fixed so that deadlock errors always report all locks involved in the deadlock, there should be no need to dump the full lock table on a deadlock. The deadlock error message is supposed to show information about all involved locks and be self-sufficient. We don't know if the underlying bug is fixed, since we don't have a repro, but there is a possibility that it has the same root cause as DERBY-3980 and DERBY-5073.

          Show
          Knut Anders Hatlen added a comment - Hi Kathey, I think this issue was logged to get a dump of the lock table, not a thread dump. My understanding was that the need for this was triggered by a bug in Derby's deadlock detection/reporting that caused some deadlocks to be reported without enough information to tell what had deadlocked, like the deadlock in the issue description. The printed deadlock cycle clearly isn't a cycle, so either the full cycle wasn't printed, or the deadlock error was a false positive. If the underlying bug is fixed so that deadlock errors always report all locks involved in the deadlock, there should be no need to dump the full lock table on a deadlock. The deadlock error message is supposed to show information about all involved locks and be self-sufficient. We don't know if the underlying bug is fixed, since we don't have a repro, but there is a possibility that it has the same root cause as DERBY-3980 and DERBY-5073 .
          Hide
          Kathey Marsden added a comment -

          I am sorry, I was confusing this with issue DERBY-666 which is the thread dump issue. Ok to close this one. Sorry for the noise.

          Show
          Kathey Marsden added a comment - I am sorry, I was confusing this with issue DERBY-666 which is the thread dump issue. Ok to close this one. Sorry for the noise.
          Hide
          John H. Embretsen added a comment -

          I am not in a position to be able to reproduce the deadlock issue I was seeing when I reported this issue. Closing the issue is OK for me, assuming the underlying bugs have been fixed. I fully support Knut Anders' earlier statement:

          "If a deadlock with an incomplete cycle in the error message ever occurs again, we can reopen this issue or file a new bug report. "

          Show
          John H. Embretsen added a comment - I am not in a position to be able to reproduce the deadlock issue I was seeing when I reported this issue. Closing the issue is OK for me, assuming the underlying bugs have been fixed. I fully support Knut Anders' earlier statement: "If a deadlock with an incomplete cycle in the error message ever occurs again, we can reopen this issue or file a new bug report. "
          Hide
          Knut Anders Hatlen added a comment -

          Thanks, John and Kathey. I'm resolving the issue with the resolution 'Cannot Reproduce' because this deadlock cannot happen after DERBY-2991, and on the assumption that DERBY-3980/DERBY-5073 would have changed the reporting if it were reproducible.

          Show
          Knut Anders Hatlen added a comment - Thanks, John and Kathey. I'm resolving the issue with the resolution 'Cannot Reproduce' because this deadlock cannot happen after DERBY-2991 , and on the assumption that DERBY-3980 / DERBY-5073 would have changed the reporting if it were reproducible.
          Hide
          Knut Anders Hatlen added a comment -

          [bulk update] Close all resolved issues that haven't been updated for more than one year.

          Show
          Knut Anders Hatlen added a comment - [bulk update] Close all resolved issues that haven't been updated for more than one year.

            People

            • Assignee:
              Unassigned
              Reporter:
              John H. Embretsen
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development