DERBY-4080

Deadlock between locks and latches in BTreeController.compareRowsForInsert()

    Details

    • Urgency:
      Normal
    • Issue & fix info:
      Repro attached

      Description

      It looks like BTreeController.compareRowsForInsert(), which is used to check for duplicates in a unique nullable index, can run into a deadlock which involves both locks and latches.

      Here's what I think can happen:

      comparePreviousRecord() (or compareNextRecord()) holds a latch on the index page where a new row is about to be inserted, and needs to check if there's a duplicate on one of the adjacent rows. Because the row is near a page boundary, this check moves to another index page, while still holding the latch on the original index page. Then compareRowsForInsert() is called, which tries to get an update lock on the existing row. If it has to wait for the update lock, the latch on the current page is released, but the latch on the original index page is kept. This means that the transaction is holding a latch while it is waiting for a lock, which means that it is blocking all access to that page until it has been granted the lock. If some other transaction that is holding a conflicting lock on the row later needs to latch the index page, those two transactions run into a deadlock and the one that's waiting for the lock will eventually time out (but it will not be reported as a deadlock).

      If compareRowsForInsert() releases all latches when it needs to wait for a lock, the deadlock is prevented, and both of the transactions may be able to complete without timing out.
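One plausible reading of why the timeout is not reported as a deadlock is that the latch wait is invisible to whatever builds the wait-for graph. The issue does not say this, so treat it as an assumption. The sketch below is not Derby code: the class name, the transaction names "inserter" and "deleter", and the graph representation are all hypothetical. It only shows that the cycle exists once the latch wait is counted as an edge.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class WaitForGraphSketch {

    // Each entry maps a waiting transaction to the transactions it waits for.
    static boolean hasCycle(Map<String, List<String>> waitsFor) {
        Set<String> visited = new HashSet<>();
        Set<String> onPath = new HashSet<>();
        for (String node : waitsFor.keySet()) {
            if (visit(node, waitsFor, visited, onPath)) {
                return true;
            }
        }
        return false;
    }

    // Depth-first search; reaching a node already on the current path is a cycle.
    private static boolean visit(String node, Map<String, List<String>> graph,
                                 Set<String> visited, Set<String> onPath) {
        if (onPath.contains(node)) {
            return true;
        }
        if (!visited.add(node)) {
            return false; // already fully explored, no cycle through it
        }
        onPath.add(node);
        for (String next : graph.getOrDefault(node, List.of())) {
            if (visit(next, graph, visited, onPath)) {
                return true;
            }
        }
        onPath.remove(node);
        return false;
    }

    public static void main(String[] args) {
        // Lock waits only: the inserter waits for the row lock held by the deleter.
        Map<String, List<String>> locksOnly = new HashMap<>();
        locksOnly.put("inserter", List.of("deleter"));

        // Add the latch wait: the deleter waits for the page latch held by the inserter.
        Map<String, List<String>> locksAndLatches = new HashMap<>(locksOnly);
        locksAndLatches.put("deleter", List.of("inserter"));

        System.out.println("locks only:        cycle = " + hasCycle(locksOnly));
        System.out.println("locks and latches: cycle = " + hasCycle(locksAndLatches));
    }
}
```

With lock waits alone there is a single edge and no cycle; adding the latch wait closes the loop between the two transactions.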

      1. repro.sql
        1 kB
        Knut Anders Hatlen

        Activity

        Mike Matrigali added a comment -

        Ran the script against trunk; it gets a lock timeout at the end.

        Mike Matrigali added a comment -

        Triaged for 10.9, no changes.

        Knut Anders Hatlen added a comment -

        Removed "possible" from the summary to make it clear that the deadlock has been observed (using the repro that's attached).

        Rick Hillegas added a comment -

        Triaged July 2, 2009: Assigning normal urgency.

        Knut Anders Hatlen added a comment -

        Here's a script that exposes the bug (seen in 10.4.2.1 - (706043) and 10.5.0.0 alpha - (749659)).

        The script inserts a row at the end of one index page, with an uncommitted deleted duplicate at the beginning of the next index page. The insert must wait for the uncommitted delete to be committed. The transaction that holds the exclusive lock on the deleted duplicate then tries to read a row on the previous index page, but cannot latch that page until the insert operation times out.

        A thread dump during the hang shows that this is a deadlock involving both locks and latches:

        "Thread-2" prio=3 tid=0x08473800 nid=0x12 in Object.wait() [0xb60de000..0xb60debe0]
           java.lang.Thread.State: TIMED_WAITING (on object monitor)
                at java.lang.Object.wait(Native Method)
                - waiting on <0xf3c06298> (a org.apache.derby.impl.services.locks.ActiveLock)
                at org.apache.derby.impl.services.locks.ActiveLock.waitForGrant(Unknown Source)
                - locked <0xf3c06298> (a org.apache.derby.impl.services.locks.ActiveLock)
                at org.apache.derby.impl.services.locks.ConcurrentLockSet.lockObject(Unknown Source)
                .
                .
                .
        "main" prio=3 tid=0x0806f400 nid=0x2 in Object.wait() [0xfe34e000..0xfe34ed38]
           java.lang.Thread.State: WAITING (on object monitor)
                at java.lang.Object.wait(Native Method)
                - waiting on <0xf4235668> (a org.apache.derby.impl.store.raw.data.StoredPage)
                at java.lang.Object.wait(Object.java:485)
                at org.apache.derby.impl.store.raw.data.BasePage.setExclusive(Unknown Source)
                - locked <0xf4235668> (a org.apache.derby.impl.store.raw.data.StoredPage)
                at org.apache.derby.impl.store.raw.data.BaseContainer.latchPage(Unknown Source)

        If the transaction that waits for the row lock had released all latches once it detected that it would have to wait, there would not be a deadlock and both transactions would be able to complete successfully. Once it has obtained the lock, it releases the latch on the page that it "forgot" to unlatch and performs a rescan anyway, so I believe it is fine to release that latch earlier in this case.
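The proposed behaviour (try the row lock without waiting, drop every latch before blocking, then rescan) can be sketched in plain Java. This is a toy model under stated assumptions, not Derby's implementation: ReentrantLock stands in for both page latches and the row lock, and the class, field, and method names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class ReleaseLatchesSketch {
    // Hypothetical stand-ins for two index page latches and one row lock.
    static final ReentrantLock originalPageLatch = new ReentrantLock();
    static final ReentrantLock nextPageLatch = new ReentrantLock();
    static final ReentrantLock rowLock = new ReentrantLock();

    // Inserter side: returns how many scans were needed.
    static int insertWithDuplicateCheck(CountDownLatch aboutToWait) throws InterruptedException {
        int scans = 0;
        while (true) {
            scans++;
            originalPageLatch.lock();
            nextPageLatch.lock();
            try {
                if (rowLock.tryLock()) { // no-wait attempt while latched
                    // Simplification: drop the row lock right away instead of at commit.
                    while (rowLock.isHeldByCurrentThread()) {
                        rowLock.unlock();
                    }
                    return scans;
                }
            } finally {
                // Release ALL latches, including the one on the original page.
                nextPageLatch.unlock();
                originalPageLatch.unlock();
            }
            aboutToWait.countDown();
            // Wait for the row lock with no latches held, then loop to rescan.
            if (!rowLock.tryLock(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("lock wait timed out");
            }
        }
    }

    // Deleter side runs on the calling thread; returns the inserter's scan count.
    static int runScenario() {
        try {
            CountDownLatch inserterAboutToWait = new CountDownLatch(1);
            int[] scans = new int[1];
            Throwable[] failure = new Throwable[1];
            rowLock.lock(); // uncommitted delete holds the row lock
            Thread inserter = new Thread(() -> {
                try {
                    scans[0] = insertWithDuplicateCheck(inserterAboutToWait);
                } catch (Throwable t) {
                    failure[0] = t;
                }
            });
            inserter.start();
            inserterAboutToWait.await();
            // Deleter reads a row on the original page, so it needs that latch.
            if (!originalPageLatch.tryLock(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("latch wait timed out");
            }
            originalPageLatch.unlock();
            rowLock.unlock(); // commit: releases the row lock
            inserter.join();
            if (failure[0] != null) {
                throw new IllegalStateException(failure[0]);
            }
            return scans[0];
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("scans needed by inserter: " + runScenario());
    }
}
```

Because the inserter holds no latch while it waits, the deleter can latch the original page, finish, and commit; the inserter then gets the row lock and succeeds on its second scan. If the finally block kept originalPageLatch latched during the wait, the deleter's latch request would block until one of the timeouts fired.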


          People

          • Assignee:
            Unassigned
            Reporter:
            Knut Anders Hatlen
          • Votes:
            0
            Watchers:
            1
