Derby
  1. Derby
  2. DERBY-2107

Move page latching out of the lock manager

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 10.3.1.4
    • Fix Version/s: 10.3.1.4
    • Component/s: Services, Store
    • Labels:
      None
    • Bug behavior facts:
      Performance

      Description

      Latching of pages could be done more efficiently locally in store than in the lock manager. See the discussion here: http://thread.gmane.org/gmane.comp.apache.db.derby.devel/33135

      1. derby-2107-1a.diff
        22 kB
        Knut Anders Hatlen
      2. derby-2107-1a.stat
        0.5 kB
        Knut Anders Hatlen
      3. derby-2107-1b.diff
        23 kB
        Knut Anders Hatlen
      4. derby-2107-1c.diff
        9 kB
        Knut Anders Hatlen
      5. derby-2107-1c.stat
        0.1 kB
        Knut Anders Hatlen
      6. derby-2107-1d.diff
        10 kB
        Knut Anders Hatlen
      7. derby-2107-2a.diff
        9 kB
        Knut Anders Hatlen
      8. derby-2107-2a.stat
        0.2 kB
        Knut Anders Hatlen
      9. derby-2107-2a.diff
        9 kB
        Knut Anders Hatlen
      10. derby-2107-2a.stat
        0.2 kB
        Knut Anders Hatlen
      11. derby-2107-3a.diff
        3 kB
        Knut Anders Hatlen
      12. derby-2107-3a.stat
        0.2 kB
        Knut Anders Hatlen
      13. derby-2107-4a.diff
        2 kB
        Knut Anders Hatlen
      14. derby-2107-4a.stat
        0.1 kB
        Knut Anders Hatlen

        Issue Links

          Activity

          Hide
          Knut Anders Hatlen added a comment -

          Here's a first patch (derby-2107-1a.diff) which implements latching
          locally in the page objects. Description of the changes:

          • BasePage.java:
          • removed all traces of Latch and Lockable
          • setExclusive() and setExclusiveNoWait() now set the variables that
            the page is locked (owner and preLatch) directly, instead of going
            through the lock manager. wait() and notifyAll() are used to
            coordinate with releaseExclusive().

          The rest of the patch affects code that is only called from the unit
          tests for store. I think this is too much complexity for code that is
          never used outside the tests, but I'm just trying to preserve the old
          behaviour for now.

          • LockingPolicy.java:
          • the lockRecordForRead() and lockRecordForWrite() methods that
            took a Latch object was changed to take a Transaction object, a
            Page object and a ContainerHandle object. This was needed in
            order to be able to unlatch the latched page in case a row lock
            couldn't be obtained immediately.
          • NoLocking.java, RowLocking1.java:
          • change the signatures as in LockingPolicy. No implementation
            changes needed.
          • RowLocking2.java, RowLocking3.java:
          • change the signatures as in LockingPolicy.
          • instead of calling the LockFactory.lockObject() method which
            took a Latch parameter, the logic needed to be inlined. First
            try to get a row lock without waiting. If it couldn't be
            obtained immediately, unlatch the page, wait for the lock, and
            re-latch the page when the lock has been obtained.
          • Page.java (interface) and BasePage.java (implementation):
          • added new method Page.latch(ContainerHandle) (which only
            forwards the call to the protected setExclusive() method) which
            was needed for the RowLocking* classes to be able to re-latch the
            page. (Page already has an unlatch() method.)

          I have run derbyall and the JUnit tests successfully. I sometimes see
          an error in lang/compressTable.sql, but the same diff has been seen in
          the nightlies too
          (http://dbtg.thresher.com/derby/test/trunk15/jvm1.5/testing/testlog/Linux-2.6.9-34.ELsmp_x86_64-x86_64/476241-derbylang_diff.txt).
          However, I do see it more frequently with the patch, probably because
          of changes in the timing.

          This patch does not address deadlock detection issues, or timeout when
          the latch cannot be obtained. I think those issues would be better to
          address in follow-up patches (incremental development, right?).

          Reviews and comments would be greatly appreciated! Thanks.

          Show
          Knut Anders Hatlen added a comment - Here's a first patch (derby-2107-1a.diff) which implements latching locally in the page objects. Description of the changes: BasePage.java: removed all traces of Latch and Lockable setExclusive() and setExclusiveNoWait() now set the variables that the page is locked (owner and preLatch) directly, instead of going through the lock manager. wait() and notifyAll() are used to coordinate with releaseExclusive(). The rest of the patch affects code that is only called from the unit tests for store. I think this is too much complexity for code that is never used outside the tests, but I'm just trying to preserve the old behaviour for now. LockingPolicy.java: the lockRecordForRead() and lockRecordForWrite() methods that took a Latch object was changed to take a Transaction object, a Page object and a ContainerHandle object. This was needed in order to be able to unlatch the latched page in case a row lock couldn't be obtained immediately. NoLocking.java, RowLocking1.java: change the signatures as in LockingPolicy. No implementation changes needed. RowLocking2.java, RowLocking3.java: change the signatures as in LockingPolicy. instead of calling the LockFactory.lockObject() method which took a Latch parameter, the logic needed to be inlined. First try to get a row lock without waiting. If it couldn't be obtained immediately, unlatch the page, wait for the lock, and re-latch the page when the lock has been obtained. Page.java (interface) and BasePage.java (implementation): added new method Page.latch(ContainerHandle) (which only forwards the call to the protected setExclusive() method) which was needed for the RowLocking* classes to be able to re-latch the page. (Page already has an unlatch() method.) I have run derbyall and the JUnit tests successfully. I sometimes see an error in lang/compressTable.sql, but the same diff has been seen in the nightlies too ( http://dbtg.thresher.com/derby/test/trunk15/jvm1.5/testing/testlog/Linux-2.6.9-34.ELsmp_x86_64-x86_64/476241-derbylang_diff.txt ). However, I do see it more frequently with the patch, probably because of changes in the timing. This patch does not address deadlock detection issues, or timeout when the latch cannot be obtained. I think those issues would be better to address in follow-up patches (incremental development, right?). Reviews and comments would be greatly appreciated! Thanks.
          Hide
          Daniel John Debrunner added a comment -

          Couple of comments jump out from an initial look at the patch.

          The first may have been an existing issue:

          1) The field nestedLatch is used under synchronization when gettting the latch, but not when releasing the latch.

          2) In setExclusive() if an attempt is made to double latch the page it will suceed, even when the transaction is not in abort. This seems to conflict the comments in the method. This will not be handled correctly on the unlatch as the first unlatch will completely clear the latch leaving the caller thinking it has the latch when it doesn't.
          Not sure what the old code did in this situation.

          Show
          Daniel John Debrunner added a comment - Couple of comments jump out from an initial look at the patch. The first may have been an existing issue: 1) The field nestedLatch is used under synchronization when gettting the latch, but not when releasing the latch. 2) In setExclusive() if an attempt is made to double latch the page it will suceed, even when the transaction is not in abort. This seems to conflict the comments in the method. This will not be handled correctly on the unlatch as the first unlatch will completely clear the latch leaving the caller thinking it has the latch when it doesn't. Not sure what the old code did in this situation.
          Hide
          Daniel John Debrunner added a comment -

          On more, I think there has to be synchronization in isLatched() based upon previous discussions on the list about Java's memory model.

          Show
          Daniel John Debrunner added a comment - On more, I think there has to be synchronization in isLatched() based upon previous discussions on the list about Java's memory model.
          Hide
          Knut Anders Hatlen added a comment -

          > The first may have been an existing issue:
          > 1) The field nestedLatch is used under synchronization when gettting
          > the latch, but not when releasing the latch.

          Thanks, I will move that into the synchronized block.

          > 2) In setExclusive() if an attempt is made to double latch the page
          > it will suceed, even when the transaction is not in abort. This
          > seems to conflict the comments in the method. This will not be
          > handled correctly on the unlatch as the first unlatch will
          > completely clear the latch leaving the caller thinking it has the
          > latch when it doesn't. Not sure what the old code did in this
          > situation.

          Not sure I understand this comment. If one tries to double latch a
          page, and the transaction is not in abort, it will wait until the page
          is unlatched, just as any other transaction. This is the same as the
          old code did, except that the old code would probably detect the
          deadlock after derby.locks.deadlockTimeout seconds. Am I missing
          something?

          > On more, I think there has to be synchronization in isLatched()
          > based upon previous discussions on the list about Java's memory
          > model.

          Yes, I believe that's true. Will fix that too.

          Show
          Knut Anders Hatlen added a comment - > The first may have been an existing issue: > 1) The field nestedLatch is used under synchronization when gettting > the latch, but not when releasing the latch. Thanks, I will move that into the synchronized block. > 2) In setExclusive() if an attempt is made to double latch the page > it will suceed, even when the transaction is not in abort. This > seems to conflict the comments in the method. This will not be > handled correctly on the unlatch as the first unlatch will > completely clear the latch leaving the caller thinking it has the > latch when it doesn't. Not sure what the old code did in this > situation. Not sure I understand this comment. If one tries to double latch a page, and the transaction is not in abort, it will wait until the page is unlatched, just as any other transaction. This is the same as the old code did, except that the old code would probably detect the deadlock after derby.locks.deadlockTimeout seconds. Am I missing something? > On more, I think there has to be synchronization in isLatched() > based upon previous discussions on the list about Java's memory > model. Yes, I believe that's true. Will fix that too.
          Hide
          Daniel John Debrunner added a comment -

          Sorry, I was wrong on issue 2), got confused mentally walking through the code.

          Though the patch does remove this comment: for the situation in 2)

          // just deadlock out ...

          I think some comment should exist there indicating that the code path is not expected or maybe even throw an exception there.
          The patched code I think is confusing without some form of comment as to what is expected.

          Show
          Daniel John Debrunner added a comment - Sorry, I was wrong on issue 2), got confused mentally walking through the code. Though the patch does remove this comment: for the situation in 2) // just deadlock out ... I think some comment should exist there indicating that the code path is not expected or maybe even throw an exception there. The patched code I think is confusing without some form of comment as to what is expected.
          Hide
          Knut Anders Hatlen added a comment -

          Attaching a new version of the patch to address Dan's comments. (Improved synchronization and comments.)

          If this patch is accepted, I plan to post follow-up patches to address these issues:
          1) There should be some kind of timeout mechanism so that possible deadlocks are reported.
          2) Tool for debugging of deadlocks. A VTI which shows all latches in the system was suggested on derby-dev.
          3) The extra complexity to handle latches should be removed from the lock manager.

          Show
          Knut Anders Hatlen added a comment - Attaching a new version of the patch to address Dan's comments. (Improved synchronization and comments.) If this patch is accepted, I plan to post follow-up patches to address these issues: 1) There should be some kind of timeout mechanism so that possible deadlocks are reported. 2) Tool for debugging of deadlocks. A VTI which shows all latches in the system was suggested on derby-dev. 3) The extra complexity to handle latches should be removed from the lock manager.
          Hide
          Daniel John Debrunner added a comment -

          Even though the parameters have changed for LockingPolicy.lockRecordForRead its description has not, nor in any of its implementations. Since you have been working in the code and understanding it, it would be good to record knowledge you gained, and improve the java doc comments if they are wrong.

          I think this comment for LockingPoliy.lockRecordForRead needs to be augmented (& I think this is an existing problem ).
          "Lock a record while holding a page latch."
          I think the api should state that the latch might be dropped and re-acquired during this call.

          The ordering & timing of latching/locking has changed in the row locking implementations of LockingPolicy.lockRecordForRead.
          Can you share with the list any evaluation you perfomed of the effect on this, in terms of race conditions between multiple threads?

          Show
          Daniel John Debrunner added a comment - Even though the parameters have changed for LockingPolicy.lockRecordForRead its description has not, nor in any of its implementations. Since you have been working in the code and understanding it, it would be good to record knowledge you gained, and improve the java doc comments if they are wrong. I think this comment for LockingPoliy.lockRecordForRead needs to be augmented (& I think this is an existing problem ). "Lock a record while holding a page latch." I think the api should state that the latch might be dropped and re-acquired during this call. The ordering & timing of latching/locking has changed in the row locking implementations of LockingPolicy.lockRecordForRead. Can you share with the list any evaluation you perfomed of the effect on this, in terms of race conditions between multiple threads?
          Hide
          Mike Matrigali added a comment -

          I am reviewing the changes, and do agree with Dan's comments. Have you thought about the implecations of order change? I am
          trying to think if it is a problem. So far what I am looking at is:

          1) With your new implementation a waiting lock now requires 2 trips to the lock manager rather than the previous 1 trip.
          This was one of the reasons to put latching in the lock manager in the first place, so that the queuing of the waiting
          lock could be handled once rather than twice.

          2) I believe existing implementation the logic for a lock that waits is something like:
          o get exclusive access to lock manager
          o queue waiting lock while holding latch
          o release latch
          o release exclusive access to lock manager
          o wait for grant of lock
          o reclaim latch

          new order is
          o release latch
          o get exclusive access to lock manager
          o queue waiting lock WITHOUT latch
          o release exclusive access to lock manager
          o wait for grant of lock
          o reclaim latch

          The case I am trying to work out in my mind is the space reclamation thread. It gets latches on pages and
          loops through rows requesting zero duration locks to see if it can purge a row from a page. In the first case
          the row we are requesting a lock on will never be purged as it is not possible for the reclaim space thread to
          get a latch on a the page and purge the row we are waiting for a lock on (the lock request for an update lock
          will fail because of our wait). In the second case It can sneak in between the release of the latch and queue
          of the waiting lock.

          It might not be good that the system depends on this behavior, but at this point I don't know how much or if
          other code would need to change to support this lock/latch behavior change. i

          Show
          Mike Matrigali added a comment - I am reviewing the changes, and do agree with Dan's comments. Have you thought about the implecations of order change? I am trying to think if it is a problem. So far what I am looking at is: 1) With your new implementation a waiting lock now requires 2 trips to the lock manager rather than the previous 1 trip. This was one of the reasons to put latching in the lock manager in the first place, so that the queuing of the waiting lock could be handled once rather than twice. 2) I believe existing implementation the logic for a lock that waits is something like: o get exclusive access to lock manager o queue waiting lock while holding latch o release latch o release exclusive access to lock manager o wait for grant of lock o reclaim latch new order is o release latch o get exclusive access to lock manager o queue waiting lock WITHOUT latch o release exclusive access to lock manager o wait for grant of lock o reclaim latch The case I am trying to work out in my mind is the space reclamation thread. It gets latches on pages and loops through rows requesting zero duration locks to see if it can purge a row from a page. In the first case the row we are requesting a lock on will never be purged as it is not possible for the reclaim space thread to get a latch on a the page and purge the row we are waiting for a lock on (the lock request for an update lock will fail because of our wait). In the second case It can sneak in between the release of the latch and queue of the waiting lock. It might not be good that the system depends on this behavior, but at this point I don't know how much or if other code would need to change to support this lock/latch behavior change. i
          Hide
          Daniel John Debrunner added a comment -

          I think the patch has a severe problem, here is code from RowLocking3.lockRecordForRead (one of several instances):

          // Couldn't get the lock immediately. Release the latch while waiting.
          latchedPage.unlatch();

          lf.lockObject(
          t.getCompatibilitySpace(), t, record, qualifier,
          C_LockFactory.TIMED_WAIT);

          latchedPage.latch(containerHandle);

          The api for Page.unlatch() states that the reference used to unlatch the page must be not used (i.e. must be set to null) after calling unlatch().
          This code uses the reference used to unlatch the page after the lock call and the caller of this method will continue to use the same Page reference.
          The reference must be set to null after an unlatch because once the page in unlatched there is no guarantee that the page will remain in cache or that the reference will continue to describe the same page.

          So this code clearly uses a Page reference after it has been unlatched and thus can be referring to a wrong page or an invlaid page.

          I think it works with the current code because a a latch on the Page is a super-set of a lock manager Latch, I'll add more on that in a separate comment.

          Show
          Daniel John Debrunner added a comment - I think the patch has a severe problem, here is code from RowLocking3.lockRecordForRead (one of several instances): // Couldn't get the lock immediately. Release the latch while waiting. latchedPage.unlatch(); lf.lockObject( t.getCompatibilitySpace(), t, record, qualifier, C_LockFactory.TIMED_WAIT); latchedPage.latch(containerHandle); The api for Page.unlatch() states that the reference used to unlatch the page must be not used (i.e. must be set to null) after calling unlatch(). This code uses the reference used to unlatch the page after the lock call and the caller of this method will continue to use the same Page reference. The reference must be set to null after an unlatch because once the page in unlatched there is no guarantee that the page will remain in cache or that the reference will continue to describe the same page. So this code clearly uses a Page reference after it has been unlatched and thus can be referring to a wrong page or an invlaid page. I think it works with the current code because a a latch on the Page is a super-set of a lock manager Latch, I'll add more on that in a separate comment.
          Hide
          Mike Matrigali added a comment -

          Dan is absolutely right about the unlatch() call. Once unlatch call is made the page is free to be thrown out of the cache and be replaced by a different page, from possibly a different table, different page size ....
          Once you call unlatch you have to reload the buffer manager, the problem is that you can't do it in the interfaces you have
          as the callers still have the old reference. The btree code does this in it's code that manages latch/lock outside of
          lockmanager. For btree it was necessary as after you gave up latch while waiting on lock the row could move off the
          original page.

          For heap rows don't ever move off the page. They can "disappear" in the purge case described above, and in the
          offline compress case which is handled by a table level lock.

          Show
          Mike Matrigali added a comment - Dan is absolutely right about the unlatch() call. Once unlatch call is made the page is free to be thrown out of the cache and be replaced by a different page, from possibly a different table, different page size .... Once you call unlatch you have to reload the buffer manager, the problem is that you can't do it in the interfaces you have as the callers still have the old reference. The btree code does this in it's code that manages latch/lock outside of lockmanager. For btree it was necessary as after you gave up latch while waiting on lock the row could move off the original page. For heap rows don't ever move off the page. They can "disappear" in the purge case described above, and in the offline compress case which is handled by a table level lock.
          Hide
          Mike Matrigali added a comment -

          Dan, do you think the current code is actually allowing another thread to latch the page while the lock is waiting? I used to think it did, but definitely got confused reading through the lock manager.

          Show
          Mike Matrigali added a comment - Dan, do you think the current code is actually allowing another thread to latch the page while the lock is waiting? I used to think it did, but definitely got confused reading through the lock manager.
          Hide
          Daniel John Debrunner added a comment -

          This is how the page latching currently works in the case when the lock manager releases its latch:

          Page page = ContainerHandle.getPage(long)
          results in a call to FileContainer.getPage(...)

          • page is found in the cache and kept (ie. usage count in the cache will be bumped)
          • page is latched
          • BasePage.lockEvent() is called to assign page ownership

          at this point the page is held in the cache and is latched, ContainerHandle.getPage states Page reference will remain valid until unlatch() is called. Note that multiple threads can execute this code concurrently, all will have the same valid reference to the Page, one will have it latched, the others will be queued in the lock manager waiting for the latch to be released.

          LockingPolicy.lockRecordForRead() called to get a row lock passing the Page's latch, assume the lock can not be obtained.

          • Page's latch will be released
          • BasePage.unlockEvent() is called to clear page ownership
          • caller will wait for row lock
            at this point the caller still has not returned the Page object to the cache, so from the cache's point of view the
            Page is still in use and so cannot be re-used with another identity.
          • next waiter is be granted the Page latch and continue their work, ie. their ContainerHandle.getPage(long) returns with the same valid Page reference.
          • at some point the caller gets its row lock and then attempts to get the latch, queuing behind any other waiters
          • once the latch is granted (following the row lock) the lockRecordForRead() call returns
            BasePage.lockEvent() is called to assign page ownership

          Caller reads and/or modifies page

          Page.unlatch() is called to indicate the caller is finished with the page

          • page is unlatched
          • BasePage.unlockEvent() is called to clear page ownership
          • Page is released from the cache (ie. usage count decremented)

          At this point in time the caller cannot trust the Page reference any more, since its usage count in the cache could have dropped to zero and some other thread could have replaced it.

          Show
          Daniel John Debrunner added a comment - This is how the page latching currently works in the case when the lock manager releases its latch: Page page = ContainerHandle.getPage(long) results in a call to FileContainer.getPage(...) page is found in the cache and kept (ie. usage count in the cache will be bumped) page is latched BasePage.lockEvent() is called to assign page ownership at this point the page is held in the cache and is latched, ContainerHandle.getPage states Page reference will remain valid until unlatch() is called. Note that multiple threads can execute this code concurrently, all will have the same valid reference to the Page, one will have it latched, the others will be queued in the lock manager waiting for the latch to be released. LockingPolicy.lockRecordForRead() called to get a row lock passing the Page's latch, assume the lock can not be obtained. Page's latch will be released BasePage.unlockEvent() is called to clear page ownership caller will wait for row lock at this point the caller still has not returned the Page object to the cache, so from the cache's point of view the Page is still in use and so cannot be re-used with another identity. next waiter is be granted the Page latch and continue their work, ie. their ContainerHandle.getPage(long) returns with the same valid Page reference. at some point the caller gets its row lock and then attempts to get the latch, queuing behind any other waiters once the latch is granted (following the row lock) the lockRecordForRead() call returns BasePage.lockEvent() is called to assign page ownership Caller reads and/or modifies page Page.unlatch() is called to indicate the caller is finished with the page page is unlatched BasePage.unlockEvent() is called to clear page ownership Page is released from the cache (ie. usage count decremented) At this point in time the caller cannot trust the Page reference any more, since its usage count in the cache could have dropped to zero and some other thread could have replaced it.
          Hide
          Daniel John Debrunner added a comment -

          I think the patch hit this issue due to some existing confusion with naming in the existing code.

          The top level api currently is
          Page page = ContainerHandle.getPage()
          // do stuff with page
          page.unlatch();

          and then internally the page code uses a lock manager Latch to help manage its lifecyle.

          As latching moves into the page implementation it might make sense to rename concepts to maintain a clear separation.

          Maybe the top-level api could use release() instead of unlatch() and leave the latch concept for the code Knut has implemented.
          Obviously changing/cleaning up the comments in the various .getPage() calls to reflect the new terminology would be required as well.

          Show
          Daniel John Debrunner added a comment - I think the patch hit this issue due to some existing confusion with naming in the existing code. The top level api currently is Page page = ContainerHandle.getPage() // do stuff with page page.unlatch(); and then internally the page code uses a lock manager Latch to help manage its lifecyle. As latching moves into the page implementation it might make sense to rename concepts to maintain a clear separation. Maybe the top-level api could use release() instead of unlatch() and leave the latch concept for the code Knut has implemented. Obviously changing/cleaning up the comments in the various .getPage() calls to reflect the new terminology would be required as well.
          Hide
          Knut Anders Hatlen added a comment -

          Thanks for your comments, Dan and Mike. It seems to me that all your
          comments are about the variants of LockingPolicy.lockRecordForRead and
          lockRecordForWrite that take a latch parameter. I share your concern
          for the correctness of this code, and that was actually what made me
          start the thread which is archived at
          <URL:http://thread.gmane.org/gmane.comp.apache.db.derby.devel/33135>.

          But this brings us back to the original question in that thread: When
          is this code used? I believe it was concluded that those methods were
          only called when the locking policy was NoLocking, in which case they
          are no-ops. (With the exception of some unit tests which invoke the
          methods of the Page interface directly.)

          If it is true that this code is never used, perhaps we should consider
          removing it?

          Show
          Knut Anders Hatlen added a comment - Thanks for your comments, Dan and Mike. It seems to me that all your comments are about the variants of LockingPolicy.lockRecordForRead and lockRecordForWrite that take a latch parameter. I share your concern for the correctness of this code, and that was actually what made me start the thread which is archived at <URL: http://thread.gmane.org/gmane.comp.apache.db.derby.devel/33135 >. But this brings us back to the original question in that thread: When is this code used? I believe it was concluded that those methods were only called when the locking policy was NoLocking, in which case they are no-ops. (With the exception of some unit tests which invoke the methods of the Page interface directly.) If it is true that this code is never used, perhaps we should consider removing it?
          Hide
          Daniel John Debrunner added a comment -

          Maybe that's the best approach, leave latching as it is for now, and first work on removing the possibly unused code. Then hopefully after the cleanup your proposed latch code will work as it will no longer have the potential of being released while holding a lock.

          Show
          Daniel John Debrunner added a comment - Maybe that's the best approach, leave latching as it is for now, and first work on removing the possibly unused code. Then hopefully after the cleanup your proposed latch code will work as it will no longer have the potential of being released while holding a lock.
          Hide
          Knut Anders Hatlen added a comment -

          Will do that. I have logged DERBY-2197 to track those changes.

          Show
          Knut Anders Hatlen added a comment - Will do that. I have logged DERBY-2197 to track those changes.
          Hide
          Knut Anders Hatlen added a comment -

          Attached new patch (1c). The only change from 1b is that the unused code is not touched (because it has been removed from trunk). The new patch only touches BasePage.java. Please review.

          Show
          Knut Anders Hatlen added a comment - Attached new patch (1c). The only change from 1b is that the unused code is not touched (because it has been removed from trunk). The new patch only touches BasePage.java. Please review.
          Hide
          Knut Anders Hatlen added a comment -

          Does anyone plan to review the latest version of the patch?

          Show
          Knut Anders Hatlen added a comment - Does anyone plan to review the latest version of the patch?
          Hide
          Knut Anders Hatlen added a comment -

          If there are no more comments, I plan to commit 1c after the weekend.

          Show
          Knut Anders Hatlen added a comment - If there are no more comments, I plan to commit 1c after the weekend.
          Hide
          Suresh Thalamati added a comment -

          Hi Knut,

          I reviewed the latest patch, it looks good to be committed. I have only
          few minor questions/comments :

          1) you may want to fix comments in void setExclusive(BaseContainerHandle requester)

          // because i) lock manager might assume latches are exclusive for
          // performance, ii)

          2)
          + owner = requester;
          + requester.addObserver(this);

          It took me some time to understand how this works on an error cases
          scenarios, basically what happens if a thread after acquiring a latch
          errors out for some reason. My understanding is you are handling
          this cases by by putting the pages on the container observer list
          (requester.addObserver(this)). Please add some comments to why this page
          is added to observer list.

          3)
          +
          + // just deadlock out if a transaction tries to double latch the
          + // page while not in abort

          comment is good. But , you may want to add some assertion or throw
          error here, if this case should not happen.

          4)

          • /** Debugging, print slot table information */
          • protected String slotTableToString()

          I don't know why you removed this method in this patch. May
          be this method is not used or not required. Occasionally I found
          these type of methods useful while debugging a page corruption
          to quickly dump page info with some minor changes to the code.

          5)
          + // Expect notify from releaseExclusive().
          + wait();

          I was wondering , if the wait() method here should be time based to catch
          any infinite waits due to incorrectly missing release latch calls or you think
          it is going to be be unnecessarily expensive ?

          5) Did you find the existing unit tests already tests latching methods ? or
          you are planning to write one.

          Thanks
          -suresh

          Show
          Suresh Thalamati added a comment - Hi Knut, I reviewed the latest patch, it looks good to be committed. I have only few minor questions/comments : 1) you may want to fix comments in void setExclusive(BaseContainerHandle requester) // because i) lock manager might assume latches are exclusive for // performance, ii) 2) + owner = requester; + requester.addObserver(this); It took me some time to understand how this works on an error cases scenarios, basically what happens if a thread after acquiring a latch errors out for some reason. My understanding is you are handling this cases by by putting the pages on the container observer list (requester.addObserver(this)). Please add some comments to why this page is added to observer list. 3) + + // just deadlock out if a transaction tries to double latch the + // page while not in abort comment is good. But , you may want to add some assertion or throw error here, if this case should not happen. 4) /** Debugging, print slot table information */ protected String slotTableToString() I don't know why you removed this method in this patch. May be this method is not used or not required. Occasionally I found these type of methods useful while debugging a page corruption to quickly dump page info with some minor changes to the code. 5) + // Expect notify from releaseExclusive(). + wait(); I was wondering , if the wait() method here should be time based to catch any infinite waits due to incorrectly missing release latch calls or you think it is going to be be unnecessarily expensive ? 5) Did you find the existing unit tests already tests latching methods ? or you are planning to write one. Thanks -suresh
          Hide
          Knut Anders Hatlen added a comment -

          Thanks for reviewing the patch, Suresh! Those are great
          suggestions. I'll try to answer your questions below.

          1) I'll update that comment.

          2) I don't think the observer is used to clean up and release the
          latch if the thread errors out, but rather to release the latch if the
          container is closed/dropped while the page is latched. I'm not sure
          why we need special handling of this situation (I'd assume that the
          general error handling/clean-up would work for this situation too),
          but I wanted to keep as much of the old behaviour as possible. I'll
          see if I can improve the comments.

          If a thread errors out after the page has been latched, I think it is
          the the caller's responsibility to detect the error and unlatch the
          page. At least, it seems like this is the pattern being used in the
          code:

          Page page = getPage(...);
          try

          { // do something with page }

          finally

          { page.unlatch(); }

          3 and 5) I agree that the wait() should be timed. I was planning to
          change it in a followup patch if that sounds OK. What would be a
          reasonable timeout value? Ten seconds?

          With a timed wait, there would be no need for an assert/error in the
          double-latching case, since we would get a timeout error in the end.

          4) I'll keep the method (it is unused, but as you said it could be
          useful for debugging).

          6) There are some unit tests which check that the pages are correctly
          latched (for instance T_RawStoreFactory, T_Recovery, T_b2i). However,
          I don't think any of them test that the latching behaves correctly
          when there's a conflict. I'll see if I can write some tests for
          that. Would it be OK to add such tests in a separate followup patch?

          Show
          Knut Anders Hatlen added a comment - Thanks for reviewing the patch, Suresh! Those are great suggestions. I'll try to answer your questions below. 1) I'll update that comment. 2) I don't think the observer is used to clean up and release the latch if the thread errors out, but rather to release the latch if the container is closed/dropped while the page is latched. I'm not sure why we need special handling of this situation (I'd assume that the general error handling/clean-up would work for this situation too), but I wanted to keep as much of the old behaviour as possible. I'll see if I can improve the comments. If a thread errors out after the page has been latched, I think it is the the caller's responsibility to detect the error and unlatch the page. At least, it seems like this is the pattern being used in the code: Page page = getPage(...); try { // do something with page } finally { page.unlatch(); } 3 and 5) I agree that the wait() should be timed. I was planning to change it in a followup patch if that sounds OK. What would be a reasonable timeout value? Ten seconds? With a timed wait, there would be no need for an assert/error in the double-latching case, since we would get a timeout error in the end. 4) I'll keep the method (it is unused, but as you said it could be useful for debugging). 6) There are some unit tests which check that the pages are correctly latched (for instance T_RawStoreFactory, T_Recovery, T_b2i). However, I don't think any of them test that the latching behaves correctly when there's a conflict. I'll see if I can write some tests for that. Would it be OK to add such tests in a separate followup patch?
          Hide
          Knut Anders Hatlen added a comment -

          New revision of the patch (1d) with the following changes:

          • updated comments as suggested
          • doesn't delete slotTableToString()
          • factored out some common code from setExclusive()/setExclusiveNoWait() into a new method called preLatch()
          Show
          Knut Anders Hatlen added a comment - New revision of the patch (1d) with the following changes: updated comments as suggested doesn't delete slotTableToString() factored out some common code from setExclusive()/setExclusiveNoWait() into a new method called preLatch()
          Hide
          Daniel John Debrunner added a comment -

          I agree that any timeout on the latch wait should be a separate patch & I believe a separate Jira issue.
          For one thing it will change the api to use patches, e.g. this logic would no longer work:

          Page page = getPage(...);
          try

          { // do something with page }

          finally

          { page.unlatch(); }

          as obviously with a timeout there would be a path through the finally block where the page is not latched.

          I'm not sure I see the value of a timeout though, though that discussion should be in any new issue.

          Show
          Daniel John Debrunner added a comment - I agree that any timeout on the latch wait should be a separate patch & I believe a separate Jira issue. For one thing it will change the api to use patches, e.g. this logic would no longer work: Page page = getPage(...); try { // do something with page } finally { page.unlatch(); } as obviously with a timeout there would be a path through the finally block where the page is not latched. I'm not sure I see the value of a timeout though, though that discussion should be in any new issue.
          Hide
          Suresh Thalamati added a comment -

          Please commit the patch, you can address tests or comment fixes in
          separate patches.

          >2) I don't think the observer is used to clean up and release the
          l> atch if the thread errors out, but rather to release the latch if the
          >>container is closed/dropped while the page is latched. I

          My understanding of this is cleanup will tigger abort, or save point
          rollback which will trigger the close of container handle, which inturn
          will trigger release of latches on pages that are on the observer
          list of the container handle.

          I agree with you in most of the cases you will find latches being
          released in the finally{ } blocks. I belive it is good practice too. But
          I am not sure that is case all throughout the code, in any case i belive
          cleanup will release the latches, if there are any of those missing cases.

          >> 3 and 5) I agree that the wait() should be timed. I was planning to
          >change it in a followup patch if that sounds OK. What would be a
          >reasonable timeout value? Ten seconds?
          >>

          I was thinking more like 1/2 hour or more, mainly to get some kind of stack/information
          and avoid the complaints that the derby freezes forever without any clues, especially
          when derby is embedded in another application; It may be the case that there are no
          bugs and will not be any bugs in this area, so it may not be worth adding a timeout.
          I agree with Dan, this can be addressed as separate issue, if needed.

          -suresh

          Show
          Suresh Thalamati added a comment - Please commit the patch, you can address tests or comment fixes in separate patches. >2) I don't think the observer is used to clean up and release the l> atch if the thread errors out, but rather to release the latch if the >>container is closed/dropped while the page is latched. I My understanding of this is cleanup will tigger abort, or save point rollback which will trigger the close of container handle, which inturn will trigger release of latches on pages that are on the observer list of the container handle. I agree with you in most of the cases you will find latches being released in the finally{ } blocks. I belive it is good practice too. But I am not sure that is case all throughout the code, in any case i belive cleanup will release the latches, if there are any of those missing cases. >> 3 and 5) I agree that the wait() should be timed. I was planning to >change it in a followup patch if that sounds OK. What would be a >reasonable timeout value? Ten seconds? >> I was thinking more like 1/2 hour or more, mainly to get some kind of stack/information and avoid the complaints that the derby freezes forever without any clues, especially when derby is embedded in another application; It may be the case that there are no bugs and will not be any bugs in this area, so it may not be worth adding a timeout. I agree with Dan, this can be addressed as separate issue, if needed. -suresh
          Hide
          Knut Anders Hatlen added a comment -

          Thanks! Committed derby-2107-1d.diff with revision 503440.

          Show
          Knut Anders Hatlen added a comment - Thanks! Committed derby-2107-1d.diff with revision 503440.
          Hide
          Knut Anders Hatlen added a comment -

          > I agree that any timeout on the latch wait should be a separate
          > patch & I believe a separate Jira issue. For one thing it will
          > change the api to use patches, e.g. this logic would no longer work:
          >
          > Page page = getPage(...);
          > try

          { > // do something with page > }

          finally

          { > page.unlatch(); > }

          >
          > as obviously with a timeout there would be a path through the
          > finally block where the page is not latched.

          I don't think this will change since the call to getPage() is outside
          the try block. The timeout exception will be thrown inside getPage()
          and the reference to the page object won't be returned.

          > I'm not sure I see the value of a timeout though, though that
          > discussion should be in any new issue.

          I'll file a new issue where we can discuss ways to report/debug
          deadlocks.

          Show
          Knut Anders Hatlen added a comment - > I agree that any timeout on the latch wait should be a separate > patch & I believe a separate Jira issue. For one thing it will > change the api to use patches, e.g. this logic would no longer work: > > Page page = getPage(...); > try { > // do something with page > } finally { > page.unlatch(); > } > > as obviously with a timeout there would be a path through the > finally block where the page is not latched. I don't think this will change since the call to getPage() is outside the try block. The timeout exception will be thrown inside getPage() and the reference to the page object won't be returned. > I'm not sure I see the value of a timeout though, though that > discussion should be in any new issue. I'll file a new issue where we can discuss ways to report/debug deadlocks.
          Hide
          Knut Anders Hatlen added a comment -

          derby-2107-2a.diff removes the now unused latch methods from the lock manager. It also removes one test case from T_LockFactory which tests the methods. Derbyall and the JUnit tests passed.

          Show
          Knut Anders Hatlen added a comment - derby-2107-2a.diff removes the now unused latch methods from the lock manager. It also removes one test case from T_LockFactory which tests the methods. Derbyall and the JUnit tests passed.
          Hide
          Knut Anders Hatlen added a comment -

          Reattaching 2a with "grant license" checked.

          Show
          Knut Anders Hatlen added a comment - Reattaching 2a with "grant license" checked.
          Hide
          Knut Anders Hatlen added a comment -

          Committed 2a with revision 504462.

          Show
          Knut Anders Hatlen added a comment - Committed 2a with revision 504462.
          Hide
          Knut Anders Hatlen added a comment -

          Patch 3a adds a test case to T_RawStoreFactory. The new test case tests that multiple requests for a page block each other out.

          Show
          Knut Anders Hatlen added a comment - Patch 3a adds a test case to T_RawStoreFactory. The new test case tests that multiple requests for a page block each other out.
          Hide
          Knut Anders Hatlen added a comment -

          Committed 3a with revision 506899.

          Show
          Knut Anders Hatlen added a comment - Committed 3a with revision 506899.
          Hide
          Knut Anders Hatlen added a comment -

          Attaching a patch (4a) to address comment #3 from Suresh. It adds an assert which fails when a transaction attempts to latch the same page twice. Also needed to update the test case added by the 3a patch, since it now throws an assert failure instead of hanging when using a sane build. All the tests ran cleanly.

          Show
          Knut Anders Hatlen added a comment - Attaching a patch (4a) to address comment #3 from Suresh. It adds an assert which fails when a transaction attempts to latch the same page twice. Also needed to update the test case added by the 3a patch, since it now throws an assert failure instead of hanging when using a sane build. All the tests ran cleanly.
          Hide
          Knut Anders Hatlen added a comment -

          Committed 4a with revision 532636.
          I think all comments have been addressed (either by fixing them or by opening new JIRAs), so I'm resolving the issue.

          Show
          Knut Anders Hatlen added a comment - Committed 4a with revision 532636. I think all comments have been addressed (either by fixing them or by opening new JIRAs), so I'm resolving the issue.

            People

            • Assignee:
              Knut Anders Hatlen
              Reporter:
              Knut Anders Hatlen
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development