Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 10.2.2.0, 10.3.1.4, 10.4.2.0
    • Fix Version/s: 10.5.1.1
    • Component/s: Store
    • Labels:
      None
    • Environment:
      Windows XP, Java 6
    • Issue & fix info:
      High Value Fix

      Description

      After doing some research on the mailing list, it appears that the index split deadlock is a known behaviour, so I will start by describing the theoretical problem first and then follow with the details of my test case.

      If you have concurrent select and insert transactions on the same table, the observed locking behaviour is as follows:

      • the select transaction acquires an S lock on the root block of the index and then waits for an S lock on some uncommitted row of the insert transaction
      • the insert transaction acquires X locks on the inserted records and if it needs to do an index split creates a sub-transaction that tries to acquire an X lock on the root block of the index

      In summary: INDEX LOCK followed by ROW LOCK + ROW LOCK followed by INDEX LOCK = deadlock

      In the case of my project this is an important issue (lack of concurrency after being forced to use table level locking) and I would like to contribute to the project and fix this issue (if possible). I was wondering if someone that knows the code can give me a few pointers on the implications of this issue:

      • Is this a limitation of the top-down algorithm used?
      • Would fixing it require using a bottom-up algorithm for better concurrency (which is certainly non-trivial)?
      • Trying to break the circular locking above, I would first question why the select transaction needs to acquire (and hold) a lock on the root block of the index. Would it be possible to ensure the consistency of the select without locking the index?

      The attached test (InsertSelectDeadlock.java) tries to simulate a typical data collection application (a stripped-down sketch of the two threads follows the list below). It consists of:

      • an insert thread that inserts records in batch
      • a select thread that 'processes' the records inserted by the other thread: 'select * from table where id > ?'
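
      A stripped-down sketch of the two threads (an illustration only, not the attached InsertSelectDeadlock.java; the JDBC URL and the simplified TRACK_EVENT schema are assumptions):

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.PreparedStatement;
      import java.sql.ResultSet;
      import java.sql.SQLException;

      public class InsertSelectSketch {
          static final String URL = "jdbc:derby:testdb;create=true";

          public static void main(String[] args) throws Exception {
              Connection setup = DriverManager.getConnection(URL);
              setup.createStatement().executeUpdate(
                  "CREATE TABLE track_event (id BIGINT PRIMARY KEY, payload VARCHAR(100))");
              setup.close();

              // Insert thread: inserts records in batches, one transaction per batch.
              Thread inserter = new Thread(new Runnable() {
                  public void run() {
                      try {
                          Connection c = DriverManager.getConnection(URL);
                          c.setAutoCommit(false);
                          PreparedStatement ins =
                              c.prepareStatement("INSERT INTO track_event VALUES (?, ?)");
                          long id = 0;
                          while (true) {
                              for (int i = 0; i < 100; i++) {
                                  ins.setLong(1, id++);
                                  ins.setString(2, "event");
                                  ins.executeUpdate();
                              }
                              c.commit();
                          }
                      } catch (SQLException e) {
                          e.printStackTrace();
                      }
                  }
              });

              // Select thread: 'processes' the inserted records with a range scan.
              Thread selector = new Thread(new Runnable() {
                  public void run() {
                      try {
                          Connection c = DriverManager.getConnection(URL);
                          PreparedStatement sel =
                              c.prepareStatement("SELECT * FROM track_event WHERE id > ?");
                          long lastSeen = 0;
                          while (true) {
                              sel.setLong(1, lastSeen);
                              ResultSet rs = sel.executeQuery();
                              while (rs.next()) {
                                  lastSeen = rs.getLong(1);
                              }
                              rs.close();
                          }
                      } catch (SQLException e) {
                          e.printStackTrace();
                      }
                  }
              });

              inserter.start();
              selector.start();
          }
      }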

      The derby log provides details of the deadlock trace, and stacktraces_during_deadlock.txt shows that the insert thread is doing an index split.

      The test was run on 10.2.2.0 and 10.3.1.4 with identical behaviour.

      Thanks,

      Bogdan Calmac.

      1. test-4.diff
        0.8 kB
        Knut Anders Hatlen
      2. test-3.diff
        12 kB
        Knut Anders Hatlen
      3. test-2.diff
        9 kB
        Knut Anders Hatlen
      4. test-1.diff
        12 kB
        Knut Anders Hatlen
      5. stacktraces_during_deadlock.txt
        5 kB
        Bogdan Calmac
      6. Repro2991.java
        4 kB
        Kurt Huwig
      7. perftest.diff
        10 kB
        Knut Anders Hatlen
      8. InsertSelectDeadlock.java
        6 kB
        Bogdan Calmac
      9. derby.log
        45 kB
        Bogdan Calmac
      10. d2991-preview-1e.diff
        269 kB
        Knut Anders Hatlen
      11. d2991-preview-1d.stat
        2 kB
        Knut Anders Hatlen
      12. d2991-preview-1d.diff
        265 kB
        Knut Anders Hatlen
      13. d2991-preview-1c.stat
        2 kB
        Knut Anders Hatlen
      14. d2991-preview-1c.diff
        261 kB
        Knut Anders Hatlen
      15. d2991-preview-1b.stat
        1 kB
        Knut Anders Hatlen
      16. d2991-preview-1b.diff
        225 kB
        Knut Anders Hatlen
      17. d2991-preview-1a.stat
        0.4 kB
        Knut Anders Hatlen
      18. d2991-preview-1a.diff
        17 kB
        Knut Anders Hatlen
      19. d2991-2b.stat
        2 kB
        Knut Anders Hatlen
      20. d2991-2b.diff
        444 kB
        Knut Anders Hatlen
      21. d2991-2a.stat
        2 kB
        Knut Anders Hatlen
      22. d2991-2a.diff
        436 kB
        Knut Anders Hatlen

        Issue Links

          Activity

          Bogdan Calmac added a comment -

           The test demonstrating the deadlock

          Bogdan Calmac added a comment -

          The deadlock trace

          Bogdan Calmac added a comment -

          The stacktrace before deadlock indicating an index split

          John H. Embretsen added a comment -

           This is an interesting case which reminds me of (but is different from) a deadlock issue I encountered in a long-running test not so long ago. I was not able to reproduce the deadlock, but we suspect that there may be a bug in the code that creates the deadlock message about the lock cycles; see DERBY-2877 for details.

          This test (InsertSelectDeadlock.java) is interesting because you get a lock timeout message rather than a deadlock message (even if you add a deadlockTimeout property with a lower value than the waitTimeout property - I've tried), although in practice this is a deadlock.

          Also, even if derby.language.logStatementText is set to true (I edited the test class to do that), you don't get any more information about the offending statement which (presumably) holds an exclusive lock on the index root (1,1).

          Here's the relevant information extracted from the derby.log Bogdan attached:

          XID |TYPE |MODE|LOCKCOUNT|LOCKNAME|STATE|INDEXNAME |
          ---------------------------------------------------------------------------

           *** The following row is the victim ***
           279 |ROW |S |0 |(1,7) |WAIT |NULL |
           *** The above row is the victim ***
           269 |ROW |X |1 |(1,7) |GRANT|NULL |
           282 |ROW |X |0 |(1,1) |WAIT |SQL070802115706620 |
           279 |ROW |S |1 |(1,1) |GRANT|SQL070802115706620 |

          (I excluded things like tabletype (always T) and tablename (always TRACK_EVENT)).

          The transactions above are probably:

          XID 279: select * from TRACK_EVENT where ID > ?
          XID 269: insert into TRACK_EVENT values (?,?,?,?,?,?,?,?,?,?,?,?,?,?)
          XID 282: [index btree split?]

          We are assuming (based on the stack traces) that XID 269 has spawned another transaction, XID 282, which is trying to modify the index, but is not allowed to because XID 279 is already holding an S (Shared) lock on it. I guess derby does not recognize this as a deadlock because the index-modifying transaction has a completely different transaction number. So, in systems that don't have deadlockTrace enabled, deadlocks may be camouflaged as lock wait timeouts.
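
           For reference, the same lock information can also be inspected while the statements hang by querying Derby's diagnostic tables (the ones Jeff Stuckman queries further down in this issue). This is only a sketch and assumes a Derby version where the SYSCS_DIAG tables are available to plain SQL:

           import java.sql.Connection;
           import java.sql.DriverManager;
           import java.sql.ResultSet;
           import java.sql.Statement;

           public class DumpLocks {
               public static void main(String[] args) throws Exception {
                   // Run from another connection while the insert and select statements hang.
                   Connection conn = DriverManager.getConnection("jdbc:derby:testdb");
                   Statement s = conn.createStatement();
                   ResultSet rs = s.executeQuery("SELECT * FROM SYSCS_DIAG.LOCK_TABLE");
                   while (rs.next()) {
                       System.out.println(rs.getString("XID") + " | " + rs.getString("TYPE")
                           + " | " + rs.getString("MODE") + " | " + rs.getString("LOCKNAME")
                           + " | " + rs.getString("STATE") + " | " + rs.getString("INDEXNAME"));
                   }
                   rs.close();
                   s.close();
                   conn.close();
               }
           }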

          I was kind of hoping that this comment would help so that more people could understand these issues, hence increasing the chances of getting them fixed (or, at the least, improving related documentation). I'm sorry for not being able to work out a solution or provide answers to Bogdan's questions at this point, but I'm hoping someone else will be assisting soon.

          Knut Anders Hatlen added a comment -

          > - Trying to break the circular locking above, I would first question
          > why does the select transaction need to acquire (and hold) a lock
          > on the root block of the index.

          B-tree scans lock the leaf node on which they are currently positioned
          (in your case the root node is probably also a leaf node) so that they
          don't have to reposition if they lose the latch on the page. The scan
          always releases the latch on the B-tree leaf if it has to wait for a
          row lock. The latch is released in order to prevent deadlocks. Since
          the page is not latched while the scan waits for the row lock, the row
          might have moved to another page in the index and therefore the scan
          would have to start a new search to reposition on the row. The shared
          lock on the B-tree leaf ensures that the row is not moved off the
          current page so that the repositioning isn't needed. Since the lock on
          the leaf node is shared, it allows more concurrency than if the
          (exclusive) latch were kept. But since the lock on the leaf is more or
          less like a shared latch, there's still a possibility for deadlocks if
          the B-tree needs to be restructured (like the split deadlock you're
          seeing).

          > Would it be possible to ensure the consistency of the select without
          > locking the index?

          I think so, if the select transaction had a way to reposition the scan
          after it got the row lock it was waiting for. The current code just
          checks the leaf page again to find the row, since it knows it can't
          have moved to another page. If there weren't a lock on the index page,
          the scan would need to detect that the row had moved and start a new
          search from the root node.

          I don't know enough about this area of the code to say whether or not
          it is doable, or how much work it is. Hopefully, someone more
          knowledgeable will chime in with more details.

          Kurt Huwig added a comment -

          Happens with 10.3.1.4 as well

          Kurt Huwig added a comment -

          Simplified testcase with only one column.

          Kurt Huwig added a comment -

           I've got the same problem nearly every day in my application, so I simplified the example file from the mailing list. I also tried it with MySQL, without a deadlock.

           John H. Embretsen added a comment -

           Related mailing list threads on derby-user, early August 2007:
           http://www.nabble.com/lock-escalation-and-deadlocks-t4197785.html
           http://www.nabble.com/Concurrent-select-and-insert-deadlock-on-index-t4204088.html
          Bogdan Calmac added a comment -

           The first thread (http://www.nabble.com/lock-escalation-and-deadlocks-t4197785.html) is just my wrong interpretation of the cause, so it can be discarded.

           Tomohito Nakayama added a comment - (edited)

          Hello.

           Reading the code of the repro program, I found that the insert thread repeatedly inserts multiple rows per transaction, while the select thread selects rows with a range condition, which results in an index scan on the same table.

           I think the problem is that inserting multiple rows in one transaction takes multiple locks on data rows and index rows, and those locks then deadlock with a concurrent select on the same table, because selecting rows with a range condition also needs locks on the data rows and index rows in the scanned range.

           I think there are two ways to work around this situation (the second one is sketched below):

           1:
           Do not insert multiple rows per transaction while simultaneously selecting rows of the same table with a range condition, as the repro program does. I tried changing tracksPerBatch to 1 and the deadlock in the repro program was resolved.

           2:
           Loosen the transaction isolation level. I tried conn.setTransactionIsolation(Connection.TRANSACTION_READ_UNCOMMITTED) and the deadlock in the repro program was resolved.

           I'm not sure whether those workarounds are suitable for your case, but I hope it helps you.

           I also think a feature like SQL hints might be helpful in such cases. If we could suppress the use of the index for the select operation with a hint, we could also escape the problem that way.
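
           A minimal sketch of the second workaround applied to the select side (the JDBC URL and table name are assumptions carried over from the attached repro; this is an illustration, not the repro itself):

           import java.sql.Connection;
           import java.sql.DriverManager;
           import java.sql.PreparedStatement;
           import java.sql.ResultSet;

           public class ReadUncommittedSelect {
               public static void main(String[] args) throws Exception {
                   Connection c = DriverManager.getConnection("jdbc:derby:testdb");
                   // Dirty reads become possible, but the scan no longer waits on
                   // uncommitted rows from the insert transaction.
                   c.setTransactionIsolation(Connection.TRANSACTION_READ_UNCOMMITTED);
                   PreparedStatement sel =
                       c.prepareStatement("SELECT * FROM track_event WHERE id > ?");
                   sel.setLong(1, 0L);
                   ResultSet rs = sel.executeQuery();
                   while (rs.next()) {
                       // 'process' the row
                   }
                   rs.close();
                   c.close();
               }
           }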

          Kurt Huwig added a comment -

           Hi Tomohito,

           in my application, every insert is done on its own and the isolation is set to READ_UNCOMMITTED. Still, the very same deadlock happens, although it takes several days and thousands of inserts and selects to manifest.

           The main difference of my application is that the database is embedded, but it is accessed both embedded and via the network. Do you think it would solve the problem if I only used network access? Or is there no difference?

          Tomohito Nakayama added a comment -

           Well ....
           I set the transaction isolation level of both the select and insert operations to TRANSACTION_READ_UNCOMMITTED.
           In that situation, I didn't see the deadlock problem.

           I'm not sure about your exact situation beyond the uploaded program...

           I don't think your problem has anything to do with embedded versus network server/client,
           though I don't think using embedded mode over a network file sharing system is good practice ....

          Kristian Waagan added a comment -

           As a data point for Kurt's last comment/question, I don't think accessing the database using only the client driver matters either.
          I do sometimes see the deadlock in an application where all clients are using the client driver.
          In this application only one row is inserted per transaction as well, but in addition there is a multi-row select and a single-row delete.

          In my case I was able to tune other parts of the application to work around the problem, but I am unable to tell exactly what caused the problem.
           For your information, it is running with READ_COMMITTED (or REPEATABLE_READ), the load could be described as medium+, and the table itself is normally pretty small (less than 1000 rows, sometimes down to a few hundred).

          Tomohito Nakayama added a comment -

           My understanding of this issue is that
           Derby uses locks on the table and the index as a way to realize transaction isolation, and
           care needs to be taken with those locks.

           I found information about index locking at the following URL:
           http://db.apache.org/derby/papers/btree_package.html

          Øystein Grøvlen added a comment -

           See the analysis made by Knut Anders above (17/Aug/07). This is a deadlock involving an index scan that has taken a shared lock on a special row of the page to prevent the row it is waiting for from being moved to another page. When another transaction needs to split the page, this creates a deadlock, since the splitter needs this special lock in order to perform the split. Hence, I think all isolation levels except READ UNCOMMITTED will have this problem. (Since for all other isolation levels the scanner will have to wait for the exclusive lock of the inserter.)

           As Knut Anders says, this deadlock could be avoided if the scanner did not lock this special row, but instead had some way to detect that the row has moved, and then renavigate from the root in such cases. I would think something similar happens during rollback, where an index row may have moved due to page splits between do and undo. Maybe something similar could be done here. I think there are two parts to this (a sketch combining them follows below):

           1. When a scan needs to wait for a lock, instead of locking this special row, it copies the key of the row so that when it has acquired the lock it is waiting for, it can restart the scan using this key.

           2. In order to detect when a renavigation is needed, the page needs to have some kind of state id that indicates whether a renavigation may be needed. That is, should some row move off the page or otherwise have its address changed, the state id will be incremented. This way the scanner, by copying the current state id before it is suspended, can detect whether the page has changed. (One needs to consider whether this state id is something that needs to be persisted on the page or whether it can just reside in memory. It depends on whether it may occur that the page is no longer in the page cache when the scan resumes.)
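
           To make the two parts concrete, here is a small hypothetical sketch; none of these classes, fields or methods (LeafPage, lastRowMoveVersion, renavigateFromRoot, ...) are Derby's real API, they are invented purely for illustration:

           // Hypothetical sketch of the proposal above; not actual Derby code.
           class LeafPage {
               long pageVersion;        // bumped on every logged change (as today)
               long lastRowMoveVersion; // in-memory only: version at which a row last moved
           }

           class IndexScan {
               private byte[] savedKey;       // part 1: copy of the key at the current position
               private long savedPageVersion; // part 2: page state observed before waiting

               // Called when the scan must wait for a row lock and releases the latch.
               void beforeWait(LeafPage page, byte[] currentKey) {
                   savedKey = currentKey.clone();
                   savedPageVersion = page.pageVersion;
               }

               // Called after the lock is granted and the page is latched again
               // (page may be null if it was evicted from the page cache).
               void afterWait(LeafPage page) {
                   if (page == null || page.lastRowMoveVersion > savedPageVersion) {
                       renavigateFromRoot(savedKey); // search the B-tree again by key
                   }
                   // otherwise the old position on the page is still valid
               }

               private void renavigateFromRoot(byte[] key) {
                   // B-tree search from the root using the saved key
               }
           }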

          Kurt Huwig added a comment -

          Sorry for the noise, but my application even gets the deadlock with

          Connection.TRANSACTION_READ_UNCOMMITTED

           Unfortunately it is a mail cluster and it only gets this with 200+ simultaneous mail connections after several hours/days, so it is somewhat hard to create a reproducing example. But I get it every second day, so I can easily verify any fix.

          Tomohito Nakayama added a comment -

           I might be bound by the idea of locking for index scans ...
           My concern is about the situation where a range of the index that has already been scanned is restructured.

           That situation would mean
           that rows fetched before the restructuring are not based on the restructured index,
           while rows fetched after the restructuring are.

           Thinking about that, I think a lock on the scanned index is needed, at least if the transaction isolation level is SERIALIZABLE.

          Tomohito Nakayama added a comment -

           I found what I was confused about ...

           The lock taken when scanning the index is held only for the current position.
           I had thought locks were held not only for the current position but also for the whole scanned range.

           Reading the code of BTreeScan#positionAtNextPage,
           the lock for the current page is released when proceeding to the next page.

          Tomohito Nakayama added a comment -

          > One need to consider whether this state id is something that needs to be persisted on the page or whether it can just reside in memory. It depends on whether it may occur that the page is no longer in the page cache when the scan resumes.

           It would be necessary for the state id discussed here to be persisted.
           According to the code of the StoredPage#releaseExclusive() method, the page is removed from the CacheManager when it is unlatched.

           org.apache.derby.impl.store.raw.data.StoredPage#releaseExclusive():

           /**
            * Ensure that the page is released from the cache when it is unlatched.
            *
            * @see org.apache.derby.impl.store.raw.data.BasePage#releaseExclusive
            **/
           protected void releaseExclusive() {
               super.releaseExclusive();
               pageCache.release(this);
           }
          Knut Anders Hatlen added a comment -

          The page is not removed from the page cache when it is unlatched, but its keep count is decremented, so it is possible that someone else has removed it when the scan is resumed.

          BasePage already has a pageVersion field, but it's only incremented when a logged action is performed on the page. I assume there also are unlogged actions that may move the row off the page?

          Tomohito Nakayama added a comment -

           I found that the methods of StoredPage which update the contents of a page call the logAction method, which also bumps the pageVersion value.
           Because the index is kept in a StoredPage, I think any action which moves a row off the page is handled using those methods in StoredPage and is logged.

          Øystein Grøvlen added a comment -

           I think pageVersion is bumped on all (logged) changes to the page. The ideal here would be a version number that is only bumped when a row is moved. Otherwise, most re-navigations would happen when not necessary. Given that an exclusive lock normally leads to an update, one might as well do an unconditional re-navigation after having waited instead of basing it on pageVersion.

          Knut Anders Hatlen added a comment -

          Could we then have a variable which contains the last page version on which a row was moved, but not store the value to disk? Then the waiter could do something like

           long savedVersion = pageVersion;
           waitForLock();
           if (lastRowMove == UNINITIALIZED) {
               // page was evicted from the page cache while we were waiting,
               // row might have moved
               renavigate();
           } else if (lastRowMove > savedVersion) {
               // a row has been moved while we were waiting
               renavigate();
           } else {
               // row has not moved
           }

          Then we only renavigate when a row has moved or the page has been removed from the cache.

          Øystein Grøvlen added a comment -

          Knut Anders Hatlen (JIRA) wrote:

          > Then we only renavigate when a row has moved or the page has been removed from the cache.

          I think that is a good idea.

          Tomohito Nakayama added a comment -

           I think lastRowMove would need to be an instance variable of the cached page object ...
           Here we encounter the removed-cache problem again ....

          Tomohito Nakayama added a comment -

           Oh, I understand.
           We treat removal of the page from the cache as a signal that we need to renavigate.

          Tomohito Nakayama added a comment -

           I found there is still some confusion...

           > long savedVersion = pageVersion;
           > waitForLock();
           > if (lastRowMove == UNINITIALIZED) {
           >     // page was evicted from the page cache while we were waiting,
           >     // row might have moved
           >     renavigate();
           > } else if (lastRowMove > savedVersion) {
           >     // a row has been moved while we were waiting
           >     renavigate();
           > } else {
           >     // row has not moved
           > }

          > waitForLock();

           We were thinking about not locking the row.
           If so, I think this plan is impossible.

          Tomohito Nakayama added a comment -

           I read the current code (and Knut's comment) as follows: a shared lock is taken instead of the page latch, and
           the chance to release the latch is provided through the lock attempt,
           in order to escape deadlock.

           If we no longer take the shared lock in the program, all we can do to escape the deadlock is just release the latch.

           Releasing the latch may result in the StoredPage, which also carries the latch for the Page, being removed from the cache.

           If the cached page is not removed, we can use lastRowMove of StoredPage as above.
           If the cached page is removed, there are three cases:
           1: Only the index scan held the latch and the cached page was removed. The row has not moved.
           2: Another operation also held the latch, moved the row, released the latch, and the cached page was removed.
           3: Another operation also held the latch, did not move the row but stored something, released the latch, and the cached page was removed.

           By looking at the page version of the reloaded page, we can distinguish case 1 from the others.
           I think it would not be so bad to check the rows in the page to tell 2 from 3.

          Tomohito Nakayama added a comment -

           I think the code would look like this.
           I think we also need to look at the contents of the page if its cache entry was removed at some point.

           long savedVersion = pageVersion;
           page.unlatch();
           page = null;
           <snip>

           // where we need to see the row again
           page = getFromCache();

           if (page == null) {
               page = getFromStorage();
               if (page.pageVersion == savedVersion) {
                   // row has not moved
               } else if (checkRowMoved(page)) {
                   renavigate();
               } else {
                   // row has not moved
               }
           } else if (page.lastRowMove == UNINITIALIZED) {
               // this can be true if the cache entry for the page was removed
               // once and loaded again
               if (checkRowMoved(page)) {
                   // the cache entry was removed after the row moved
                   renavigate();
               } else {
                   // row has not moved
               }
           } else if (page.lastRowMove > savedVersion) {
               // a row has been moved while we were waiting
               renavigate();
           } else {
               // row has not moved
           }

          Knut Anders Hatlen added a comment -

          I'm hoping to find some time to investigate this issue.

          So far, only one suggestion for how we can fix this has come up. The
          basic idea is that we don't set the scan lock to protect the scan
          position, and instead we reposition the scan if we need to wait for a
          lock and therefore have released the latch on the index leaf.

          Most of the discussion has been about how we can avoid the
          repositioning when it's not needed, and I think we have a reasonably
          good understanding of how we can do that by having a (non-persistent)
          field in the page objects that give us a version number after which it
          is guaranteed that no row has moved off the page.

          Earlier in the discussion we have said that this field should have a
          special value to tell that it's uninitialized when the page is faulted
          in to the page cache. After some more thinking, I now believe it would
          be more reasonable to initialize it to the current version of the
          page. Otherwise, the field would be uninitialized for most of the
          cached pages, and it wouldn't help us avoiding repositioning most of
          the time.

          What hasn't been discussed yet, is how we can do the repositioning
          when we don't have the scan lock. Øystein mentioned that the scan
          could copy the key when it needs to wait for a lock, and use the key
          to reposition itself in the B-tree. I think that sounds like a good
          approach, but there are some things that are not quite clear to me:

          1) Would copying the key work for non-unique indexes as well? I guess
          so since the non-unique indexes seem to append the RecordId to the
          key.

          2) How stable are the RecordIds? Can they be changed by other
          operations than compress table?

          3) If the isolation level is READ_COMMITTED or higher, the open scans
          would have an intentional read lock on the table, so even without the
          scan locks compress table shouldn't be able to change the RecordIds
          while a scan is waiting for a lock (the intentional read lock should
          block compress table when it attempts to write lock the table). But
          would this work in READ_UNCOMMITTED? Or rather, does it work correctly
          in READ_UNCOMMITTED today? READ_UNCOMMITTED doesn't set shared locks,
          so I would assume that it doesn't set any shared scan locks either.

          Comments and suggestions would be appreciated. I'll do some more
          digging and report back with my findings.

          Knut Anders Hatlen added a comment -

          It looks like READ_UNCOMMITTED transactions also obtain an IS lock on the tables they are scanning, and if they're performing an index scan an S lock on the scan control row. I think this means that the RecordId will not change because of compress table performed in another transaction while there's an open scan. I'm not sure if that's always true if inplace compress table is performed in the same transaction, but it seems like at least the purging is performed in a nested transaction and will time out on getting an exclusive lock if the parent transaction has an open scan on that table.

          Knut Anders Hatlen added a comment -

          We already have code to save the position by key and reposition using
          the saved key (methods savePosition() and reposition() in
          BTreeScan). Currently, we only save the key on transaction borders so
          that we can reposition a holdable cursor after a commit. This is
          needed because transactions release all their locks on commit, scan
          locks included.

          I think it should be possible to use the same logic to save the
          position within a transaction. The current position of an index scan
          is stored in a BTreeRowPosition object, either as a Page/RecordId pair
          which requires a scan protection lock, or as a key value. Every time
          the access layer is reentered, reposition() is called which either
          just reacquires the latch on the page (if there's a scan lock) or
          repositions from the root of the B-tree (if there's no scan lock).

          So I think that if we find a way to ensure that we always save the
          position by key when we're about to release the latch on a leaf page
          (either because we cannot obtain a lock immediately, or because we're
          leaving the access layer), the repositioning code should work more or
          less as it is.

          I'm hoping that saving the key each time we release the latch doesn't
          have a too heavy impact on single-threaded performance. For short
          keys, it should be relatively cheap, and the cost of saving the key
          may be outweighed by what we save by not having to maintain the scan
          protection lock.

          With the optimizations suggested for the repositioning in earlier
          comments, I believe that the repositioning will not be more expensive
          than today in the common case where the page hasn't been split.

          I'd welcome any comments and suggestions on how to implement this.
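
           As an illustration of the dual position representation described above (this is only a sketch; ScanPosition, saveByKey and the BTree interface are invented names, not Derby's BTreeRowPosition API):

           interface BTree {
               void searchFromRoot(byte[] key, ScanPosition pos); // reposition by key
               void latchPage(long pageNumber);                   // just re-latch the page
           }

           class ScanPosition {
               long pageNumber;  // only valid while the latch (or a scan lock) protects it
               int slot;         // slot of the current row on that page
               byte[] savedKey;  // full key copy, valid across a latch release

               // Called just before the latch on the leaf is released, either because
               // the scan has to wait for a lock or because it leaves the access layer.
               void saveByKey(byte[] currentKey) {
                   savedKey = currentKey.clone();
               }

               // Called when the scan re-enters the access layer.
               void reposition(BTree btree) {
                   if (savedKey != null) {
                       // no scan protection lock: search from the root for the saved key
                       btree.searchFromRoot(savedKey, this);
                       savedKey = null;
                   } else {
                       // position still protected: just reacquire the latch on the page
                       btree.latchPage(pageNumber);
                   }
               }
           }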

          Jeff Stuckman added a comment -

          I'm also affected by this issue, and I'd like to note that a concurrent select and update can cause the undesired behavior, not just a concurrent select and insert.

          Summary:
          Even using READ_COMMITTED, a single non-updatable SELECT and a single UPDATE statement can deadlock against each other when an index includes the updated column.

          My test case uses the following table and index:
           CREATE TABLE urls (urlid INTEGER NOT NULL PRIMARY KEY, url VARCHAR(2048) NOT NULL UNIQUE, site INTEGER, expectation INTEGER, jobflag CHAR DEFAULT 'N');
           CREATE INDEX findurlbysiteandjob ON urls(site,jobflag);

          My test case creates two threads and executes the following statements until they deadlock against each other:
          UPDATE urls SET jobflag=? WHERE urlid=?
          SELECT urlid,url,expectation FROM urls WHERE site=?

          The test eventually deadlocks with the following transaction and lock table
          contents:
          XID TYPE MODE TABLENAME LOCKNAME STATE TABLETYPE LOCKCOUNT INDEXNAME
          2217109 ROW S URLS (13,1) GRANT T 1 FINDURLBYSITEANDJOB
          2217114 ROW X URLS (13,1) WAIT T 0 FINDURLBYSITEANDJOB
          2217113 ROW S URLS (15,1) GRANT T 1 FINDURLBYSITEANDJOB
          2217113 ROW X URLS (3,132) GRANT T 3 null
          2217109 ROW S URLS (3,132) WAIT T 0 null
          2217109 TABLE IS URLS Tablelock GRANT T 2 null
          2217113 TABLE IX URLS Tablelock GRANT T 4 null
          2217114 TABLE IX URLS Tablelock GRANT T 1 null
          2217113 ROW S URLS (6,1) GRANT T 1 SQL081111021116970

          XID GLOBAL_XID USERNAME TYPE STATUS FIRST_INSTANT SQL_TEXT
          2217115 null APP UserTransaction IDLE null select * from SYSCS_DIAG.TRANSACTION_TABLE
          2217114 null APP InternalTransaction ACTIVE null UPDATE urls SET jobflag=? WHERE urlid=?
          2217113 null APP UserTransaction ACTIVE (526,52925) UPDATE urls SET jobflag=? WHERE urlid=?
          2069160 null null SystemTransaction IDLE null null
          2217109 null APP UserTransaction ACTIVE null

          1. The SELECT statement begins to execute and the cursor is stepping through the result set. The results are derived from index FINDURLBYSITEANDJOB as expected.
          2. The UPDATE statement begins to execute. The row to be updated is the row immediately after the SELECT statement's cursor. The row is locked and updated.
          3. The UPDATE statement must modify the index structure (tree rebalancing or similar?). It must lock the row that the SELECT statement's cursor is currently occupying. It cannot do this, so the transaction waits.
          4. The SELECT statement is ready to advance the cursor. However, it cannot advance the cursor because the UPDATE statement has locked the next row. The transaction waits and we have a deadlock.

          Apparently, the only way to avoid this deadlock is to LOCK TABLE before updating.
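
           For reference, a minimal JDBC sketch of that workaround (the table and column names are taken from the example above; whether a full table lock is acceptable for a given application's concurrency is a separate question):

           import java.sql.Connection;
           import java.sql.PreparedStatement;
           import java.sql.SQLException;
           import java.sql.Statement;

           public class LockTableWorkaround {
               // Assumes conn is an open Derby connection.
               static void updateJobFlag(Connection conn, String flag, int urlid)
                       throws SQLException {
                   conn.setAutoCommit(false); // the table lock is held until commit
                   Statement s = conn.createStatement();
                   s.executeUpdate("LOCK TABLE urls IN EXCLUSIVE MODE");
                   PreparedStatement upd =
                       conn.prepareStatement("UPDATE urls SET jobflag=? WHERE urlid=?");
                   upd.setString(1, flag);
                   upd.setInt(2, urlid);
                   upd.executeUpdate();
                   conn.commit();             // releases the table lock
                   upd.close();
                   s.close();
               }
           }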

          Knut Anders Hatlen added a comment -

          Attaching an experimental patch (d2991-preview-1a.diff) to show how
          I'm planning to fix the deadlock.

          The patch makes the B-tree scans save the scan position and release
          the scan lock if they release the latch in the middle of the
          scan. Although the patch doesn't remove the use of scan locks, it does
          make sure that no transaction holds the scan lock on a page without
          holding the latch on the same page, so the scan locks don't have any
          effect anymore.
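
          The ordering the patch enforces can be illustrated with a small, self-contained toy. All names below are invented stand-ins, not Derby's internal API: a TreeMap plays the role of the index and a ReentrantLock the role of the page latch.

          // Toy illustration (invented names, not Derby's internal API): the scan saves
          // its position by key while it still holds the latch, releases the latch (and,
          // with the patch, the scan lock), and later repositions by searching for the
          // saved key.
          import java.util.Map;
          import java.util.TreeMap;
          import java.util.concurrent.locks.ReentrantLock;

          public class SaveAndRepositionSketch {
              private final TreeMap<Integer, String> index = new TreeMap<Integer, String>();
              private final ReentrantLock latch = new ReentrantLock(); // stands in for the page latch
              private Integer savedKey; // scan position remembered while unlatched

              public void readBatch(int startKey, int batchSize) {
                  latch.lock();
                  try {
                      int read = 0;
                      for (Map.Entry<Integer, String> e : index.tailMap(startKey).entrySet()) {
                          if (read++ == batchSize) {
                              break;
                          }
                          savedKey = e.getKey(); // save position by key before giving up the latch
                      }
                  } finally {
                      latch.unlock(); // other transactions may now split or merge pages
                  }
              }

              public String nextAfterSavedPosition() {
                  latch.lock();
                  try {
                      if (savedKey == null) {
                          return null;
                      }
                      // Reposition: first key after the saved one, even if rows have moved.
                      Map.Entry<Integer, String> next = index.higherEntry(savedKey);
                      return next == null ? null : next.getValue();
                  } finally {
                      latch.unlock();
                  }
              }
          }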

          The code to save the position and reposition could be used more or
          less in its current state. Some small adjustments needed to be made:

          • The existing savePosition methods couldn't be used directly since
            one of them assumed that the page was not latched when called, and
            the other one was meant to be called from other transactions that
            needed to tell the scan to save its position if it happened to be
            positioned on a given page. A new variant of savePosition was
            added. I believe that both of the old ones can be removed now,
            since the position will already have been saved and they won't
            have any work to do, but I need to check that.
          • savePosition/reposition needed to save/restore information about
            whether or not the row at the current position was locked (which
            is available from scan_position.current_rh when the scan is
            positioned). This information wasn't need earlier since the scan
            position would only be saved on transactions boundaries, in which
            case the locks would be released.

          The final solution should of course also remove the scan locks and
          include the earlier discussed optimization to prevent unnecessary
          repositioning when no rows have been moved off the page.

          I ran the attached repro (InsertSelectDeadlock.java) and didn't see
          any timeouts when the patch was applied.

          I have also run the full regression test suite. suites.All ran
          cleanly. Derbyall had five failures in the store tests:

          store/RowLockIso.sql
          store/rlliso2multi.sql
          store/rlliso3multi.sql
          store/readlocks.sql
          unit/T_b2i.unit

          The first four of them failed because (a) lock table dumps didn't
          contain scan locks any longer, and (b) some inserts didn't fail with a
          timeout trying to obtain the scan lock. These sound like expected
          changes in behaviour and should probably be fixed by updating the
          canons.

          The last failure (unit/T_b2i.unit) was caused by some debug code that
          is only enabled when running that test. The debug code simulates that
          the latch is lost, but it has not been updated to save the position
          before it releases the latch, so an assert is triggered when we later
          try to reposition. This should be fixed by changing the method
          OpenBTree.test_errors so that it saves the position if needed.

          It would be great if someone could take a look at the patch and verify
          that I'm on the right track and that I'm not missing something
          essential.

          Knut Anders Hatlen added a comment -

          Attaching an updated preview patch (d2991-preview-1b.diff). Now all
          the regression tests run cleanly. Differences from the previous patch:

          • updated store tests so that they didn't expect scan locks in the
            lock table dumps or conflicts on the scan lock
          • ensured that OpenBTree.test_errors() would save the position if
            the condition that is simulated would have resulted in the
            position being saved
          • verified that the existing savePosition methods could be removed,
            as I indicated in the previous comment, and removed the
            methods. This also made it possible to remove the methods from
            interfaces and no-op implementations of these methods in other
            scan types.
          Knut Anders Hatlen added a comment -

          Here's a new update (preview-1c).

          In this patch, I have removed the locking of the scan protection
          handle completely. This also led to the removal of some debug code
          used to simulate wait and/or deadlock when locking the scan, and
          therefore the test cases in T_b2i that expected this debug code to
          run also had to be removed.

          I ran some simple performance tests to get an impression of how the
          performance is affected (used o.a.derbyTesting.perf.clients.Runner).

          For single-record selects on primary key (-load sr_select) the
          throughput was increased by 5-10%. I suppose this improvement is seen
          because we no longer spend any time on obtaining and releasing the
          scan locks. And since there's no saving of position or repositioning
          happening with this load, no extra overhead is introduced.

          For 1000x10000 index joins (-load index_join) the throughput was down
          15% (single-threaded) to 45% (multi-threaded). These operations
          release the latch on the leaf frequently (once every 16 rows in the
          index scans), which would mean that a costly renavigation in the
          B-tree would happen very often. Additionally, the renavigation would
          increase the contention on the root node of the B-tree, which is
          probably why the performance drop is greater in the multi-threaded
          scenario.

          Since some operations may see a significant performance drop, I think
          the previously discussed optimization to prevent unnecessary
          renavigation needs to be implemented before we can consider the patch
          ready for commit.

          The patch is starting to get big, but I'm not sure how to split it up
          in smaller increments without temporarily degrading the performance of
          the trunk. The patch currently removes a lot more code than it adds,
          though, so the size of the patch doesn't tell the whole truth. The
          inserted/deleted ratio is about 1/7, as seen by

          $ svn diff | diffstat | tail -1
          27 files changed, 196 insertions, 1410 deletions

          I'll be away next week and will probably not have much time to work on
          this issue, but I will read my mail and try to answer questions.

          Bryan Pendleton added a comment -

          Thanks for running the performance tests. It seems like the cases
          where you are seeing some performance problems are mostly those
          cases where the deadlocks were occurring. Since a small
          performance penalty is vastly superior to a deadlock, I think it would
          be reasonable to accept that penalty if the deadlocks are greatly
          reduced or (hopefully) eliminated.

          Knut Anders Hatlen added a comment -

          I agree that the overhead would be acceptable if it had (almost) only
          affected those situations where a deadlock could occur, but now
          there's an overhead in other situations as well.

          A B-tree scan may release the latch on the current leaf page in the
          following situations:

          1) When a lock cannot be obtained without waiting

          2) When no more rows should be returned by the scan (current key >
          stop key)

          3) After a bulk fetch (where a bulk contains from 1 row to 16 rows)

          4) When the end of the current leaf has been reached and the next leaf
          has been latched

          With the latest patch, we save the current position by key in (1) and
          (3). In (2) it's not needed because the scan is complete and no
          repositioning will ever be attempted. In (4) it's not needed since
          we're now successfully positioned on the first row on the next page.

          The only situation in which a deadlock can occur is (1), but we
          currently always reposition by renavigating the B-tree after having
          saved the position by key, so the overhead is also seen in
          (3). Unfortunately, almost any query that reads more than 16 rows from
          an index will be hit by this.

          I'm planning to update the experimental patch with the discussed
          optimization so that we don't need to renavigate the B-tree when
          repositioning in the common case. If that's successful, we can revisit
          whether the patch should be split and reviewed/committed in smaller
          pieces. If we have a proof of concept in place that shows that the
          performance issues are solvable, it might be acceptable to commit a
          partial solution that causes performance regressions for some
          operations for a shorter period of time (a week or so) before we
          commit the final patch.

          Kathey Marsden added a comment -

          Thanks Knut for working on this issue. I know it is important to several users. For one of those users, Myrna backported the 1c patch along with the fix for DERBY-2878 to 10.3 to try out (Thanks Myrna). The user reported that it resolved all of their lock timeouts due to this issue and passed their regression tests. So great work Knut! We also have a request to backport this once done all the way to 10.1 for a user locked into that version. So my questions are:

          1) Do you have any kind of general estimate when the final patch might be ready for trunk?
          2) Do you have any reservations or concerns about my trying to backport the fix once complete to older codelines? Do you know if that is even feasible for 10.1?
          3) Is there anything that I can do to help with this issue?

          Thanks

          Kathey

          Mike Matrigali added a comment -

          I am concerned about the performance of always saving the full key. This is going to cause both CPU and memory problems (garbage collection). As originally designed, it was expected that this path would be rare, so no optimization was ever done. The save-by-id path was the one optimized.

          Did you ever consider fixing this by only changing the deadlock case, i.e. always giving up the scan position lock when one has to wait on a lock? This case should be rare compared with the very common case of a group scan, which happens by default for any B-tree scan returning more than 16 rows for one query.

          Sorry for the late comment; I was out on vacation when you started on this.

          Mike Matrigali added a comment -

          After reviewing the current patch, this does not look like a change appropriate for a backport. It is already 5000 lines of diffs and changes a basic concurrency building block of the B-tree code. Getting rid of the scan lock, if it can be done with little or no performance penalty, does look like a good feature for a major release. It will simplify a lot of code, but I would like to see a lot of testing before it gets into an official Derby release.

          I do agree this is a good direction. Some of the code that originally depended on the scan protection lock no longer needs to, as a result of work that was later done to support the read uncommitted isolation level. Off the top of my head, my areas of concern would be that the following operations work correctly in all combinations without the concurrency protection provided by the scan protection lock (I think they are OK as they should be protected by latches, but I just get worried removing locking from these operations):
          split
          merge
          reclaim deleted rows

          Another area that might be worth thinking about is to make sure the code that gets previous-key locks for serializable isolation is still right without the scan protection lock. We need to make sure an intervening split, which previously was not possible, does not break this code.

          This stuff is hard to test for, I will think about these operations and see if there is any hidden dependency or if they only got the scan protection lock to enable the scan optimization.

          I definitely do worry about having to copy around the full key every time one gives up the latch. Given the current store interface we can't keep references so we have to allocate
          objects. This in the worst case can lead to allocations/copies for every index reference in a query and can quickly add up, which was why the additional complication of the scan lock was added in the first place. It would be interesting to understand the performance overhead of the copy vs. the extra search. As I understand it, the proposed optimization would, in the "usual" case (as long as the page remained in memory), eliminate the scan but would not eliminate the copy. Probably the worst case is long keys (maybe multi-part) and maybe datatypes that require object allocations every time they are copied (like decimal).

          Knut Anders Hatlen added a comment -

          Thanks to all of you for looking at this.

          To Kathey's questions, I will get back to working on this issue next
          week and hope to have optimized the patch further and tested it some
          more by the end of that week. It's difficult to say when a fix can be
          committed to trunk, as it depends on whether we find the suggested
          approach acceptable or if we need to find another way to fix it. And
          as Mike mentioned more testing is needed in any case. I do hope we can
          come up with an approach everyone feels comfortable with in the next
          couple of weeks before the holiday season, and then get it implemented
          and properly tested in January. I agree with Mike though, that it's
          probably too risky to backport the fix, unless we end up with
          something simpler than the current proposal. Not that I'm planning to
          introduce any bugs, but since it changes some fundamental assumptions
          in a critical part of the system, I wouldn't feel all that comfortable
          with backporting it.

          To Mike's questions:

          Yes, I think it is possible to only save the position in situations
          that could cause deadlocks. That would mean that we only save the
          position if we need to wait for a lock. One possible complication is
          that it's not necessarily the scan that's waiting for a lock that's
          holding the scan lock that causes the deadlock. Therefore we would
          need to save the position in all the open B-tree scans in the
          transaction when a lock cannot be granted immediately. And this must
          be done for all locks that we wait on, not only locks in B-tree scans,
          so it will add complexity to code outside the B-tree code as
          well. Given that this increases the complexity and removing the scan
          locks reduces the complexity, I'd prefer a solution where the scan
          locks are removed completely if it can be done with acceptable
          performance.
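
          For comparison, the alternative discussed above (saving the position of every open scan only when the transaction is about to wait for a lock) might be sketched roughly as follows. All types and method names here are invented for illustration and do not correspond to Derby's lock manager or scan interfaces.

          // Illustration only: when a lock request has to wait, every open B-tree scan
          // in the transaction first saves its position by key, so no scan holds a
          // position that another transaction's index split could deadlock against.
          import java.util.ArrayList;
          import java.util.List;

          public class SaveOnLockWaitSketch {

              interface OpenScan {
                  void saveCurrentPositionByKey();
              }

              static class Transaction {
                  private final List<OpenScan> openScans = new ArrayList<OpenScan>();

                  void register(OpenScan scan) {
                      openScans.add(scan);
                  }

                  /** Called just before this transaction blocks waiting for a lock. */
                  void aboutToWaitForLock() {
                      for (OpenScan scan : openScans) {
                          scan.saveCurrentPositionByKey();
                      }
                  }
              }

              public static void main(String[] args) {
                  Transaction txn = new Transaction();
                  txn.register(new OpenScan() {
                      public void saveCurrentPositionByKey() {
                          System.out.println("scan position saved before waiting");
                      }
                  });
                  txn.aboutToWaitForLock(); // would be triggered by the lock manager
              }
          }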

          As to the performance, I hope that it will be possible to measure the
          actual cost of copying the key once I have removed the expensive and
          unnecessary renavigation. You are right that the suggested
          optimization will only save the renavigation, not the copying. Some
          thoughts/guesses:

          • The current code that saves the key always allocates a new
            DataValueDescriptor array and populates it with empty
            DataValueDescriptors of the correct type. It shouldn't be
            necessary to do this more than once per scan. That could save much
            allocation/gc work.
          • When we save the key, we retrieve it by calling fetchFromSlot(),
            which will deserialize the value of the key from the page. I think
            that this is causing the same work to be performed twice when we
            save the position in methods like BTreeForwardScan.fetchRows(), at
            least when the scan is returning the full key, and we could
            probably make this more efficient.
          • What we save by not obtaining/releasing the scan lock includes 2-3
            lookups in a global hashtable per leaf page, which could give an
            extra benefit in a multi-threaded scenario.
          • It's not clear to me that the size of the key is significant. The
            cost of the scan lock is constant per page. Since a bigger key
            means fewer keys per page, the cost of copying keys should also be
            more or less constant per page. So the ratio between the cost
            saved and the cost added may not be that much affected by the size
            of the key.
          • One of the worst cases is probably where extra qualification of
            the row is needed. Like "SELECT index_column FROM t WHERE
            length(index_column) = 10", in which case I think we'll release
            the latch, and therefore also save the key, for every row in the
            index.

          The suggestions for new tests look reasonable to me. The current
          regression tests probably don't test that the current position has
          been moved to another page while we didn't hold the latch, since those
          situations would have resulted in a lock timeout before. So we
          basically need to have tests for split/merge/reclaim for every call to
          reposition() in the code. Will need to think more about how to do
          that. Thanks for raising the concern for the previous key locking as
          well.

          Knut Anders Hatlen added a comment -

          Attaching a new preview patch (1d) that implements the optimization
          which prevents the expensive repositioning when no rows have been
          moved off the page. I haven't done any serious testing of the
          correctness in all scenarios; I'm just posting it so that we can get
          more information about the performance impact of saving the position.

          As an initial test, I reran the index-join test that I mentioned
          earlier and that showed a significant decrease in performance with the
          preview-1c patch. With the preview-1d patch, I'm not able to see any
          decrease (the numbers are actually slightly better with the 1d patch
          than with trunk, though I'm not sure whether it's significant). Below
          are the results for 1, 2, 4, 8 and 16 concurrent threads for the 1d
          patch (rows prefixed with 'd2991') and trunk (rows prefixed with
          'trunk'). The AVG_TPS column shows the number of transactions per
          second, all configurations were run 30 times (each test run had 30 sec
          warmup and 45 sec steady state) and the AVG_TPS column shows the
          average from the 30 runs.

          JAR |THREADS |AVG_TPS
          ---------------------------------------------
          d2991 |1 |136.3599695946353
          trunk |1 |136.0475449900022
          d2991 |2 |265.9779264453826
          trunk |2 |262.8875064800415
          d2991 |4 |262.95638006368955
          trunk |4 |255.46471154558247
          d2991 |8 |256.97232845038025
          trunk |8 |252.92527586462265
          d2991 |16 |256.84847624018147
          trunk |16 |254.06894019626188

          I will see if I can write some more specific tests that test the other
          possible worst-case scenarios suggested in earlier comments.

          What the patch does is that it saves the version of the page in the
          BTreeRowPosition object when it saves the position, and each time rows
          can be moved off the page (in operations that previously acquired an
          exclusive scan lock) this is flagged in the in-memory representation
          of the page. BTreeScan.reposition() will first match the version in
          BTreeRowPosition with the flagged version number on the page, and will
          only reposition by key if the page has been flagged after the position
          was saved. Pages are automatically flagged with the current version
          when they are fetched into the cache, so we'll always reposition by
          key if the page has been evicted from the cache after we saved the
          position.
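
          The version check can be shown with a small self-contained toy. The field and method names below are invented; the real patch keeps this information in BTreeRowPosition and in the cached page, as described above.

          // Toy illustration of the reposition shortcut: the scan records the page's
          // version when it saves its position; the page is "flagged" with a new version
          // whenever rows may have moved off it (and when it is read into the cache), so
          // a full search by key is only needed when the two versions no longer match.
          public class RepositionByVersionSketch {

              static class LeafPage {
                  long lastRowMoveVersion; // bumped whenever rows may have moved off the page
                  long currentVersion;

                  void rowsMovedOff() {
                      currentVersion++;
                      lastRowMoveVersion = currentVersion;
                  }
              }

              static class SavedPosition {
                  long pageVersionWhenSaved;
                  int slot; // slot on the page where the scan stopped
              }

              /** Returns true if the cheap reposition (reuse the saved slot) is safe. */
              static boolean canRepositionBySlot(LeafPage page, SavedPosition pos) {
                  // If nothing has been moved off the page since the position was saved,
                  // the saved slot is still valid and no search by key is required.
                  return page.lastRowMoveVersion <= pos.pageVersionWhenSaved;
              }

              public static void main(String[] args) {
                  LeafPage page = new LeafPage();
                  SavedPosition pos = new SavedPosition();
                  pos.pageVersionWhenSaved = page.currentVersion;
                  pos.slot = 7;
                  System.out.println("cheap reposition ok: " + canRepositionBySlot(page, pos)); // true
                  page.rowsMovedOff();
                  System.out.println("cheap reposition ok: " + canRepositionBySlot(page, pos)); // false
              }
          }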

          The patch also makes savePosition() reuse the DataValueDescriptor
          object in which the key is stored, so that it's only allocated once
          per scan.

          Currently, the patch does not handle the case where the page has
          disappeared when we reposition (only happens with holdable cursors
          after a commit followed by truncate table, I think) or the case where
          the leaf page has turned into a branch page while we waited. These
          problems led to four failures in derbyall and four failures in
          suites.All.

          Knut Anders Hatlen added a comment -

          Here's a small performance test that I wrote. It extends the JDBCPerfTestCase class in the JUnit framework. All the tests use a table with 1000 rows. The table has 13 columns: one VARCHAR(10), one VARCHAR(100), one VARCHAR(1000) and ten DECIMAL(10,10). Each VARCHAR column has its own index, one compound index covers all the VARCHAR columns, there's an index on one of the DECIMAL columns, and there's one covering all the DECIMAL columns. For each index, there's a test case that simply runs through the entire index.
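
          As a rough sketch, the kind of schema described might be created as follows. The table, column, and index names are invented for illustration and may differ from the committed test; only the column types and the set of indexes are taken from the description above.

          // Hypothetical reconstruction of the described schema (names invented).
          import java.sql.Connection;
          import java.sql.DriverManager;
          import java.sql.Statement;

          public class IndexScanPerfSchema {
              public static void main(String[] args) throws Exception {
                  Class.forName("org.apache.derby.jdbc.EmbeddedDriver");
                  Connection conn = DriverManager.getConnection("jdbc:derby:perfdb;create=true");
                  Statement s = conn.createStatement();

                  StringBuilder ddl = new StringBuilder(
                      "CREATE TABLE t (vc10 VARCHAR(10), vc100 VARCHAR(100), vc1000 VARCHAR(1000)");
                  for (int i = 0; i < 10; i++) {
                      ddl.append(", dec").append(i).append(" DECIMAL(10,10)");
                  }
                  ddl.append(")");
                  s.execute(ddl.toString());

                  // one index per VARCHAR column, plus one covering all of them
                  s.execute("CREATE INDEX ix_vc10 ON t(vc10)");
                  s.execute("CREATE INDEX ix_vc100 ON t(vc100)");
                  s.execute("CREATE INDEX ix_vc1000 ON t(vc1000)");
                  s.execute("CREATE INDEX ix_vc_all ON t(vc10, vc100, vc1000)");

                  // one index on a single DECIMAL column, plus one covering all ten
                  s.execute("CREATE INDEX ix_dec0 ON t(dec0)");
                  StringBuilder decAll = new StringBuilder("CREATE INDEX ix_dec_all ON t(dec0");
                  for (int i = 1; i < 10; i++) {
                      decAll.append(", dec").append(i);
                  }
                  decAll.append(")");
                  s.execute(decAll.toString());

                  s.close();
                  conn.close();
              }
          }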

          After a couple of repeated runs I can't say that saving the position looks more expensive than locking the scan. Actually, the tests appear to perform slightly better when the position is saved, at least that's what it looks like in my environment (OpenSolaris 2008.11, Java 1.6.0_10). (Except that in some cases it fails because the JUnit framework compresses all the tables before starting the test, and the problem with pages changing from leaf to branch mentioned in my previous comment surfaces.) I'll need to run the tests more in order to say anything for sure. I will do that and report back.

          Does anyone have suggestions for any other performance tests we should run to check if this is an appropriate path to follow?

          Mike Matrigali added a comment -

          that's great news on initial performance comparisons.

          I am not sure if the following will do better, but it is what I thought of. It seems like the worst case would be cases where we move the scan a lot without having the latch on the page, i.e. the case where we get one row at a time from the page. So I would suggest using the bulk fetch property to only get 1 row at a time:
          CALL SYSCS_UTIL.SYSCS_SET_DATABASE_PROPERTY('derby.language.bulkFetchDefault','1');

          1000 rows does not seem like enough data to do a test. One of the problems with the new approach, especially with something like decimal, is that there is going to be extra object allocation per row. So you want enough rows to see what effect it might have on GC. I usually tried to make sure a performance test took at least a second, and maybe 60 seconds
          for at least a single run.

          Is a read only test affected by pages going out of cache? From your description I don't think so, unless you get very strange cache behavior in that the scan's current page goes out of cache during the scan of that page. But it might be worth showing that both approaches perform the same on a scan of a
          table that is bigger than the cache when the scan is executed multiple times in a row.

          Should there be something that exercises the worst case where the page is updated? Again, for this, having a bigger dataset will cause more overhead for the
          search. For this, maybe something like a 2-part key, with the first part being as in the current test
          and the second part being an int, and the test runs a cursor through the table and, after reading each index row, updates the second part of the key by minus one, likely leaving the key on the same page - but not putting the scan into an infinite loop. Make sure the cursor choice is one which goes to the table every time, not one that caches the
          values in the client somehow.
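
          A hedged sketch of the kind of test being suggested here: bulk fetch disabled, an updatable cursor walking the index and decrementing the second key part of each row. The SYSCS_SET_DATABASE_PROPERTY call and the bulkFetchDefault property come from the comment above; the table, column, and index names are invented, and whether the index is actually used depends on the chosen plan.

          // Worst-case update scan sketch: one row per fetch, each row's second key
          // part decremented through an updatable cursor.
          import java.sql.Connection;
          import java.sql.DriverManager;
          import java.sql.ResultSet;
          import java.sql.Statement;

          public class WorstCaseUpdateScan {
              public static void main(String[] args) throws Exception {
                  Class.forName("org.apache.derby.jdbc.EmbeddedDriver");
                  Connection conn = DriverManager.getConnection("jdbc:derby:perfdb");

                  Statement setProp = conn.createStatement();
                  // Fetch one row at a time so the scan gives up the latch after every row.
                  setProp.execute("CALL SYSCS_UTIL.SYSCS_SET_DATABASE_PROPERTY("
                      + "'derby.language.bulkFetchDefault', '1')");
                  setProp.close();

                  conn.setAutoCommit(false);
                  Statement s = conn.createStatement(
                      ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_UPDATABLE);
                  // Assumes an index on (part1, part2); decrementing part2 typically keeps
                  // the row on the same page without putting the scan into an infinite loop.
                  ResultSet rs = s.executeQuery(
                      "SELECT part1, part2 FROM t FOR UPDATE OF part2");
                  while (rs.next()) {
                      rs.updateInt("part2", rs.getInt("part2") - 1);
                      rs.updateRow();
                  }
                  rs.close();
                  conn.commit();
                  s.close();
                  conn.close();
              }
          }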

          Knut Anders Hatlen added a comment -

          I had the JUnit test running repeatedly over the weekend, 300 times
          for each configuration to get a more reliable average. I've summarized
          the results in the table below. Positive values in the increase column
          mean that trunk is faster; negative values mean that the patched
          version is faster.

          TEST |TRUNK_TIME |PATCHED_TIME |INCREASE |INCREASE_PERCENT
          -------------------------------------------------------------------------------
          decimal10columns |17752 |17761 |9 |0.0506985128436277
          decimal1column |4875 |4842 |-33 |-0.676923076923075
          varchar10 |4586 |4517 |-69 |-1.5045791539467945
          varchar100 |7288 |7185 |-103 |-1.4132821075740987
          varchar1000 |35977 |37008 |1031 |2.865719765405683
          varcharAll |41502 |42701 |1199 |2.8890173967519583

          Not much difference between them. For smaller keys, it seems like
          saving the position is slightly cheaper, whereas for larger keys it
          appears to be slightly more expensive.

          Knut Anders Hatlen added a comment -

          Thanks for your comments, Mike.

          > I am not sure if the following will do better, but it is what I
          > thought of. It seems like the worst case would be cases where we
          > move the scan a lot without having latch on page. ie. case where we
          > get one row at a time from the page. So I would suggest using the
          > bulk fetch property to only get 1 row at a time: CALL
          > SYSCS_UTIL.SYSCS_SET_DATABASE_PROPERTY('derby.language.bulkFetchDefault','1');

          Thanks, I wasn't aware of that property. I'll do some experiments with
          it.

          > 1000 rows does not seem like enough data to do a test. One of the
          > problems with the new approach, especially with something like
          > decimal is that there is going to be extra object allocation per
          > row. So you want enough rows to see what effect it might have on
          > GC.

          I picked 1000 rows to make sure that we had at least a couple of full
          index pages in each of the test cases. I assumed that having multiple
          iterations on a smaller number of rows would have about the same
          effect on GC as fewer iterations on a larger data set.

          > I usually tried to make sure performance test took at least a
          > second, and maybe 60 seconds for at least a one time run.

          The tests mentioned in my previous comment had a fixed number of
          iterations, so the total time for each test case varied between 20 and
          150 seconds depending on key size. (The times in milliseconds in the
          table must be multiplied by four to get the actual time spent because
          of the way the framework runs the tests.) I can also try some longer
          runs.

          > Is a read only test affected by pages going out of cache? From your
          > description I don't think so unless you get very strange cache
          > behavior in that the scan current page goes out of cache during the
          > scan of that page.

          That's correct. Read only tests should not be affected by pages going
          out of the cache. Even if the current page goes out while we don't
          hold the latch, they shouldn't be affected unless there has been a
          modification on the page after the latch was released. In that case
          the page will have the dirty flag in addition to the recently-used
          flag, so it's highly unlikely that it will be chosen for eviction.

          > But might be worth showing that a both approaches
          > perform same on a scan of a table that is bigger than the cache when
          > the scan is executed multiple times in a row.

          Good point. I'll run such a test.

          > Should there be something that exercises the worst case where the
          > page is updated? Again for this having a bigger dataset wil cause
          > more overhead for the search. For this maybe something like a 2 part
          > key with first part being as the current test and the second part
          > being an int and the test runs a cursor through the table and after
          > reading each index row then updates the second part of the key by
          > minus one, likely leaving the key on the same page - but not putting
          > the scan into an infinite loop. Make sure the cursor choice is one
          > which goes to the table every time, not one that caches the values
          > in the client somehow.

          I think I understand what you're suggesting, but a couple of questions
          just to make sure that I'm not missing the point:

          • the purpose of this test is to check that we don't perform any
            repositioning by key just because there has been an update on the
            page?
          • by making sure that the client doesn't cache the values, you
            basically mean that bulk fetch should be disabled? So setting
            derby.language.bulkFetchDefault or perhaps using an updatable
            cursor and positioned updates should do the trick?
          Knut Anders Hatlen added a comment -

          I checked in the performance test with revision 729039.

          Knut Anders Hatlen added a comment -

          Here are the results from running the same tests with derby.language.bulkFetchDefault=1 (20 runs with each configuration):

          TEST |TRUNK_TIME |PATCHED_TIME |INCREASE |INC_PERCENT
          -------------------------------------------------------------------------------
          decimal10columns |28390 |33116 |4726 |16.646706586826348
          decimal1column |16638 |17997 |1359 |8.168049044356293
          varchar10 |15062 |16426 |1364 |9.055902270614792
          varchar100 |18655 |21829 |3174 |17.014205306888233
          varchar1000 |47123 |70214 |23091 |49.001549137363924
          varcharAll |52362 |78693 |26331 |50.286467285436

          When bulk fetch is disabled, saving the position by key is clearly more expensive for these scans. With small keys, the extra cost appears to be moderate, but with large keys it is quite high.
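
          For reference, one way to set this property is through the documented
          SYSCS_UTIL.SYSCS_SET_DATABASE_PROPERTY procedure, as suggested earlier in this
          discussion. A minimal JDBC sketch of disabling bulk fetch this way (the
          connection URL and database name are only illustrative):

          import java.sql.CallableStatement;
          import java.sql.Connection;
          import java.sql.DriverManager;
          import java.sql.SQLException;

          public class DisableBulkFetch {
              public static void main(String[] args) throws SQLException {
                  try (Connection c = DriverManager.getConnection("jdbc:derby:perfdb;create=true");
                       CallableStatement cs = c.prepareCall(
                               "CALL SYSCS_UTIL.SYSCS_SET_DATABASE_PROPERTY(?, ?)")) {
                      cs.setString(1, "derby.language.bulkFetchDefault");
                      cs.setString(2, "1"); // fetch one row at a time, as in the runs above
                      cs.execute();
                  }
              }
          }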

          Knut Anders Hatlen added a comment -

          In BTreeForwardScan.fetchRows() we have already fetched the key (or
          parts of the key) before we save the position. So I experimented with
          copying the key instead of re-reading the entire record. The copying
          was performed by iterating over fetch_row and calling setValue() on
          each element of scan_position.current_positionKey. This eliminated
          much of the overhead. A rerun of the JUnit test with
          derby.language.bulkFetchDefault=1 gave these results (average of 20
          runs for each configuration):

          TEST             | TRUNK_TIME | PATCHED_TIME | INCREASE | INC_PERCENT
          ----------------------------------------------------------------------
          decimal10columns |      28427 |        29727 |     1300 |   4.5731173
          decimal1column   |      16572 |        16890 |      318 |   1.9188993
          varchar10        |      14715 |        15673 |      958 |   6.5103636
          varchar100       |      18378 |        19472 |     1094 |   5.9527693
          varchar1000      |      47092 |        51618 |     4526 |   9.610974
          varcharAll       |      52457 |        57628 |     5171 |   9.857598

          This will of course only work if the entire key is fetched by the
          scan. A proper solution would have to read the missing parts of the
          key if only parts of it are used by the scan.
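
          To make the experiment concrete, here is a minimal sketch of the copy loop,
          using the names from the description above (fetch_row and
          scan_position.current_positionKey) and assuming both arrays contain matching
          types; the wrapping class and method are hypothetical and not part of any patch:

          import org.apache.derby.iapi.error.StandardException;
          import org.apache.derby.iapi.types.DataValueDescriptor;

          class PositionKeyCopySketch {
              // Copy the key columns the scan has already fetched into the saved
              // position, instead of re-reading the record from the page.
              static void saveByCopy(DataValueDescriptor[] fetch_row,
                                     DataValueDescriptor[] current_positionKey)
                      throws StandardException {
                  for (int i = 0; i < fetch_row.length; i++) {
                      // setValue() copies from the source DVD (for character types it
                      // ends up calling getString(), as noted later in this discussion).
                      current_positionKey[i].setValue(fetch_row[i]);
                  }
              }
          }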

          Knut Anders Hatlen added a comment -

          I have taken another look at the previous key locking. If the latches
          are released when we're searching backward for the previous key, it
          seems like we'll rescan the tree, so it doesn't seem like we're
          depending on the scan lock. Some relevant code fragments and comments:

          • BTreeController.doIns():

          while (true)
          {
          .
          .
          .
          // Row locking - first lock row previous to row being inserted:
          .
          .
          .
          if (latch_released)
          {
              // Had to release latch in order to get the lock, probably
              // because of a forward scanner, research tree, and try again.
              targetleaf = null;
              continue;
          }
          • B2IRowLocking3._lockScanRow():

          if (pos.current_slot == 0)
          {
          // this call will take care of searching left in the btree
          // to find the previous row to lock, 0 is the control row and
          // not a valid thing to lock as a previous key.

          // it is ok to call the non-scan as this is just a special
          // case of a previous key lock call. The only scan code that
          // will call this routine with slot == 0 will retry if this
          // routine returns that a latch was released.

          latch_released =
          !lockNonScanPreviousRow(

          • B2IRowLocking3.lockNonScanPreviousRow():

          // RESOLVE RLL (mikem) - NO RECORD_ID PROTECTION IN EFFECT.
          // caller must research, get new locks if this routine
          // releases latches.
          ret_status = this.searchLeftAndLockPreviousKey(

          • B2IRowLocking3.searchLeftAndLockPreviousKey():
          // If along the search a latch has to be waited on then latches are
          // released and a wait is performed, and "false" status is returned to
          // caller. In this case the routine can no longer be sure of it's current
          // position and may have to retry the whole operation.
          Mike Matrigali added a comment -

          Sorry for the late reply; I was out and away from the computer at the end of the year.
          > I think I understand what you're suggesting, but a couple of questions
          > just to make sure that I'm not missing the point:
          >
          > - the purpose of this test is to check that we don't perform any
          > repositioning by key just because there has been an update on the
          > page?
          Yes. I must be misunderstanding the current patch; I thought it would
          research on an update of the page. It is probably worth measuring, but not a lot
          of different cases are needed if the design does not expect a research. I was trying
          to force a measurement of the worst case, where a search would happen after every
          row.
          >
          > - by making sure that the client doesn't cache the values, you
          > basically mean that bulk fetch should be disabled? So setting
          > derby.language.bulkFetchDefault or perhaps using an updatable
          > cursor and positioned updates should do the trick?
          My comment was about making the tree bigger. Not necessarily to avoid caching
          the tree, but with more levels in the tree it will cost
          more to research the tree from the top. Again, I just wanted to know the
          worst-case penalty. Maybe you could add a short comment about in which
          cases we will research the tree with the new scheme vs. the old scheme.

          I think the test I was suggesting could be done with an updatable cursor and
          positioned updates, as long as the properties of the cursor are such that
          each next goes to the database. There are some cursor settings where
          caching will happen in the client, and then nothing you do to the database after
          each row fetch will get you into the btree research code that we are trying
          to exercise.
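
          A hedged JDBC sketch of the kind of test described above, assuming a two-part
          index key (k1, k2) as discussed earlier in this issue; the table and column
          names are made up, and this is not the actual test code:

          import java.sql.Connection;
          import java.sql.DriverManager;
          import java.sql.ResultSet;
          import java.sql.SQLException;
          import java.sql.Statement;

          public class WorstCaseScanSketch {
              public static void main(String[] args) throws SQLException {
                  try (Connection c = DriverManager.getConnection("jdbc:derby:perfdb")) {
                      c.setAutoCommit(false);
                      // Forward-only updatable cursor: each next() goes to the engine
                      // instead of being served from a client-side cache.
                      try (Statement s = c.createStatement(
                                   ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_UPDATABLE);
                           ResultSet rs = s.executeQuery(
                                   "SELECT k1, k2 FROM t FOR UPDATE OF k2")) {
                          while (rs.next()) {
                              // Decrement the second key column after reading each row;
                              // the key most likely stays on the same page.
                              rs.updateInt(2, rs.getInt(2) - 1);
                              rs.updateRow();
                          }
                      }
                      c.commit();
                  }
              }
          }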

          Mike Matrigali added a comment -

          Concerning the latest performance results, I am still concerned, as I believe the bulkfetch=1 case simulates the path that nested loop joins take, so it may affect a lot of normal query behavior. But your latest optimization looks promising; I would not be surprised if
          there were some optimization that could be made in the copying of one DataValueDescriptor to another.

          Do you know whether, in the char/varchar case, the source DVD is already in String form (vs. a raw character array)?

          Knut Anders Hatlen added a comment -

          Thanks for clarifying the comments about the tests, Mike. I'll make
          the modifications and come back with the results.

          > Maybe you could add a short comment about in which cases we will
          > research the tree with the new scheme vs. the old scheme.

          The old scheme (current Derby) will research the tree in these
          situations:

          a) Locks (scan lock, row lock or prev key lock) could not be obtained
          immediately when positioning at the start of the scan

          b) Transaction was committed after we released the latch (then
          there's no scan lock protecting the position anymore)

          c) Some other operation within the same transaction caused a split
          on the page we were positioned on after we released the latch
          (scan lock doesn't protect us against changes in the same
          transaction)

          The new scheme will research the tree in situation (a) above, and
          additionally:

          d) A row was moved off the page after we released the latch (caused
          by split or purge)

          e) The current page was updated (any update, as long as the page
          version was changed) and evicted from the page cache after we
          released the latch

          > Do you know whether, in the char/varchar case, the source DVD is already
          > in String form (vs. a raw character array)?

          It is still in raw form when the value is copied, but it will be
          converted to string form because SQLChar.setFrom(DVD) calls getString() on the
          DVD. In the tests I've run till now, it shouldn't matter since we'll
          end up calling DVD.getString() from EmbedRS.getString() on all the
          strings anyway, and the two SQLChar instances share the same immutable
          String instance. In other cases it could matter, so it would be better
          if we could share the char array between the two SQLChar objects and
          avoid the allocation of the String. Not sure if this is safe,
          though. SQLChar treats the raw char array as an immutable data type in
          most cases, but not in normalize().

          It's the same with the DECIMAL tests. The source DVD is in raw form
          (byte[] + int) and is converted to a BigDecimal that's shared between
          the two SQLDecimal instances.

          The cheapest way to copy the key would probably be to have a method in
          the Page interface that could copy the raw row to a byte array. We
          would only have to deserialize the key if repositioning was needed,
          and we would only need one extra object allocation per scan in the
          normal case (unless the byte buffer we allocated the first time was
          too small). Is that an option? The methods to serialize and
          deserialize DVDs are public methods of all DVDs (writeExternal() and
          readExternal()), so it would not really be exposing
          implementation details in interface methods.
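
          As a simple illustration of the writeExternal()/readExternal() round trip for a
          single key column (this is plain java.io usage with an arbitrarily chosen column
          type, not the proposed Page interface method):

          import java.io.ByteArrayInputStream;
          import java.io.ByteArrayOutputStream;
          import java.io.ObjectInputStream;
          import java.io.ObjectOutputStream;

          import org.apache.derby.iapi.types.DataValueDescriptor;
          import org.apache.derby.iapi.types.SQLVarchar;

          public class KeySerializationSketch {
              public static void main(String[] args) throws Exception {
                  DataValueDescriptor source = new SQLVarchar("some key value");

                  // Serialize the key column to a byte array when the position is saved.
                  ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                  try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                      source.writeExternal(out);
                  }

                  // Deserialize it only if repositioning by key turns out to be needed.
                  DataValueDescriptor copy = new SQLVarchar();
                  try (ObjectInputStream in = new ObjectInputStream(
                          new ByteArrayInputStream(buffer.toByteArray()))) {
                      copy.readExternal(in);
                  }
              }
          }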

          Knut Anders Hatlen added a comment -

          Thinking more about it, copying the raw bytes from the page array would probably only be cheaper if the materialized values created by copying the keys were not going to be used later. So in the tests that I have run till now it shouldn't be any cheaper at all, since the objects we allocate when we copy the key would have been allocated later anyway. Might be worth a try if everything else fails, though.

          Knut Anders Hatlen added a comment -

          There must be something wrong with the test results I posted on
          27/Dec/08.

          First of all, they don't really make any sense. Why should the
          overhead of saving the position be greater for VARCHAR(1000) than for
          VARCHAR(10)? Since the String object that we generate when we copy the
          key is shared between the source DVD and the target DVD, and the
          source DVD saves the same amount of work when it later needs to return
          the String to the user, the overhead per row should be proportional to
          the number of columns in the key, not to the size of the columns in
          the key. And since reading longer values is more expensive than
          reading shorter values, the relative overhead should be smaller rather
          than greater for VARCHAR(1000).

          Secondly, I see the exact same results when I compare a clean trunk
          with another clean trunk. For some reason, the runs (1, 3, 5, ...)
          consistently show significantly better performance than the runs (2,
          4, 6, ...). But it turns out that if I remove the database directory
          between each run, I get much more stable results. I've got no idea why
          this happens. Perhaps it is something CleanDatabaseTestSetup or some other
          part of the test framework does?

          Anyway, no matter what's causing it, I'll need to rerun the tests and
          let each test run create its own database. It's probably also a good
          idea to randomize the order of the tests to eliminate interference from
          periodic/alternating factors. In the previous test runs, I ran clean,
          patched, clean, patched, and so on, so that all the bad runs were with
          patched jars and all the good runs were with clean jars. Perhaps I'll
          also write a simple standalone test, so that we don't need to worry
          about what happens in the test framework.

          Knut Anders Hatlen added a comment -

          Had the tests with bulk-fetch disabled running over the weekend (200
          times with a clean trunk, 200 times patched). Now I used a standalone
          test which did the same as the JUnit test (except for some adjustment
          of the number of iterations to get the time more evenly distributed
          between the test cases, and the size of the test table was increased
          from 1000 rows to 10000 rows), and I randomized the order of the
          tests. Now the test results look much more reasonable.

          The test case with a key consisting of ten DECIMAL columns shows the
          worst performance, with a 1.9% increase in the time spent. The test
          cases with VARCHAR(10) and VARCHAR(100) showed 1.6% and 0.9% increase
          in time spent, respectively. The other test cases basically showed the
          same performance for clean jars and patched jars. The table below
          shows the average numbers (times in milliseconds) for the test runs.

          NAME                | TRUNK_TIME | PATCHED_TIME | INCREASE | INC_PERCENT
          --------------------------------------------------------------------------
          Decimal10Columns    |      81495 |        83025 |     1530 |   1.8774158
          DecimalSingleColumn |      63722 |        63769 |       47 |  0.07375789
          Varchar0010         |      54268 |        55145 |      877 |   1.6160537
          Varchar0100         |      70907 |        71576 |      669 |   0.9434894
          Varchar1000         |      87431 |        87457 |       26 | 0.029737735
          VarcharAll          |      97197 |        96845 |     -352 |  -0.3621511

          Now that I've got test results that I understand, I'll go on and add
          the test cases Mike suggested. If we don't find any common code path
          that's slowed down more than what the latest test run showed, I would
          be inclined to say the overhead is acceptable, given that it fixes a
          serious issue and that other common code paths are actually getting
          better performance. Investigation of how to reduce the overhead
          further could be postponed until later. But let me first see how the
          other suggested test cases are affected.

          Knut Anders Hatlen added a comment -

          Updating the preview patch (1e). There are two changes from the previous preview:

          1) If the leaf page on which we're positioned is transformed into a branch page while we don't hold the latch, we reposition by key. This fixes eight regression test failures (ClassCastExceptions) seen with 1d. FWIW, derbyall and suites.All now run without failures.

          2) Copy the (possibly partial) key fetched in BTreeForwardScan when saving the position, instead of re-reading the full key from the page. If the scan just fetches a partial key, we fetch the missing parts only from the page.

          This is not the same patch as I used in the latest performance test runs. That patch copied the key as if the full key had been fetched by the scan. However, since the tests only scan the index and don't go to the base table, the scans don't fetch the RowLocation which is the last column in the index. So when we save the position we still need to call Page.fetchFromSlot() to get the RowLocation. So now the overhead with bulk fetch disabled is back at 1% (for large key columns) to 10% (small/many key columns). I'm posting the patch for reference anyway.

          Knut Anders Hatlen added a comment -

          I've tried to find out which effect the proposed solution has on
          nested loop joins. The assumption was earlier that nested loop joins
          would turn off bulk fetching and therefore the index scans with
          derby.language.bulkFetchDefault=1 would give a good impression of the
          overhead imposed by the patch.

          This turns out not to be the case. Nested loop joins do in fact use
          the default bulk fetch size (16) both for accesses to the outer table
          and to the inner table. The inner table/index is accessed through many
          shorter scans that frequently return fewer rows than the bulk fetch
          size. But even if that causes frequent release of latches, it doesn't
          cause copying of the current key, since the scan is done when the
          latch is released, and therefore we don't need the position
          anymore. So the nested loop joins will actually get much of the
          performance benefit observed in single-record primary key selects
          mentioned when the preview-1c patch was uploaded.

          I inserted some optimizer overrides in the index_join test in
          org.apache.derbyTesting.perf.clients.Runner to make the optimizer pick
          a nested loop join. With the modified test, the throughput appeared to
          increase by about 6% when the patch was applied (average of 30 runs
          with each configuration).
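
          The exact overrides used in the modified test are not shown here, but a Derby
          optimizer override that forces a nested loop join looks roughly like the sketch
          below; the table and column names are made up:

          import java.sql.Connection;
          import java.sql.DriverManager;
          import java.sql.ResultSet;
          import java.sql.Statement;

          public class NestedLoopOverrideSketch {
              public static void main(String[] args) throws Exception {
                  // The --DERBY-PROPERTIES override attaches to the table it follows
                  // and must be terminated by a newline.
                  String sql =
                      "SELECT o.id, i.val\n" +
                      "FROM outer_table o, inner_table i --DERBY-PROPERTIES joinStrategy=NESTEDLOOP\n" +
                      "WHERE o.id = i.id";
                  try (Connection c = DriverManager.getConnection("jdbc:derby:perfdb");
                       Statement s = c.createStatement();
                       ResultSet rs = s.executeQuery(sql)) {
                      while (rs.next()) {
                          rs.getInt(1);
                      }
                  }
              }
          }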

          So here's what it looks to me as if the performance impact is:

          a) Short index scans get increased performance because they don't need
          to obtain the scan lock, and because saving the position is not needed
          since they complete before they release the latch.

          b) Long index scans don't get very much affected by the patch (there's
          extra cost in saving keys, but that only happens for every 16th row,
          and there's a per page reduction in cost by obtaining the scan lock)

          c) Nested loop joins use (b) to scan the outer table and (a) to scan
          the inner table, so they should not see any negative impact

          d) Long index scans without bulk fetching get lower performance
          because they save the position for every qualified row in the index

          So the only case identified so far which will get lower performance,
          is (d) long index scans without bulk fetching. This kind of scan may
          be used by updatable cursors, but I'm not aware of any other kind of
          queries that would disable bulk fetching (except when users set
          derby.language.bulkFetchDefault, but I don't think that's a very
          important case). The overhead appears to be in the range 1%-10%,
          depending on the key.

          Unless there are other common cases that I haven't thought of, it
          sounds like an acceptable overhead to me, given that

          a) updatable cursors aren't all that common, and if they are actually
          used to update the database, the extra CPU spent by the scan will be
          negligible

          b) shorter scans see a performance increase in the same order as the
          decrease seen by longer no-bulk scans

          c) concurrent reads and writes don't deadlock anymore unless there's a
          real lock conflict

          Comments?

          Bryan Pendleton added a comment -

          I think your performance testing has been superb: detailed and thorough.

          I'm comfortable with the performance impact of this change.

          The deadlock is a serious problem, experienced by many users, and I think
          this proposed fix is great. +1 from me to proceed with committing your patch.
          Let's get it into the codeline and start getting some wider experience with it.

          Knut Anders Hatlen added a comment -

          Thanks for your comments, Bryan. I agree that if we go for this approach, it is better to get the code in early to get as much testing as possible before the next release. But I think the preview patches need to be cleaned up and have some more comments before they're ready to be committed. I'm also about to start writing more functional tests, as I don't trust that the existing tests exercise all the new code paths. At the very least, I would like to have tests for all calls to reposition() in situations where the calls actually lead to a full repositioning from the root of the B-tree. Most of those cases are not tested by the existing tests because they would likely have led to timeouts.

          Mike Matrigali added a comment -

          I believe we should go forward with this change in the trunk. I am also OK with the tested
          performance trade-offs. I don't think we should backport a change of this magnitude; I think
          it is appropriate for a new feature release, hopefully 10.5. I especially like that after this change
          the code is simpler: the scan locking stuff was complicated, and it is great that it can be removed
          with performance sometimes increased and mostly not much affected.

          Once the change goes in there may be more work possible to improve performance, but I think
          it is fine to get the basic stuff in now. One thing that comes to mind is to change the default group
          fetch size from a fixed size to a page's worth. That has always been a future direction, and I think the
          interfaces are there (i.e. allow the store to set the size of the fetch, and pass variable-size groups back
          to the caller). This seems even more important because, once the latch is given up, repositioning costs may
          be higher than before.

          Knut, I would be happy to review the entire package one more time, but would rather wait until you do
          the cleanup you mention. Just post a comment when you are ready.

          Knut Anders Hatlen added a comment -

          Thanks Mike. I agree that this change should not be back-ported. Increasing the
          fetch size sounds like a good idea to me. I guess we should collect this and
          other ideas in JIRA issues before we close this issue. I'll post a comment when
          I have a cleaned up patch ready.

          Knut Anders Hatlen added a comment -

          Here's the first attempt to create a test that exercises the new code. It's only got two test cases for now, but I'm posting it anyway. More test cases will come.

          Test case 1 tests the call to BTreeScan.reposition() in BTreeMaxScan.fetchMaxFromBeginning().
          (BTreeMaxScan has another call to reposition() in fetchMax(), but as far as I can see it's impossible to reach that call, so I haven't added any test case for it.)

          Test case 2 tests the first call to reposition() in BTreeForwardScan.fetchRows() (there are four more in that method) when the leaf page on which the scan is positioned has been split after the position was saved. (Full repositioning from the root of the B-tree is required in this case.)

          #1 fails with the patch (assert in sane builds, NPE in insane builds), so there's more to investigate. It runs cleanly without the patch.

          #2 times out without the patch, and runs successfully with the patch.

          Committed revision 745866.

          Knut Anders Hatlen added a comment -

          Attaching a patch with more test cases for the calls to reposition() in BTreeForwardScan.fetchRows(). None of the added test cases fail on a clean trunk. One of them fails with the preview-1e patch. This is an expected failure (there's a TODO mentioning it in reposition()) where the page has disappeared because of an in-place compress before resuming the scan with a holdable cursor. It fails with an assert in sane builds when calling ControlRow.get(). I had expected it to work in insane builds, but then get() throws a NullPointerException. I guess this should either be resolved by allowing ControlRow.get() to return null when the page has disappeared, or by creating a variant of ControlRow.get() which is allowed to return null.

          Committed the new test cases to trunk with revision 746273.

          Knut Anders Hatlen added a comment -

          Attaching a patch with two more test cases for BTreeForwardScan.fetchRows(). The test cases exercise the code path taken when an index scan waits for a lock and the current leaf page has been split before it wakes up. One of the test cases is for unique indexes, and the other one is for non-unique indexes (two test cases were needed since the different indexes call reposition() from different places). Both of the test cases fail (lock timeout) on a clean trunk and pass with patch 1e.

          Committed revision 746954.

          Knut Anders Hatlen added a comment -

          I've been trying to write test cases for the rest of the calls to BTreeScan.reposition(), but I haven't managed to come up with any sensible test cases for IndexSplitDeadlockTest. Some of the calls appear to be unreachable unless we use the internal API. Others are reachable, but in order to test them better than they're already tested by other tests (for instance, testing that they work if there's been a split right before the repositioning), I think we'll need to write tests against the internal API for those calls as well. I'll defer writing tests against the internal API for now (of course, suggestions of how to test them via the public API are welcome) and instead add more test cases for the places where we are saving the position. I added a comment to IndexSplitDeadlockTest describing which calls to reposition() are not explicitly tested (revision 747371).

          Knut Anders Hatlen added a comment -

          I'm attaching a new patch (d2991-2a.diff) which has these changes
          compared to the preview-1e patch:

          • added more comments and removed mentioning of scan lock in (some of
            the) existing comments
          • removed the savedLockedPos flag from BTreeRowPosition and moved back
            to the original approach of just using current_rh != null to
            indicate that the row on the current position was locked. This also
            allows cheap repositioning by record id in more cases than before
            because current_rh is not null as often as it was with the
            preview-1e patch
          • fixed the bug exposed by testBTreeMaxScan_fetchMaxRowFromBeginning
            in IndexSplitDeadlockTest. The problem was that one of the methods
            was passed the scan_position instance variable as an argument. When
            that method called reopenScan(), the instance variable would be
            replaced, but the local variable would remain the same, and this
            inconsistency triggered an assert in the added code. Solved by
            removing the position from the argument list and instead using the
            instance variable directly
          • fixed the bug exposed by
            testBTreeForwardScan_fetchRows_resumeScanAfterCompress. Solved by
            fetching the page with another method that didn't fail if the page
            had disappeared, and by repositioning from the root of the B-tree if the
            page was removed by SYSCS_COMPRESS_TABLE.
          • updated canon for store/updatelocksJDBC30.sql which has been added
            to derbyall recently

          I have just realized that there are some more problems that need to be
          resolved, but I think they will have just minor effect on the patch,
          so I'm posting it anyway to allow others to take a look at it and see
          if there are more serious problems with it.

          Here's a description of the changes in each file touched by the patch
          (excluding the test files which were just updated so that they didn't
          expect scan locks in the lock tables, and so that they didn't expect
          index split deadlocks):

          • impl/store/access/sort/Scan.java

          Removed savePosition() method from the interface because the position
          is now always saved by key when the scan doesn't hold a latch on the
          leaf.

          • impl/store/access/btree/LeafControlRow.java

          Removed calls to Scan.saveScanPositions() and
          BTreeLockingPolicy.lockScan() before splitting the page, because
          positions are always saved by the scan itself now, and because we
          don't use scan locks anymore.

          Set a hint in the page object after splitting to notify the scans that
          they must reposition from the root of the B-tree after a split.

          • impl/store/access/btree/BTreeController.java

          Don't lock scan before purging committed deletes. Set a hint in the
          page object to tell the scans that they must reposition from the root
          of the B-tree since the row may not be there anymore.

          • impl/store/access/btree/BTreeScan.java

          Remove locking/unlocking of scan.

          Update savePosition() to allow the position to be saved by record id
          and by key at the same time, and make it possible to pass in a partial
          or a full key to reduce the number of slots that must be fetched from
          the page.

          Update reposition() to reposition by record id if possible and by key
          if needed. Repositioning by record id (which is cheaper) is possible
          if the row is guaranteed to be on the same page as when the position
          was saved (that is, no split or purge operation has been performed on
          the page after saving the position).

          Use savePositionAndReleasePage() when returning from the scan in
          delete(), fetch(), doesCurrentPositionQualify() and
          isCurrentPositionDeleted().

          • impl/store/access/btree/BTreeMaxScan.java

          Remove the BTreeRowPosition argument from fetchMaxRowFromBeginning()
          to prevent it from going out of sync with the instance variable
          scan_position when reopenScan() is called. Use the fresh instance
          variable instead.

          Remove references to the scan protection handle.

          • impl/store/access/btree/BTreeRowPosition.java

          Add more state (and methods to access the state):

          • parent of the position. That is, which scan owns this
            position. This was needed to allow the position to be saved from
            some methods that didn't know which scan it belonged to. Could
            alternatively be solved by passing the scan as an argument to
            those methods (or actually to those methods and their callers)
            which is probably cleaner, but could be performed in a later
            clean-up
          • version number of the current leaf page when the position was
            saved (used to determine whether full repositioning is needed
            because of split, etc.)
          • template for the key to save (to prevent allocation each time the
            key is saved). When the position is saved by key, this is
            identical to current_positionKey. A separate field was added so
            that we keep the object even when current_positionKey is nulled
            out, but a possibly cleaner solution would be to have just a flag
            telling whether the value in current_positionKey is valid, and
            never reset current_positionKey to null. Could be done in a later
            clean-up
          • fetch descriptor used to fetch the rest of the key in the cases
            where the scan has already fetched parts of the key before saving
            the position
          • impl/store/access/btree/BTreePostCommit.java

          purgeRowLevelCommittedDeletes() sets a hint in the page object to
          force scans to reposition from the root of the B-tree when at least
          one row has been purged from the page.

          I think the same change should have been made in
          purgeCommittedDeletes(). I missed it because the method assumed an
          exclusive table lock and therefore didn't need a scan lock. Will
          update the patch later.

          • impl/store/access/btree/OpenBTree.java

          Remove references to the removed saveScanPositions() method and to the
          protection record handle.

          Make the debug code that simulates release of latches save the
          position since that's what happens if the latches really are released
          by the production code now.

          • impl/store/access/btree/index/B2IRowLockingRR.java
          • impl/store/access/btree/index/B2INoLocking.java
          • impl/store/access/btree/index/B2IRowLocking1.java
          • impl/store/access/btree/index/B2IRowLocking3.java
          • impl/store/access/btree/BTreeLockingPolicy.java

          Remove request_scan_lock parameter.

          Remove code to lock/unlock scan.

          Save position of scan if lock cannot be granted immediately.

          • impl/store/access/btree/BTreeForwardScan.java

          Save position by key each time the scan returns a group of rows. Use
          the partial (possibly full) key fetched by the scan to make the save
          position operation cheaper.

          • impl/store/access/RAMTransaction.java
          • impl/store/access/heap/HeapScan.java
          • iapi/store/access/conglomerate/ScanManager.java
          • iapi/store/access/conglomerate/TransactionManager.java

          Remove saveScanPositions()/savePosition() because the position will
          already have been saved now since we always save the position when we
          release the latch.

          • impl/store/access/heap/HeapRowLocation.java

          Remove the THROWASSERT from, and complete the implementation of,
          setFrom(DataValueDescriptor) to allow the RowLocation in the index row
          to be copied when we save the position.

          • impl/store/raw/data/BasePage.java
          • iapi/store/raw/Page.java

          Remove reference to protection record handle.

          Add code to set the hint that repositioning from the root of the
          B-tree is needed.

          • iapi/store/raw/RecordHandle.java

          Remove constant identifying the record protection handle that we no
          longer use.

          Knut Anders Hatlen added a comment -

          I know there are at least a couple of issues with the 2a patch, but
          I'm setting the patch-available flag anyway. I believe resolving
          those issues won't require big changes, so the patch should already
          be in a state where it is ready for review.

          The issues I'm aware of that need to be resolved, are these:

          • BTreePostCommit.purgeCommittedDeletes() needs to call
            Page.setRepositionNeeded() to tell index scans that they need to
            reposition by key. Since purgeCommittedDeletes() is only called in a
            separate transaction that has an exclusive table lock, I believe
            that this issue could only affect holdable index scans that
            reposition after a commit.
          • Some callers of BTreeScan.reposition(pos,false) have
            comments/asserts stating that the current row cannot have been
            purged because they hold the scan lock. Now that the scan lock is
            gone, I think rows can be purged from the page while we have
            released the latch, so we might need to change how those callers
            handle that situation. In most cases (except read-uncommitted
            scans) the scan holds a lock on the current row, so it remains
            true that the row cannot have been purged.

          Feedback on the patch as it is would be appreciated. I'll probably be
          offline for a couple of days, so I may not respond immediately to
          questions/comments.

          Knut Anders Hatlen added a comment -

          Here's an updated patch (d2991-2b.diff) which addresses the two known
          issues I mentioned in the 2a patch:

          1) Call Page.setRepositionNeeded() in BTreePostCommit.purgeCommittedDeletes()
          when a row has been purged.
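
          To make the mechanism concrete, here is a minimal sketch of how a scan
          could consume such a hint when it repositions after releasing the latch.
          Only Page.setRepositionNeeded() is named by the patch; the other names
          (isRepositionNeeded(), recordId, the helper methods) are hypothetical
          simplifications, not Derby's actual internals:

          // Hypothetical sketch only -- not Derby's real classes. It illustrates
          // the "reposition from the root when the page has changed" idea.
          class LeafPage {
              private boolean repositionNeeded;   // set by splits and purges
              void setRepositionNeeded()  { repositionNeeded = true; }
              boolean isRepositionNeeded() { return repositionNeeded; }
          }

          class ScanPosition {
              LeafPage leaf;        // leaf the position was saved on
              Object recordId;      // cheap handle, valid only while the page is unchanged
              Object positionKey;   // full key, usable for repositioning from the root
          }

          class RepositionSketch {
              /** Re-establish the scan position after the latch on the leaf was released. */
              boolean reposition(ScanPosition pos) {
                  if (pos.recordId != null && !pos.leaf.isRepositionNeeded()) {
                      // No split or purge happened: repositioning by record id is cheap.
                      return latchAndFindByRecordId(pos);
                  }
                  // The page changed (or no record id was saved): search the B-tree
                  // from the root using the saved key.
                  return searchFromRootByKey(pos.positionKey);
              }

              private boolean latchAndFindByRecordId(ScanPosition pos) { return true; }
              private boolean searchFromRootByKey(Object key) { return true; }
          }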

          2) Handle the cases where reposition() can return false (that is, second
          argument to reposition() is false and the row on the current position has been
          purged). This led to the following changes:

          • BTreeScan.positionAtDoneScanFromClose()
          • BTreeScan.reopenScan()

          Removed the calls to reposition(). The only reason I could see for these
          methods to call reposition() was that some implementations of
          BTreeLockingPolicy.unlockScanRecordAfterRead() had asserts that checked that
          the page of the current position was latched. Removing the calls (and the
          asserts) made the code simpler and removed the need for special handling if
          reposition() was unsuccessful.

          • B2IRowLockingRR.unlockScanRecordAfterRead()
          • B2IRowLocking2.unlockScanRecordAfterRead()

          Don't assert that the current leaf is latched, as there is no need for that
          latch in order to unlock the record. (See above.)

          • BTreeScan.delete()
          • BTreeScan.doesCurrentPositionQualify()
          • BTreeScan.fetch()
          • BTreeScan.isCurrentPositionDeleted()

          Make sure that we don't try to release the latch on the current leaf unless
          we have actually latched it, since the leaf won't be latched if reposition()
          returns false. No other special handling of purged rows is needed in those
          methods, I think. delete() and fetch() throw an exception
          (AM_RECORD_NOT_FOUND) if the row has been purged, which sounds reasonable to
          me. doesCurrentPositionQualify() and isCurrentPositionDeleted() use the
          return value from reposition() to decide what to return themselves,
          which also sounds fine to me (except that I would expect
          isCurrentPositionDeleted() to return true if the row was purged, but
          currently it returns false – I will file a separate bug for that).
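
          As an illustration of the guard described above, here is a minimal
          sketch of how a fetch()-like method can avoid releasing a latch it never
          acquired. The names are invented; Derby's real fields, exceptions and
          latch handling differ:

          // Illustrative only; names invented, not Derby's actual internals.
          class LatchGuardSketch {
              Object currentLeafLatch;   // non-null only while the current leaf is latched

              /** Stand-in for BTreeScan.reposition(pos, false); latches the leaf on success. */
              boolean reposition(boolean missingRowsOK) {
                  // ... on success: currentLeafLatch = <latched leaf> ...
                  return currentLeafLatch != null;
              }

              void fetchCurrentRow() {
                  boolean rowStillThere = reposition(false);
                  try {
                      if (!rowStillThere) {
                          // Row was purged while the latch was released; surface it the
                          // same way as a missing record (AM_RECORD_NOT_FOUND in Derby).
                          throw new IllegalStateException("record not found");
                      }
                      // ... read the columns of the current row ...
                  } finally {
                      // Release the latch only if reposition() actually acquired it.
                      if (currentLeafLatch != null) {
                          currentLeafLatch = null;   // stands in for unlatching the page
                      }
                  }
              }
          }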

          • BTreeMaxScan.fetchMaxRowFromBeginning()
          • BTreeForwardScan.fetchRows()

          If the row on the current position of the scan has been purged while we
          were waiting for a lock, so that reposition(pos,false) returns false, we
          call reposition() again with the second argument set to true to reposition
          on the row immediately to the left of where the purged row used to be.
          This effectively takes one step back in the scan, so we need to jump to
          the top of the loop's body to move one step forward past the purged row.
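
          The control flow is roughly the following. This is a hedged, schematic
          sketch with invented helper names; the real loops in
          BTreeForwardScan.fetchRows() and BTreeMaxScan.fetchMaxRowFromBeginning()
          are considerably more involved:

          // Schematic sketch only: helper names are placeholders, not Derby's internals.
          class PurgedRowSketch {
              void fetchRows(Object pos) {
                  while (moreRowsWanted(pos)) {
                      // Top of the loop body: move one step forward from the current position.
                      if (!moveToNextRow(pos)) {
                          break;                              // end of the index
                      }
                      if (!lockRowWaitingIfNeeded(pos)) {
                          // Waiting for the lock forced us to release the latch, and the
                          // row on the saved position was purged, so reposition(pos, false)
                          // failed. Step back onto the row immediately to the left ...
                          repositionOnLeftNeighbour(pos);     // reposition(pos, true)
                          // ... and restart the loop body so its forward step carries the
                          // scan past the purged position.
                          continue;
                      }
                      returnCurrentRow(pos);
                  }
              }

              // Placeholder helpers standing in for the real BTreeScan machinery.
              private boolean moreRowsWanted(Object pos)          { return false; }
              private boolean moveToNextRow(Object pos)           { return false; }
              private boolean lockRowWaitingIfNeeded(Object pos)  { return true; }
              private void repositionOnLeftNeighbour(Object pos)  { }
              private void returnCurrentRow(Object pos)           { }
          }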

          I tested that reposition(pos,false) followed by reposition(pos,true) worked by
          setting a breakpoint in the debugger and manually changing values in the page
          object and in the position to make the scan code believe that the row had been
          purged. As far as I could tell, it worked just as if the scan had found a
          deleted row. (There are currently no tests that exercise code paths where
          reposition() returns false, and I don't see any easy way to write a test for it
          since it would be highly dependent on timing between user threads and service
          threads.)

          This patch fixes all the issues I'm aware of in the previous patch. Derbyall
          and suites.All ran cleanly. Reviews, comments and questions would be
          appreciated. Thanks.

          Knut Anders Hatlen added a comment -

          I would feel more comfortable with committing this patch if another pair of eyes had looked at it first, but since it was suggested earlier that it would be good to get the fix into the codeline as soon as possible to get more experience with it, I plan to commit the patch on Monday before the 10.5 branch is created unless I hear anything. Please let me know if you have objections to this approach. Of course, reviews and comments are welcome even after the patch has been committed.

          Knut Anders Hatlen added a comment -

          Committed the 2b patch with revision 754894.

          I think there are still references to the scan locks in some comments. I will try to track them down and update them in a follow-up.

          Knut Anders Hatlen added a comment -

          IndexSplitDeadlockTest has now been enabled as part of suites.All (patch test-4.diff) since the main fix has gone in. Committed revision 754900.
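
          For reference, enabling a JUnit test in Derby's suites typically amounts
          to registering it in the relevant suite class. The snippet below shows the
          usual convention as an assumption; it is not the literal contents of
          test-4.diff:

          import junit.framework.Test;
          import junit.framework.TestSuite;
          import org.apache.derbyTesting.functionTests.tests.store.IndexSplitDeadlockTest;

          // Assumed convention only: the store test suite registers the new test
          // so that suites.All picks it up.
          public class StoreSuiteSketch {
              public static Test suite() {
                  TestSuite suite = new TestSuite("store");
                  // ... other store tests ...
                  suite.addTest(IndexSplitDeadlockTest.suite());
                  return suite;
              }
          }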

          Myrna van Lunteren added a comment -

          Can this issue be closed as fixed in 10.5 and subsequent/remaining issues be pursued in a new JIRA?

          Knut Anders Hatlen added a comment -

          Yes, I think we can mark this issue as resolved now and address remaining cleanup, and bugs if they turn up, in separate issues.

          Mamta A. Satoor added a comment -

          I recently merged the changes for DERBY-3926 into the 10.5.1.2 codeline (revision 784809) and ran the junit tests on the merged code. The tests finished with one error: "A lock could not be obtained within the time requested". Kathey recommended that I post that failure here since it may be related to this jira entry. Following is the stack trace from the text junit runner (junit.textui.TestRunner):
          There was 1 error:
          1) testBTreeForwardScan_fetchRows_resumeAfterWait_nonUnique_split(org.apache.derbyTesting.functionTests.tests.store.IndexSplitDeadlockTest)java.sql.SQLException: A lock could not be obtained within the time requested
          at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(SQLExceptionFactory.java:45)
          at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Util.java:201)
          at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(TransactionResourceImpl.java:391)
          at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(TransactionResourceImpl.java:346)
          at org.apache.derby.impl.jdbc.EmbedConnection.handleException(EmbedConnection.java:2201)
          at org.apache.derby.impl.jdbc.ConnectionChild.handleException(ConnectionChild.java:81)
          at org.apache.derby.impl.jdbc.EmbedResultSet.closeOnTransactionError(EmbedResultSet.java:4338)
          at org.apache.derby.impl.jdbc.EmbedResultSet.movePosition(EmbedResultSet.java:467)
          at org.apache.derby.impl.jdbc.EmbedResultSet.next(EmbedResultSet.java:371)
          at org.apache.derbyTesting.functionTests.tests.store.IndexSplitDeadlockTest.testBTreeForwardScan_fetchRows_resumeAfterWait_nonUnique_split(IndexSplitDeadlockTest.java:489)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:79)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at org.apache.derbyTesting.junit.BaseTestCase.runBare(BaseTestCase.java:106)
          at junit.extensions.TestDecorator.basicRun(TestDecorator.java:22)
          at junit.extensions.TestSetup$1.protect(TestSetup.java:19)
          at junit.extensions.TestSetup.run(TestSetup.java:23)
          at org.apache.derbyTesting.junit.BaseTestSetup.run(BaseTestSetup.java:57)
          at junit.extensions.TestDecorator.basicRun(TestDecorator.java:22)
          at junit.extensions.TestSetup$1.protect(TestSetup.java:19)
          at junit.extensions.TestSetup.run(TestSetup.java:23)
          at org.apache.derbyTesting.junit.BaseTestSetup.run(BaseTestSetup.java:57)
          Caused by: ERROR 40XL1: A lock could not be obtained within the time requested
          at org.apache.derby.iapi.error.StandardException.newException(StandardException.java:276)
          at org.apache.derby.impl.services.locks.ConcurrentLockSet.lockObject(ConcurrentLockSet.java:602)
          at org.apache.derby.impl.services.locks.ConcurrentLockSet.zeroDurationLockObject(ConcurrentLockSet.java:855)
          at org.apache.derby.impl.services.locks.AbstractPool.zeroDurationlockObject(AbstractPool.java:297)
          at org.apache.derby.impl.store.raw.xact.RowLocking2nohold.lockRecordForRead(RowLocking2nohold.java:89)
          at org.apache.derby.impl.store.access.heap.HeapController.lockRow(HeapController.java:520)
          at org.apache.derby.impl.store.access.heap.HeapController.lockRow(HeapController.java:638)
          at org.apache.derby.impl.store.access.btree.index.B2IRowLocking3.lockRowOnPage(B2IRowLocking3.java:309)
          at org.apache.derby.impl.store.access.btree.index.B2IRowLocking3._lockScanRow(B2IRowLocking3.java:599)
          at org.apache.derby.impl.store.access.btree.index.B2IRowLockingRR.lockScanRow(B2IRowLockingRR.java:105)
          at org.apache.derby.impl.store.access.btree.BTreeForwardScan.fetchRows(BTreeForwardScan.java:305)
          at org.apache.derby.impl.store.access.btree.BTreeScan.fetchNextGroup(BTreeScan.java:1585)
          at org.apache.derby.impl.sql.execute.BulkTableScanResultSet.reloadArray(BulkTableScanResultSet.java:327)
          at org.apache.derby.impl.sql.execute.BulkTableScanResultSet.getNextRowCore(BulkTableScanResultSet.java:282)
          at org.apache.derby.impl.sql.execute.BasicNoPutResultSetImpl.getNextRow(BasicNoPutResultSetImpl.java:460)
          at org.apache.derby.impl.jdbc.EmbedResultSet.movePosition(EmbedResultSet.java:427)
          ... 34 more

          FAILURES!!!
          Tests run: 9258, Failures: 0, Errors: 1

          Knut Anders Hatlen added a comment -

          Yes, that test was added here. It requires some coordination between two threads, so my first guess would be that there is a timing issue in the test. Please file a separate JIRA issue for this failure. Thanks.

          Mamta A. Satoor added a comment -

          Knut, thanks for looking at the stack trace. I added jira DERBY-4273 for the test failure.


            People

            • Assignee:
              Knut Anders Hatlen
            • Reporter:
              Bogdan Calmac
            • Votes:
              13
            • Watchers:
              10
