Derby / DERBY-642

SELECT MAX doesn't use indices optimally

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 10.2.1.6
    • Fix Version/s: 10.8.1.2
    • Component/s: Store
    • Labels:
      None
    • Urgency:
      Low
    • Bug behavior facts:
      Performance

      Description

      I tried running SELECT MAX on an indexed column in a big (8 GB)
      table. It took 12 minutes, which is about 12 minutes more than I
      expected.

      After a bit of investigation, I found out that a full index scan was
      performed because all the rows referenced from the rightmost B-tree
      node were actually deleted.

      Possible improvements:

      1. Implement backwards scan in the B-trees (this is also suggested
      in the comments in BTreeMaxScan).

      2. Go up to the parent node and down to the next leaf node on the
      left side, and continue until a valid max value is found. In
      Derby, traversing up in a B-tree is not allowed, but would it be
      ok to go up if the latches were kept on the higher-level nodes in
      the tree? Of course, this would have negative impact on
      concurrency.

      3. Right-to-left traversal on the leaf level is possible (using
      ControlRow.getLeftSibling()), but will throw an exception if the
      page cannot be latched without waiting. It is therefore possible
      to try a right-to-left search for the max value, and only fall
      back to the left-to-right approach if a conflict arises.
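      The fallback strategy in item 3 can be sketched with simplified
      stand-ins (a minimal sketch: the page/row model, WaitError, and the
      method names below are invented for illustration and are not Derby's
      real classes):

```java
import java.util.Arrays;
import java.util.List;

public class MaxScanSketch {
    // Stand-in for Derby's WaitError: latch could not be taken nowait.
    static class WaitError extends Exception {}

    // A leaf page is a list of values; null marks a deleted row.
    static List<List<Integer>> leaves = Arrays.asList(
        Arrays.asList(1, 5, 9),
        Arrays.asList(12, null, 17),
        Arrays.<Integer>asList(null, null)   // rightmost leaf: all rows deleted
    );

    // Try a right-to-left search first; on a simulated latch conflict,
    // fall back to the safe left-to-right full scan.
    static Integer fetchMax(boolean conflictOnLeftSibling) {
        try {
            return scanBackward(conflictOnLeftSibling);
        } catch (WaitError e) {
            return scanForward();
        }
    }

    static Integer scanBackward(boolean conflict) throws WaitError {
        for (int p = leaves.size() - 1; p >= 0; p--) {
            // Simulate failing to latch a left sibling without waiting.
            if (conflict && p < leaves.size() - 1) throw new WaitError();
            List<Integer> page = leaves.get(p);
            for (int i = page.size() - 1; i >= 0; i--) {
                if (page.get(i) != null) return page.get(i);
            }
        }
        return null;
    }

    static Integer scanForward() {
        Integer max = null;
        for (List<Integer> page : leaves)
            for (Integer v : page)
                if (v != null && (max == null || v > max)) max = v;
        return max;
    }

    public static void main(String[] args) {
        System.out.println(fetchMax(false)); // backward scan skips the empty leaf
        System.out.println(fetchMax(true));  // conflict forces the fallback scan
    }
}
```

      Either path finds the same maximum; the backward path just avoids
      visiting the bulk of the index when the rightmost rows are deleted.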

      1. derby-642-1a.diff
        21 kB
        Knut Anders Hatlen
      2. derby-642-1b-withTests.diff
        42 kB
        Knut Anders Hatlen
      3. derby-642-2a-test-serializable.diff
        4 kB
        Knut Anders Hatlen
      4. derby-642-3a-waitBeforeRetry.diff
        2 kB
        Knut Anders Hatlen
      5. max.sql
        2 kB
        Rick Hillegas
      6. max2.sql
        0.2 kB
        Rick Hillegas
      7. Timestamper.java
        0.4 kB
        Rick Hillegas

        Issue Links

          Activity

          Rick Hillegas added a comment -

          Attaching Timestamper.java, max.sql, and max2.sql. Together these form a test which verifies the value of this improvement. Timestamper is a Derby function which reports the number of seconds since it was called last. The max.sql script creates a table which has 3M rows and then indexes it. One out of every 100 rows contains a null. The max2.sql script selects the max value from the table and reports how long the aggregate took. When run on Derby 10.7.1, the aggregate takes 10 seconds. When run on 10.8.1, the aggregate returns the correct value immediately. Note that there is some instability in Derby which affects the max.sql script. I have noticed that sometimes the index creation dies on an OOM error. This happens as far back as 10.5.3.

          Knut Anders Hatlen added a comment -

          I'm marking this issue as fixed in 10.8 so that it turns up in the release notes.

          There's still the unaddressed issue with how to prevent phantom reads in serializable transactions with row lock granularity. Since BTreeMaxScan always uses table locks when the isolation level is serializable, it won't cause any problems currently, so I believe it is sufficient to address this issue when we add full support for backward scans. I'll add a comment about it in DERBY-884.

          Knut Anders Hatlen added a comment -

          Committed the 3a patch with revision 1075248.

          Knut Anders Hatlen added a comment -

          Attaching derby-642-3a-waitBeforeRetry.diff which implements an optimization mentioned in an earlier comment, and also discussed in DERBY-884.

          When the left sibling cannot be latched immediately, and we get a WaitError, we now try to get the latch on the left sibling again, possibly waiting for it, before we reposition in the B-tree. We save the position and release the latch on the current page before we do this, so waiting for the latch shouldn't be causing any deadlocks. We also release the latch we've just obtained on the left sibling before we do the repositioning, since we don't know if that's still the page we want to move to (anything may have happened while we didn't hold any latches).

          This change avoids the case where a scan gets a WaitError, then immediately retries just to get a new WaitError. It may need to try many times before it's actually able to reposition and move to the left sibling. The changes in the patch make the scan wait until the other thread has released the latch, and then it's much more likely that it'll be able to retry the operation without getting a WaitError.

          Running BTreeMaxScanTest with the patch and tracing turned on, I see that the amount of log produced by the tracing is reduced by about 80%, which indicates that the scans spend less time on unsuccessful retries.
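          The wait-before-retry flow described above might be sketched like
          this (a hypothetical model: page latches stand in as
          ReentrantLocks, and moveLeft is an invented name, not Derby's
          real API):

```java
import java.util.concurrent.locks.ReentrantLock;

public class WaitBeforeRetrySketch {
    // Try to move from the current page to its left sibling. On a
    // nowait-latch failure: release our own latch (so waiting cannot
    // deadlock), wait for the sibling's latch, release it again (it may
    // no longer be the page we want), and tell the caller to reposition.
    static boolean moveLeft(ReentrantLock current, ReentrantLock leftSibling) {
        if (leftSibling.tryLock()) {   // nowait latch succeeded
            current.unlock();
            return true;               // now positioned on the sibling
        }
        // savePosition() would run here, then:
        current.unlock();              // don't hold a latch while waiting
        leftSibling.lock();            // block until the holder is done
        leftSibling.unlock();          // page may have changed meanwhile
        return false;                  // caller must reposition and retry
    }

    public static void main(String[] args) throws Exception {
        ReentrantLock cur = new ReentrantLock();
        ReentrantLock left = new ReentrantLock();

        cur.lock();
        System.out.println(moveLeft(cur, left)); // uncontended: succeeds
        left.unlock();

        cur.lock();
        Thread holder = new Thread(() -> {       // another scan holds the latch
            left.lock();
            try { Thread.sleep(100); } catch (InterruptedException e) {}
            left.unlock();
        });
        holder.start();
        Thread.sleep(20);                        // let the holder grab it
        System.out.println(moveLeft(cur, left)); // waits, then asks for retry
        holder.join();
    }
}
```

          The point of the blocking lock/unlock pair is only to delay the
          retry until the conflicting latch has actually been released.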

          Knut Anders Hatlen added a comment -

          Attaching a patch (2a) which adds a test for the repositioning logic after waiting for a lock in serializable transactions. The test doesn't reveal any problems, but that's because the B-tree max scans are currently always performed with table locks in serializable transactions.

          The test case does expose the phantom read problem discussed above if we switch to record locking for those scans. That can be achieved by disabling the following code in FromBaseTable:

          /*
          ** Figure out whether to do row locking or table locking.
          **
          ** If there are no start/stop predicates, we're doing full
          ** conglomerate scans, so do table locking.
          */
          if (! startStopFound)
          {
              currentAccessPath.setLockMode(
                  TransactionController.MODE_TABLE);
              optimizer.trace(Optimizer.TABLE_LOCK_NO_START_STOP,
                  0, 0, 0.0, null);
          }

          I'm not planning to switch to record locking for max scans as part of this issue, but I'd like to have a test case for it in any case.

          Committed revision 1074196.

          Knut Anders Hatlen added a comment -

          As to the suggestion with previous key locking for serializable, we still have the problem that we need to save another position than the one we're trying to lock. When we have to wait for the previous key lock, it's the current key we want to save, whereas the locking code will save the key we're trying to lock (that is, the previous key). So even if we use previous key lock calls to solve this problem, we'll need to get the code to save the position immediately to the right of the row that we're waiting for.

          Knut Anders Hatlen added a comment -

          Committed the 1b patch to trunk with revision 1071171.

          Knut Anders Hatlen added a comment -

          Hi Dag,

          Yes, that's what I was thinking. To be precise, we reposition to the row that was immediately to the right when we started waiting. When we wake up, there may be many rows between the row we waited for and the row we reposition to. In the current patch, we reposition on the row we're waiting for and then move one row to the right. These two approaches should be equivalent in the case where no potential phantom rows have been inserted while we were waiting.

          There may be a simpler solution, though. The existing code doesn't take previous key locks in the max scan, with comments saying that previous key locks aren't necessary for a max scan (which is true as long as the scan is restarted once we have to wait for a lock, like the existing code does). But it should be possible to perform previous key locking while moving backwards too by always locking the row immediately to the left of the one we're interested in. Then we'd always be one row ahead with the locking in serializable transactions, and even if we have to wait for a lock, the current position will be protected by the lock we acquired in the previous iteration, so it should be safe to reposition back to that position.
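          The "one row ahead" invariant of that scheme could be
          illustrated as follows (a toy model; backwardScanEvents is an
          invented helper that only records the order of lock and visit
          events, with rows numbered left to right):

```java
import java.util.ArrayList;
import java.util.List;

public class PrevKeyLockSketch {
    // For a backward scan over rows [0..n-1], lock the row immediately
    // to the LEFT of the row we are about to visit. Every visited row
    // was then locked one iteration earlier, so if we must wait for the
    // next lock, the current position is already protected.
    static List<String> backwardScanEvents(int n) {
        List<String> events = new ArrayList<>();
        events.add("lock " + (n - 1));             // rightmost row first
        for (int i = n - 1; i >= 0; i--) {
            if (i > 0) events.add("lock " + (i - 1)); // previous key lock
            events.add("visit " + i);
        }
        return events;
    }

    public static void main(String[] args) {
        // [lock 2, lock 1, visit 2, lock 0, visit 1, visit 0]
        System.out.println(backwardScanEvents(3));
    }
}
```

          In the event list, every "lock i" precedes the corresponding
          "visit i", which is the property that would make repositioning
          back to the current row safe.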

          Unfortunately, it's somewhat difficult to test whether this works as long as the optimizer insists on table locking for BTreeMaxScan in serializable transactions. I haven't yet found out exactly where this decision is made.

          Dag H. Wanvik added a comment -

          Thanks for the explanation! So, if I read this correctly, you will reposition to the position to the immediate right of the one we had to wait for, instead of repositioning to the one we had to wait for? This would seem to protect against the phantom anomaly, I agree.

          Knut Anders Hatlen added a comment -

          Attaching an updated patch (1b) with the following changes from 1a:

          • Changed class javadoc for BTreeMaxScan to describe the new approach
            instead of the old one.
          • Added a debug flag to allow tracing latch conflicts in sane builds.
          • Added a new test (BTreeMaxScanTest) with test cases that exercise
            the code that handles some of the latch conflict scenarios.

          If the test is run with derby.tests.trace=true, and a sane build is
          used, a message will be printed to derby.log each time a latch
          conflict that results in the need for repositioning is experienced by
          a BTreeMaxScan. This can be used to verify that the test actually
          exercised the code path we're interested in. For example, the test
          that verifies that backward scans don't deadlock with backward scans,
          will print something like this to derby.log:

          DEBUG BTreeMaxScan.latchConflict OUTPUT: Enable tracing for testOppositeScanDirections
          DEBUG BTreeMaxScan.latchConflict OUTPUT: Couldn't get latch nowait, will retry
          DEBUG BTreeMaxScan.latchConflict OUTPUT: Couldn't get latch nowait, will retry
          [...]
          DEBUG BTreeMaxScan.latchConflict OUTPUT: Disable tracing for testOppositeScanDirections

          And the test that verifies that the scan is restarted after a latch
          conflict when moving from an empty page, will print something like
          this:

          DEBUG BTreeMaxScan.latchConflict OUTPUT: Enable tracing for testEmptyRightmostLeaf
          DEBUG BTreeMaxScan.latchConflict OUTPUT: Couldn't get latch nowait, will retry
          DEBUG BTreeMaxScan.latchConflict OUTPUT: Restart scan from rightmost leaf
          DEBUG BTreeMaxScan.latchConflict OUTPUT: Couldn't get latch nowait, will retry
          DEBUG BTreeMaxScan.latchConflict OUTPUT: Restart scan from rightmost leaf
          [...]
          DEBUG BTreeMaxScan.latchConflict OUTPUT: Disable tracing for testEmptyRightmostLeaf

          Since latch conflicts are hard to test reliably, the test runs many
          iterations to increase the likelihood of a conflict, but there's no
          guarantee that it actually has been tested. That's why the tracing is
          added. Currently, the test takes almost a minute to run. This could be
          reduced by cutting down on the number of iterations, and we should
          probably do that when the feature has been stabilized, but I'll keep
          the high number of iterations for now.

          Although there still are some issues that need to be resolved, and
          more tests need to be written, I intend to commit this patch soon so
          that we can get as much testing as possible before the next release.
          If we haven't gained enough confidence in it by then, we can always
          back out the changes.

          As to the issues I think need to be resolved, or at least investigated
          further, I'm particularly thinking about what's the right way to do
          the repositioning after waiting for a lock. The old code would give up
          and start from the beginning of the B-tree in that case, whereas the
          new code will reposition on the row we had to wait for and continue
          there.

          I think that may be too simple, at least for serializable
          transactions. Since we're moving backwards, the previous key locking
          works a bit differently. While we're waiting for the lock on the current
          row, the range between the row we're waiting for and the (deleted)
          record that we just saw is not protected. So when we get the lock, a
          row with a higher value may have been inserted. We won't see this row
          after the repositioning, and it will turn up as a phantom read
          (thereby breaking the requirements for the serializable isolation
          level) if we re-execute select max in the same transaction.

          Now this doesn't seem to cause any problem in practice, because in all
          the tests I've run, B-tree max scans in serializable transactions take
          table locks, preventing any concurrent modification. However, it would
          be good to get this right, especially so that the backward max scan
          can be extended to a full-featured backward scan implementation in
          DERBY-884, and also so that max scans at some point can safely be
          performed without table locks in serializable transactions.

          One possible solution may be to change BTreeScan.savePosition() to
          save the position immediately to the right of the current position
          when we scan backwards. That should prevent skipping rows that were
          inserted while we waited for the lock on the current position, since
          the range from the last visited row and up to infinity should be
          protected by the locks already acquired. Moving one step to the left
          from that saved position should get us back to the same position if no
          rows were inserted, or to the newly inserted row with the highest
          value otherwise.
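          The suggested savePosition() change can be illustrated with a
          toy sorted-list model (repositionBackward is an invented
          stand-in for the behaviour described, not Derby's BTreeScan
          API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SavePositionSketch {
    // Model an index leaf as a sorted list of keys. A backward scan
    // positioned on key k saves the key immediately to its RIGHT;
    // repositioning finds that saved key again and steps one slot left,
    // landing either back on k, or on a higher key inserted meanwhile.
    static int repositionBackward(List<Integer> keys, int savedRightKey) {
        int idx = keys.indexOf(savedRightKey);  // protected by earlier locks
        return keys.get(idx - 1);               // one step to the left
    }

    public static void main(String[] args) {
        List<Integer> keys = new ArrayList<>(Arrays.asList(10, 20, 30, 40));
        // Scan was on 30 and saved the position of 40, its right neighbour.
        System.out.println(repositionBackward(keys, 40)); // back on 30
        keys.add(3, 35);                        // phantom candidate inserted
        System.out.println(repositionBackward(keys, 40)); // sees 35 first
    }
}
```

          Because the range from the saved key up to infinity is already
          covered by acquired locks, the newly inserted key cannot be
          missed by the resumed scan.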

          Knut Anders Hatlen added a comment -

          Here's a patch that attempts to solve this issue by scanning the
          B-tree backwards to find the maximum value, along the lines sketched
          out in DERBY-884. The patch is not ready for commit yet since it
          contains no tests and some comments in the BTreeMaxScan still describe
          the old behaviour. There are also some parts that need a little more
          investigation, see more below.

          The patch passes suites.All and derbyall, and I have been able to
          exercise most of the new code in manual tests and it seems to be
          well-behaved. Some of the code paths are very difficult to test
          reliably, though, since they're so timing dependent.

          The patch changes the code in the following way:

          BTreeScan.java:
          • Added a new method positionAtPreviousPage(), which is similar to
            positionAtNextPage() except that it moves in the opposite direction
            and that it throws a WaitError instead of waiting for a latch in case
            of a conflict.

          Another difference is that positionAtPreviousPage() skips empty leaf
          pages. This is to avoid the problem with saving the current position
          when there is no key to save on the current page. When scanning in the
          forward direction, we only save the position when we have to wait for
          a row lock or we have filled the internal fetch buffer. In both cases
          we are positioned on a row when we save the position, and then we just
          need to store the key on the current row in order to save the
          position.

          When scanning in the backward direction, however, we may also have to
          save the position when moving from one leaf page to its sibling page,
          if the sibling page cannot be latched immediately. Leaf pages can be
          empty, and then there is no key on the page that we can store and use
          for repositioning later. So in order to be sure that we have a
          position to save in the case where we have to wait for a latch,
          positionAtPreviousPage() holds the latch on the current page until it
          finds the first non-empty leaf page on its left side.
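          The empty-page skipping behaviour might be modelled like this
          (a sketch only: leaves are plain lists and the latching
          described above is omitted entirely):

```java
import java.util.Arrays;
import java.util.List;

public class PrevPageSketch {
    // Per the description of positionAtPreviousPage(): step left from
    // the current leaf until a non-empty leaf is found, so the scan is
    // never left positioned on a page with no key to save. Returns the
    // index of that leaf, or -1 if there is none to the left.
    static int positionAtPreviousPage(List<List<Integer>> leaves, int current) {
        for (int p = current - 1; p >= 0; p--) {
            if (!leaves.get(p).isEmpty()) return p;
        }
        return -1;   // reached the leftmost edge: scan is done
    }

    public static void main(String[] args) {
        List<List<Integer>> leaves = Arrays.asList(
            Arrays.asList(1, 2),
            Arrays.<Integer>asList(),   // empty leaf, skipped
            Arrays.<Integer>asList(),   // empty leaf, skipped
            Arrays.asList(7));
        System.out.println(positionAtPreviousPage(leaves, 3)); // lands on page 0
    }
}
```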

          BTreeMaxScan.java:

          • Removed the method fetchMaxRowFromBeginning(), since the code now
            always fetches the max row from the end.
          • Made fetchMax() continue on the previous page instead of giving up
            and starting from the beginning in case no qualifying row is found on
            the last page.
          • Made fetchMax() reposition using the saved position in case of a
            lock conflict. Previously, it would have started a full scan from the
            beginning of the index in that case.
          • Added new method moveToLeftSibling(), used by fetchMax() to move to
            the logically next page. This method saves the current position if
            positionAtPreviousPage() throws a WaitError so that fetchMax() can
            release all latches and reposition on the exact same spot.

          Some things that differ between repositioning in the forward direction
          and in the backward direction:

          a) As mentioned above, we may need to save the position when moving
          from one page to another, and saving the position by key value isn't
          possible if the current page is empty. positionAtPreviousPage() skips
          empty pages so that we won't have to save the position on an empty
          page mid-scan. However, the rightmost leaf page may be empty, in which
          case the scan starts on an empty page. If we need to wait for a latch
          when moving from the rightmost leaf and that leaf is empty, we don't
          attempt to save the position. Instead of repositioning the scan, we
          will restart it completely, which should be a safe thing to do since
          we haven't seen any rows yet.

          b) After repositioning, we should be positioned on the exact same row
          as where we were when the position was saved. However, if the row has
          been purged in the meantime, reposition() will place us immediately to
          the left of where the row should have been. This is fine for forward
          scans, but not for backward scans. Backward scans should be positioned
          immediately to the right of that position (an issue also raised by
          Mike in DERBY-884). I didn't find any existing code to position us on
          the right side of the purged row (there's code to do that for partial
          key scans, but repositioning uses full keys). So in the case where
          reposition() can't position on the exact same key as the one we saved,
          the patch now adds one to the current slot number, which moves the
          position one slot to the right.
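          The slot adjustment in (b) can be sketched with a plain list
          standing in for a leaf page (repositionSlot is an invented
          stand-in for the reposition() behaviour described, not Derby's
          actual method):

```java
import java.util.Arrays;
import java.util.List;

public class RepositionAfterPurgeSketch {
    // When the saved key has been purged, reposition() leaves the scan
    // immediately to the LEFT of where the key would have been, which
    // is fine for forward scans. A backward scan must instead sit
    // immediately to the RIGHT, so one is added to the slot number.
    static int repositionSlot(List<Integer> page, int savedKey, boolean backward) {
        int slot = page.indexOf(savedKey);
        if (slot >= 0) return slot;            // exact key still present
        int left = -1;                         // slot of last key < savedKey
        for (int i = 0; i < page.size(); i++)
            if (page.get(i) < savedKey) left = i;
        return backward ? left + 1 : left;     // backward: one slot right
    }

    public static void main(String[] args) {
        List<Integer> page = Arrays.asList(10, 20, 40, 50);
        // Saved key 30 has been purged from the page in the meantime.
        System.out.println(repositionSlot(page, 30, false)); // slot of 20
        System.out.println(repositionSlot(page, 30, true));  // slot of 40
    }
}
```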

          Some issues that could need improvement or more investigation:

          • When a latch conflict is detected, the position is saved, all
            latches released, and the operation is retried immediately. If the
            other thread hasn't released the latch yet, we'll run into the same
            latch conflict again, and we may loop a couple of times, saving the
            position and repositioning each time, before we can successfully
            latch the previous page. If we want to avoid this active waiting, we
            could refine this approach by doing an ordinary getPage() to latch the
            sibling page after we have released the latch on the current page, and
            wait until the page can be latched (and unlatched again) before we
            reposition. Since we don't hold the latch on the current page while
            waiting for the latch on the previous page, waiting here shouldn't
            cause any deadlocks.
          • The current implementation of fetchMax(), also before the patch,
            doesn't call unlockScanRecordAfterRead() to release read locks on
            unqualified records once we've moved past them. Other scans do that,
            and we should probably do that in max scans as well, but the current
            patch doesn't change this.
          • positionAtPreviousPage() was placed in the BTreeScan class since
            that's where positionAtNextPage() was. Probably these methods would be
            better placed in sub-classes for backward scan and forward scan. I
            think the reason why positionAtNextPage() was in that class, was that
            BTreeMaxScan used it when doing a max scan from the beginning of the
            index. Since the patch removed that code, the methods could be moved
            now. But I suppose that could rather be done as a cleanup later on.
          • Testing. We need tests that exercise the code paths that handle the
            latch conflicts. Some of these are very hard to test, especially the
            cases where the row at the current position has been purged in the
            small window between the saving of the current position and the
            repositioning. I have been able to exercise that code path by using
            a debugger to coordinate a forward scanner, a backward scanner and
            the B-tree post commit thread so that the row disappeared at exactly
            the right moment to make the backward scanner take that path. I have
            only been able to trigger this path for repositioning after a latch
            conflict, though.

          To exercise the corresponding path for lock conflicts, I think the
          backward scanner needs to be timed so that it ends up waiting for a
          lock held by the B-tree post commit thread. That's the only way I can
          see for a row to have been purged by the time the scanner wakes up
          after having been granted the lock on the row, but perhaps someone
          sees other ways this could happen? Apart from the obvious timing problem of
          getting the scanner queued up right behind the post commit thread, I
          think the window where the post commit thread holds locks without
          holding the latch on the page it's working on, is extremely small. The
          backward scanner would have to access the page in this small window,
          otherwise it would be blocked by the latch and not by the lock.

          Next, I will write some JUnit tests that exercise the new code, and
          also add some tracing so that we can verify that the code actually has
          been exercised.

          Feedback about the suggested approach would be greatly appreciated.
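          The NOWAIT latching protocol discussed above can be sketched with a
          small stand-alone model. This is a hypothetical illustration using
          ReentrantLock as a stand-in for Derby's page latches; the names Page,
          moveToLeftSibling and savedPosition are invented for the sketch and
          are not Derby's actual store interfaces.

```java
import java.util.concurrent.locks.ReentrantLock;

// Minimal model of moving backward one leaf page with NOWAIT latching.
public class BackwardStep {
    static final class Page {
        final ReentrantLock latch = new ReentrantLock();
        final int pageNumber;
        Page left;                       // left sibling on the leaf level
        Page(int n) { pageNumber = n; }
    }

    // Try to move from 'current' (latched by the caller) to its left
    // sibling. Returns the latched sibling on success, or null if the scan
    // had to give up its position and the caller must reposition from the
    // saved key.
    static Page moveToLeftSibling(Page current, int[] savedPosition) {
        Page sibling = current.left;
        if (sibling.latch.tryLock()) {   // NOWAIT latch request
            current.latch.unlock();
            return sibling;              // moved left, still latched
        }
        // Latch conflict: save the position, release our own latch, then
        // wait for the sibling latch while holding no latches at all, so
        // the wait cannot cause a latch deadlock. Unlatch again right away
        // and tell the caller to reposition, since the tree may have
        // changed while we were not holding any latch.
        savedPosition[0] = current.pageNumber;
        current.latch.unlock();
        sibling.latch.lock();            // blocking wait, no latches held
        sibling.latch.unlock();
        return null;
    }
}
```

          The key property of the refinement is visible in the conflict branch:
          the blocking lock() call happens only after every latch has been
          released, so the waiter can never participate in a latch deadlock.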

          Knut Anders Hatlen added a comment -

          This issue doesn't only affect the case where all the values in the rightmost leaf of the B-tree are deleted. Since NULL values sort higher than non-NULL values in Derby's B-trees, and NULL values are not considered by the MAX function, if there are enough rows with NULL in the indexed column to fill one page, the MAX scan won't find a maximum value in the rightmost leaf and will have to fall back to scanning the B-tree from the beginning.
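          The effect of that ordering can be seen in a small stand-alone sketch
          (plain Java, using Comparator.nullsLast to mimic Derby's NULL-highest
          index ordering; this is an illustration, not Derby code):

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Objects;

// In Derby's B-trees NULL sorts higher than any non-NULL value, so an
// index on the column ends with all the NULL keys. A max scan that only
// looks at the rightmost leaf may then see nothing but NULLs.
public class NullOrdering {
    public static void main(String[] args) {
        Integer[] keys = {3, null, 7, null, 5, null};

        // NULL-highest ordering, as in the index
        Arrays.sort(keys, Comparator.nullsLast(Comparator.naturalOrder()));
        System.out.println(Arrays.toString(keys));
        // prints [3, 5, 7, null, null, null]

        // MAX ignores NULLs, so the answer lives to the left of the NULL run
        Integer max = Arrays.stream(keys)
                            .filter(Objects::nonNull)
                            .max(Comparator.naturalOrder())
                            .orElse(null);
        System.out.println(max);         // prints 7
    }
}
```

          If the trailing run of NULL keys fills the whole rightmost leaf page,
          a scan confined to that page finds no candidate maximum at all.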

          Kathey Marsden added a comment -

          Removing from 10.2. See: http://www.nabble.com/10.2-High-Value-Fix-Candidates-and-Fix-Version-Adjustments-tf2007999.html
          Mike Matrigali added a comment -

          The current implementation was done as a way to get 90% for a very small amount of work - it was not a technical decision, simply a scheduling one.

          Option 1 would be the best technical decision, but it is also the
          hardest. Still, the store work is not extremely hard; some care just
          needs to be taken to handle the messy case where latching while going
          backward needs to wait. Basically, when going right to left you must
          request the latch NOWAIT, and if you can't get the latch you need a
          way to save your current position in the tree, release all current
          latches, and wait for the desired latch. Once you have the latch you
          have to refind your current position, as it may have moved (but I
          believe it can only have moved right of the page you are on in the
          current btree implementation - the implementer would need to verify that).
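          A minimal sketch of that refind step, under the assumption Mike states
          (a saved key can only have moved to a page further right); Page and
          refind are hypothetical names for the sketch, not Derby's interfaces:

```java
import java.util.ArrayList;
import java.util.List;

// After all latches were dropped, the saved key may have been moved to a
// page further right by a split, so the search restarts at the remembered
// page and walks right along the leaf level until the key is found.
public class Reposition {
    static final class Page {
        final List<Integer> keys = new ArrayList<>();
        Page right;                      // right sibling on the leaf level
    }

    // Returns the page now holding 'savedKey', walking right from 'start',
    // or null if the key was purged while no latch was held.
    static Page refind(Page start, int savedKey) {
        for (Page p = start; p != null; p = p.right) {
            if (p.keys.contains(savedKey)) {
                return p;                // position recovered
            }
        }
        return null;                     // caller must handle a purged key
    }
}
```

          The null return is the awkward case: the row was purged while we were
          waiting, and the scan has to decide where to resume instead.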

          If a real backward scan were implemented, then the next obvious step
          would be to teach the optimizer about it so that it could use it in
          plans. Doing it first for max would be a reasonable first step in a
          phased implementation approach. The second step would be to change the
          store interfaces to allow for a backward scan, and then finally change
          the language layer to actually use the backward scans.

          Option 3 would be an interesting extension to the "do what is easy"
          approach; it would be significantly easier than option 1.

          Not sure about option 2, it sounds deadlock prone - but could work if you
          paid the cost of more latches and less concurrency.


            People

            • Assignee: Knut Anders Hatlen
            • Reporter: Knut Anders Hatlen
            • Votes: 3
            • Watchers: 2