Derby
DERBY-713

CLONE - Query optimizer should not make poor choices when optimizing IN and WHERE clauses

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 10.0.2.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None
    • Environment: all

      Description

      Consider a simple case: a table tbl has 10000 rows, there is a primary key index on i1,
      and the query in question is

      select * from tbl where i1 in (-1, 100000)

      Derby does a table scan of the entire table, even though the "IN" list has only two values and the comparison is on an indexed column.

      Briefly looking at the code, it seems that we insert a BETWEEN and use the minimum and maximum of the IN list as the start and stop values for the scan. Thus the range of the values in the "IN" list plays an important role.

      Thus if the query were changed to select * from tbl where i1 in (-1, 1), an index scan would be chosen.

      It would be nice if we could do something clever in this case, where there is clearly an index on the field and the number of values in the IN list is known. Maybe use the row count estimate and the IN list size to do some optimizations:

      • Consider the length of the "IN" list when searching the table, i.e. use the IN list values to do index key lookups on the table;
      • or try to convert it to a join: use the "IN" list values to create a temporary table and join it to the base table. The optimizer will most likely choose the table built from the "IN" list as the outer table in the join, and thus do key lookups on the larger table.
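The trade-off suggested above (row count estimate versus IN-list size) can be sketched as a toy cost model. This is purely illustrative: the function name, cost constants, and `range_fraction` parameter (the portion of the table covered by the min/max range the current BETWEEN rewrite would scan) are hypothetical, not Derby's actual costing code.

```python
def choose_in_list_plan(in_list_size, table_rows, range_fraction,
                        probe_cost=1.0, scan_cost_per_row=0.1):
    """Illustrative cost comparison (all names and constants hypothetical).

    Compares doing one index probe per IN-list value against the range scan
    Derby currently derives from the min/max of the IN list.
    """
    probe_plan_cost = in_list_size * probe_cost
    range_scan_cost = table_rows * range_fraction * scan_cost_per_row
    return "index probes" if probe_plan_cost < range_scan_cost else "range scan"
```

For the example in this report, `in (-1, 100000)` makes the min/max range cover the whole 10000-row table (`range_fraction` near 1.0), so two index probes would win by a wide margin; the current rewrite instead scans everything.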

      -------------------------------------------------------------------
      Some query plans logged with derby.language.logQueryPlan=true for similar queries:

      The table has ascending values 0 to 9999 in i1, with a primary key index on i1.

      GMT Thread[UT0,5,main] (XID = 19941), (SESSIONID = 0), select * from scanfixed where i1 in (-1,9999,9998,9997,9996,9995,9994,9993,9992,9991,9990) ******* Project-Restrict ResultSet (2):
      Number of opens = 1
      Rows seen = 10000
      Rows filtered = 9990
      restriction = true
      projection = false
      constructor time (milliseconds) = 0
      open time (milliseconds) = 0
      next time (milliseconds) = 0
      close time (milliseconds) = 0
      restriction time (milliseconds) = 0
      projection time (milliseconds) = 0
      optimizer estimated row count: 750.38
      optimizer estimated cost: 8579.46

      Source result set:
      Table Scan ResultSet for SCANFIXED at read committed isolation level using instantaneous share row locking chosen by the optimizer
      Number of opens = 1
      Rows seen = 10000
      Rows filtered = 0
      Fetch Size = 16
      constructor time (milliseconds) = 0
      open time (milliseconds) = 0
      next time (milliseconds) = 0
      close time (milliseconds) = 0
      next time in milliseconds/row = 0

      scan information:
      Bit set of columns fetched=All
      Number of columns fetched=9
      Number of pages visited=417
      Number of rows qualified=10000
      Number of rows visited=10000
      Scan type=heap
      start position:
      null
      stop position:
      null
      qualifiers:
      Column[0][0] Id: 0
      Operator: <=
      Ordered nulls: false
      Unknown return value: false
      Negate comparison result: false
      Column[0][1] Id: 0
      Operator: <
      Ordered nulls: false
      Unknown return value: true
      Negate comparison result: true

      optimizer estimated row count: 750.38
      optimizer estimated cost: 8579.46

      ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      2004-10-14 18:59:47.577 GMT Thread[UT0,5,main] (XID = 19216), (SESSIONID = 0), select * from scanfixed where i1 in (9999,9998,9997,9996,9995,9994,9993,9992,9991,9990) ******* Project-Restrict ResultSet (3):
      Number of opens = 1
      Rows seen = 10
      Rows filtered = 0
      restriction = true
      projection = true
      constructor time (milliseconds) = 0
      open time (milliseconds) = 0
      next time (milliseconds) = 0
      close time (milliseconds) = 0
      restriction time (milliseconds) = 0
      projection time (milliseconds) = 0
      optimizer estimated row count: 4.80
      optimizer estimated cost: 39.53

      Source result set:
      Index Row to Base Row ResultSet for SCANFIXED:
      Number of opens = 1
      Rows seen = 10
      Columns accessed from heap =

      {0, 1, 2, 3, 4, 5, 6, 7, 8}

      constructor time (milliseconds) = 0
      open time (milliseconds) = 0
      next time (milliseconds) = 0
      close time (milliseconds) = 0
      optimizer estimated row count: 4.80
      optimizer estimated cost: 39.53

      Index Scan ResultSet for SCANFIXED using index SCANFIXEDX at read committed isolation level using instantaneous share row locking chosen by the optimizer
      Number of opens = 1
      Rows seen = 10
      Rows filtered = 0
      Fetch Size = 16
      constructor time (milliseconds) = 0
      open time (milliseconds) = 0
      next time (milliseconds) = 0
      close time (milliseconds) = 0
      next time in milliseconds/row = 0

      scan information:
      Bit set of columns fetched=All
      Number of columns fetched=2
      Number of deleted rows visited=0
      Number of pages visited=2
      Number of rows qualified=10
      Number of rows visited=10
      Scan type=btree
      Tree height=2
      start position:
      >= on first 1 column(s).
      Ordered null semantics on the following columns:

      stop position:
      > on first 1 column(s).
      Ordered null semantics on the following columns:

      qualifiers:
      None
      optimizer estimated row count: 4.80
      optimizer estimated cost: 39.53

          Activity

          A B added a comment -

          DERBY-47, of which this issue is a clone, has been closed. So I'm marking this issue as closed, too.

          A B added a comment -

          Based on the summary and comments for this issue, it sounds like this is a duplicate of DERBY-47. Since DERBY-47 is now in the "resolved" state (awaiting further feedback from anyone who might be able to try it out), I'm resolving this issue, as well. Resolution is "DUPLICATE".

          If this is not the correct action, people should feel free to post and/or modify as appropriate...

          Frank Karlstrøm added a comment -

          > I'm really surprised that this problem doesn't seem to be affecting a lot more people. Is it really the case that most users' queries are simple and
          > straight-forward enough not to be hit by this? Or are most Derby databases small enough to perform acceptably even when the optimizer makes poor
          > choices?

          I consider my database rather small and my query relatively simple, but I think I am affected by this issue.
          Explanation: I have two tables, aTable and bTable, with about 15 columns in aTable and 5 columns in bTable. bTable refers to aTable. aTable has about 3000 rows, while bTable has about 30000 rows.

          The following query is executed against these two tables:
          select a.col1, a.col2, a.col3, (select sum(b.col1)
          from bTable as b where b.aTable_id = a.id)
          from aTable as a
          where a.id = ####

          Both tables have appropriate indexes on a.id, b.id, and b.aTable_id.

          When this query is executed, the query plan indicates that a table scan across bTable is performed and all of its rows are visited, even though there are only about 20 bTable rows for each aTable reference.

          This query averages about 300 to 500 ms; if the indexes had been used, I would guess the time would be at most 30 ms.

          Alternatives for this query:
          • Keep an in-memory table of the sums for bTable and update it when needed. This cache would increase the memory requirements of my application and introduce extra maintenance and extra processing on insert/update/delete.
          • Rewrite the query to use joins and GROUP BY. I have tried this, with the same result: a table scan across bTable.
          • Create a view over the bTable sum(). Not tried; would this perhaps improve performance?
          • Create a flattened table for the bTable sums and update it with triggers. This would improve performance, but introduce extra maintenance overhead and extra processing on insert/update/delete.

          The best thing, of course, would be for this query to use the indexes. DERBY-573 has exactly what I need: a way to specify how the join should be performed.

          Other suggestions are welcome.
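The first workaround listed above (an application-side cache of per-aTable sums, maintained on every insert/update/delete) can be sketched in a few lines. This is an illustrative sketch of the idea only; the class and method names are hypothetical, and a real implementation would need to stay transactionally consistent with the database.

```python
class SumCache:
    """In-memory cache of sum(b.col1) per aTable id (illustrative).

    The application calls the on_* hooks alongside each bTable write,
    mirroring the extra insert/update/delete processing the comment describes.
    """

    def __init__(self):
        self.sums = {}  # aTable id -> running sum of b.col1

    def on_insert(self, a_id, value):
        self.sums[a_id] = self.sums.get(a_id, 0) + value

    def on_delete(self, a_id, value):
        self.sums[a_id] = self.sums.get(a_id, 0) - value

    def on_update(self, a_id, old_value, new_value):
        self.sums[a_id] = self.sums.get(a_id, 0) + (new_value - old_value)

    def get(self, a_id):
        return self.sums.get(a_id, 0)
```

This avoids the table scan entirely, at the cost of the extra memory and maintenance burden the comment already points out.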

          Daniel James Neades added a comment -

          That's a helpful suggestion, thank you. Unfortunately, it isn't really practicable when this problem affects a significant number of complex queries in an application, where the relevant IN clause is a small part of a much larger query (which may have other parts that would degrade performance significantly if executed multiple times in the separate branches of the unions), and where the number of terms in the IN clause is completely variable (in our case it is typically based on which items the user has selected in the application user interface, and in some cases could be dozens of items).

          The only way to apply this suggestion would be for us to have our application dynamically construct complex SQL statements. That would make debugging and maintenance much harder.

          I'm really surprised that this problem doesn't seem to be affecting a lot more people. Is it really the case that most users' queries are simple and straight-forward enough not to be hit by this? Or are most Derby databases small enough to perform acceptably even when the optimizer makes poor choices?

          Satheesh Bandaram added a comment -

          If possible, rewriting the query

          select * from tbl where i1 in (-1,100000)

          to
          select * from tbl where i1= -1
          UNION
          select * from tbl where i1=100000

          could improve performance. Derby's current OR/IN clause processing can be improved, but that would require a significant amount of work.
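For a variable-length IN list (the situation described in the comment above about dynamically constructed SQL), this UNION rewrite could be generated mechanically. A minimal sketch, with a hypothetical helper name and JDBC-style `?` parameter markers; real code would also validate the table and column identifiers:

```python
def in_list_as_union(table, column, n_values):
    """Build a UNION of single-value probes, one per IN-list entry.

    Each branch compares the indexed column against one parameter marker,
    so the optimizer can use an index probe per branch instead of the
    min/max range scan derived from the whole IN list.
    """
    branch = f"select * from {table} where {column} = ?"
    return "\nUNION\n".join([branch] * n_values)
```

For example, `in_list_as_union("tbl", "i1", 2)` produces the two-branch UNION shown above. When the IN-list values are distinct and the column is a unique key, UNION ALL would avoid an unnecessary duplicate-elimination step.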

          Daniel James Neades added a comment -

          Creating a clone probably wasn't the right thing to do, but given the extra information now added to DERBY-47, it should be of type Defect and not merely Improvement. My apologies if I've done the wrong thing; perhaps a JIRA admin can sort this out and change DERBY-47's type and summary?

          The extra comments on DERBY-47 describe how the query optimizer is making very poor choices when dealing with IN clauses with multiple terms (and equivalent "WHERE thing=x OR thing=y OR thing=z" expressions). This makes Derby performance very poor, even for some simple queries.

          In addition, the optimizer seems to use inappropriate indexes in certain circumstances (again, described in comments added to DERBY-47), meaning that performance can degrade significantly merely through the presence of an additional index on a table. This means that adding indexes to a table in an attempt to improve one query can unexpectedly degrade the performance of other queries. I believe this should be considered a major defect.


            People

            • Assignee: A B
            • Reporter: Daniel James Neades
            • Votes: 6
            • Watchers: 2

              Dates

              • Created:
                Updated:
                Resolved:
