Derby
DERBY-5584

Select statement with subqueries with group by and count distinct statements returns wrong number of results

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 10.6.2.4, 10.7.1.1
    • Fix Version/s: 10.8.3.0, 10.9.1.0
    • Component/s: SQL
    • Environment:
    • Urgency:
      Urgent
    • Issue & fix info:
      High Value Fix, Repro attached
    • Bug behavior facts:
      Deviation from standard, Regression, Wrong query result

      Description

      Steps to reproduce:

      1. Create a database and connect to it with any JDBC client

      2. Create two tables:

      CREATE TABLE TEST_5 (
      profile_id INTEGER NOT NULL,
      group_ref INTEGER NOT NULL,
      matched_count INTEGER NOT NULL
      );

      CREATE TABLE TEST_6 (
      profile_id INTEGER NOT NULL,
      group_ref INTEGER NOT NULL,
      matched_count INTEGER NOT NULL
      );

      3. Insert two records into each table:

      insert into test_5 values (1, 10000, 1);
      insert into test_5 values (2, 10000, 2);

      insert into test_6 values (1, 10000, 1);
      insert into test_6 values (2, 10000, 2);

      4. Run the following statement:

      SELECT *
      FROM
      (SELECT ps1.group_ref,
      COUNT(DISTINCT ps1.matched_count) AS matched_count
      FROM test_5 ps1
      GROUP BY ps1.group_ref,
      ps1.profile_id
      ) a,
      (SELECT ps2.group_ref,
      COUNT( DISTINCT ps2.matched_count) AS matched_count
      FROM test_6 ps2
      GROUP BY ps2.group_ref,
      ps2.profile_id
      ) b

      As a result I've got 3 records instead of 4 - at least Oracle 10g
      returns 4 records for this statement. Maybe I'm doing something wrong.
      Do you have any suggestions / possible workarounds for this problem?
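      As a quick cross-check of the expected standard-SQL result, the repro can be replayed in SQLite via Python (sqlite3 is only a stand-in for Derby here, but COUNT(DISTINCT ...) and GROUP BY behave the same for this query):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
for t in ("test_5", "test_6"):
    cur.execute(f"CREATE TABLE {t} (profile_id INTEGER NOT NULL, "
                "group_ref INTEGER NOT NULL, matched_count INTEGER NOT NULL)")
    cur.executemany(f"INSERT INTO {t} VALUES (?, ?, ?)",
                    [(1, 10000, 1), (2, 10000, 2)])

# Each subquery produces two (10000, 1) rows, so the cartesian
# product should contain 2 x 2 = 4 rows.
rows = cur.execute("""
    SELECT *
    FROM (SELECT ps1.group_ref,
                 COUNT(DISTINCT ps1.matched_count) AS matched_count
            FROM test_5 ps1
           GROUP BY ps1.group_ref, ps1.profile_id) a,
         (SELECT ps2.group_ref,
                 COUNT(DISTINCT ps2.matched_count) AS matched_count
            FROM test_6 ps2
           GROUP BY ps2.group_ref, ps2.profile_id) b
""").fetchall()

print(len(rows))  # 4 rows, each (10000, 1, 10000, 1)
```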

      1. patch2.txt
        13 kB
        Bryan Pendleton
      2. patch1.txt
        12 kB
        Bryan Pendleton
      3. tests.out
        9 kB
        Bryan Pendleton
      4. tests.sql
        4 kB
        Bryan Pendleton
      5. try1.txt
        3 kB
        Bryan Pendleton
      6. query.log
        190 kB
        Dag H. Wanvik

        Issue Links

          Activity

          Knut Anders Hatlen added a comment -

          It looks like Derby returned 4 rows for the query up to version 10.5. From 10.6, it started returning 3 rows.

          Knut Anders Hatlen added a comment -

          The query result changed from 4 rows to 3 rows at revision 824966, which was when DERBY-3002 (Add support for GROUP BY ROLLUP) went in.

          Bryan Pendleton added a comment -

          Thanks for narrowing this down. Unfortunately nothing immediately jumps to mind and it's been rather a while since I worked on that code ...

          Perhaps we might get some clues by comparing the query plans and statistics from the 10.5 execution against the query plans and statistics from the 10.6 execution.

          As I recall when working with that part of the system, there were often problems where the optimizer's manipulation of the query nodes got confused about the internal table number and column number references. For example, if at some crucial point we thought we were working with table test_5 but we were actually working with table test_6...

          Bryan Pendleton added a comment -

          Not only are there 3 rows instead of 4, but the 3rd row has strange values. When I run the query I see:

          GROUP_REF |MATCHED_CO&|GROUP_REF |MATCHED_CO&
          -----------------------------------------------
          10000 |1 |10000 |1
          10000 |1 |10000 |1
          10000 |1 |10000 |2

          Given that the two subqueries, separately, each return

          GROUP_REF |MATCHED_CO&
          -----------------------
          10000 |1
          10000 |1

          I think that the resulting cartesian product 4-row answer should have been:

          GROUP_REF |MATCHED_CO&|GROUP_REF |MATCHED_CO&
          -----------------------------------------------
          10000 |1 |10000 |1
          10000 |1 |10000 |1
          10000 |1 |10000 |1
          10000 |1 |10000 |1

          That is, it doesn't seem like we're losing a row; rather, it seems like we're combining two rows into one.

          So perhaps the problem has something to do with temporary data management; that is, perhaps during the execution of the cartesian product we are closing and re-opening the inner table-expression, and when we do that, instead of re-computing and returning two rows (10000,1), (10000,1), we are instead returning a single row (10000,2).

          I tried running a slight variation of the problematic query:

          SELECT *
          FROM
          (SELECT ps1.group_ref, ps1.profile_id,
          COUNT(DISTINCT ps1.matched_count) AS matched_count
          FROM test_5 ps1
          GROUP BY ps1.group_ref,
          ps1.profile_id
          ) a,
          (SELECT ps2.group_ref, ps2.profile_id,
          COUNT( DISTINCT ps2.matched_count) AS matched_count
          FROM test_6 ps2
          GROUP BY ps2.group_ref,
          ps2.profile_id
          ) b

          (note that I included 'profile_id' explicitly in the select list of each sub-query), and the results were:

          GROUP_REF |PROFILE_ID |MATCHED_CO&|GROUP_REF |PROFILE_ID |MATCHED_CO&
          -----------------------------------------------------------------------
          10000 |1 |1 |10000 |1 |1
          10000 |1 |1 |10000 |2 |1
          10000 |2 |1 |10000 |1 |2

          I also extended the test case a bit by additionally doing:

          insert into test_6 values (3, 10000, 3 );

          after which the query produced:

          GROUP_REF |PROFILE_ID |MATCHED_CO&|GROUP_REF |PROFILE_ID |MATCHED_CO&
          -----------------------------------------------------------------------
          10000 |1 |1 |10000 |1 |1
          10000 |1 |1 |10000 |2 |1
          10000 |1 |1 |10000 |3 |1
          10000 |2 |1 |10000 |1 |3

          and then

          insert into test_6 values (4, 10000, 4 );

          after which the query produced:

          GROUP_REF |PROFILE_ID |MATCHED_CO&|GROUP_REF |PROFILE_ID |MATCHED_CO&
          -----------------------------------------------------------------------
          10000 |1 |1 |10000 |1 |1
          10000 |1 |1 |10000 |2 |1
          10000 |1 |1 |10000 |3 |1
          10000 |1 |1 |10000 |4 |1
          10000 |2 |1 |10000 |1 |4

          And, lastly,

          insert into test_5 values (3, 10000, 3);

          and got

          GROUP_REF |PROFILE_ID |MATCHED_CO&|GROUP_REF |PROFILE_ID |MATCHED_CO&
          -----------------------------------------------------------------------
          10000 |1 |1 |10000 |1 |1
          10000 |1 |1 |10000 |2 |1
          10000 |1 |1 |10000 |3 |1
          10000 |1 |1 |10000 |4 |1
          10000 |2 |1 |10000 |1 |4
          10000 |3 |1 |10000 |4 |4

          By this point it becomes quite clear that the problem is that, on the first pass through the inner table,
          the cartesian product observes all the rows of the inner table, but on all subsequent passes, the
          inner table's rows are collapsed into a single row (perhaps because we are losing track of
          the 'profile_id' column in that inner table somehow?)

          I'm not sure how much more time I'll have to play with this right away, but wanted to record these findings.

          Dag H. Wanvik added a comment -

          Attaching a dump of Bryan's last query with a dump of

          a) parse tree
          b) bind tree
          c) optimized tree
          d) query plan

          with object hashes renumbered from 1 upwards and redundant node printing truncated.

          I have not analyzed it.

          Bryan Pendleton added a comment -

          Thanks Dag for gathering the output!

          Bryan Pendleton added a comment -

          If you remove DISTINCT from the COUNT() aggregates, the problem disappears (from what I can tell).

          I think it's possible that the problem involves this section of GroupedAggregateResultSet:

          /*
           * If there was a distinct aggregate, then that column
           * was automatically included as the last column in
           * the sort ordering. But we don't want it to be part
           * of the ordering anymore, because we aren't grouping
           * by that column, we just sorted it so that distinct
           * aggregation would see the values in order.
           */

          This part was quite tricky, and I remember struggling with it for a long time.

          I think that the solution that is currently implemented in GroupedAggregateResultSet
          may be assuming that the result set is only opened and read once.

          However, during a query plan such as this cartesian product, the GROUP BY
          subquery is created, then opened/read/closed, opened/read/closed, etc.,
          once per row of the other side of the cartesian product.

          Perhaps what we need to do is have a better way of handling that extra
          invisible column, so that we can consider it sometimes, and ignore it other times,
          without doing something as destructive as physically removing it from the
          ordering array, which is what we do now.
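
          Bryan's hypothesis can be illustrated with a toy model (purely hypothetical names; this is not Derby's actual code): if the ordering is a shared array and the "extra" distinct column is physically popped off on every open, a second open strips a genuine grouping column and collapses the groups.

```python
# Toy model of the suspected failure mode (hypothetical names; NOT
# Derby's actual code).  The distinct-aggregate column is appended to
# the ordering so the sort delivers distinct values in order, then
# destructively popped off the shared ordering array on every open.
class ToyGroupedAggregate:
    def __init__(self, rows, group_cols, distinct_col):
        self.rows = rows
        self.distinct_col = distinct_col
        # ordering = grouping columns + the extra distinct column
        self.ordering = list(group_cols) + [distinct_col]

    def open_and_read(self):
        # sort by the full ordering, distinct column last
        ordered = sorted(self.rows,
                         key=lambda r: [r[c] for c in self.ordering])
        # BUG: physically remove the "extra" column; correct only on
        # the first open -- a re-open strips a real grouping column
        self.ordering.pop()
        groups = {}
        for row in ordered:
            key = tuple(row[c] for c in self.ordering)
            groups.setdefault(key, set()).add(row[self.distinct_col])
        return [key + (len(vals),) for key, vals in groups.items()]

rows = [{"group_ref": 10000, "profile_id": 1, "matched_count": 1},
        {"group_ref": 10000, "profile_id": 2, "matched_count": 2}]
rs = ToyGroupedAggregate(rows, ["group_ref", "profile_id"], "matched_count")
first = rs.open_and_read()    # 2 groups: (10000, 1, 1), (10000, 2, 1)
second = rs.open_and_read()   # collapsed to 1 group: (10000, 2)
```

          The second open reproduces the collapsed (10000, 2) row observed above, which is why a non-destructive way of ignoring the extra column looks necessary.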

          Bryan Pendleton added a comment -

          Attached 'try1.txt' contains a patch which alters the behavior
          of this particular test case.

          I haven't run any of the existing tests, nor have I written any
          new tests; I just attached this patch to record the progress I've
          made so far.

          I'm not sure where to find more examples of DISTINCT
          aggregates. I recall looking for such examples years ago, when
          I was first working on GROUP BY, and didn't find many. Does
          anybody have any good ideas about where to find more
          DISTINCT aggregate queries that we could include in our
          test suites?

          Bryan Pendleton added a comment -

          The existing GroupByTest and the other GroupBy-related
          tests in the lang suite appear to be passing with the patch
          applied. So I'm encouraged, and planning to move ahead.

          Attached (in tests.sql, tests.out) are the new tests I'm
          intending to add to GroupByTest.java, and the results
          I'm currently getting (with try1.txt applied).

          Does anybody have the time to look through tests.sql
          and tests.out and offer an opinion about whether the
          results as shown are correct or not?

          Bryan Pendleton added a comment -

          Attached 'patch1' is the code change + the tests,
          packaged into a patch file.

          Still haven't run the whole regression suite, just GroupByTest.

          Dag H. Wanvik added a comment -

          I looked through the sample test.{sql,out} and the result seems to
          implement the correct SQL semantics to me.

          The last two queries show the (1,2,1) vs (2,3,1) results I would
          expect from COUNT DISTINCT vs plain COUNT in the cartesian product.

          Dag H. Wanvik added a comment -

          Thanks for tackling this Bryan! Looked at the patch and what you do seems reasonable to me. I am puzzled by the comment:

          // Although it seems like N aggs could have been
          // added at the end, in fact only one has been
          // FIXME – need to get GroupByNode to handle this
          // correctly, but that requires understanding
          // scalar distinct aggregates.

          Can you throw some light on this? Is this an implementation limitation? If so, what kind of query would be affected by it?

          Dag H. Wanvik added a comment -

          For the record, I also verified that the new test fixture fails without the code part of the patch.

          Bryan Pendleton added a comment -

          Thanks for the review, Dag. I'm cautiously optimistic about the patch and will continue testing it.

          Regarding the distinct limitation, Derby has had a limitation that there be at most one DISTINCT
          aggregate in a query for a long time, probably ever since it was written. See, for example, this
          link from the 10.2 docs: http://db.apache.org/derby/docs/10.2/ref/rrefsqlj32693.html

          Only one DISTINCT aggregate expression per SelectExpression is allowed.
          For example, the following query is not valid:

          SELECT AVG (DISTINCT flying_time), SUM (DISTINCT miles) FROM Flights

          I'm guessing here, but I suspect that the plan for implementing such queries would involve
          optimizer work above the GroupedAggregateResultSet. In particular, the optimizer could
          (perhaps) implement the above query by doing:

          1) Fetch all of Flights, and pass it through a GroupedAggregateResultSet which (a) sorts
          by flying_time, (b) eliminates duplicate values, and (c) computes the average of the
          unique flying_time values
          2) Fetch all of Flights, and pass it through a GroupedAggregateResultSet which (a) sorts
          by miles, (b) eliminates duplicate values, and (c) computes the sum of the unique miles values.
          3) Combine the two scalar values into the final result

          Trying to do all of this in a single pass through the Flights data is complex, because you
          can't simultaneously sort the data on flying_time AND on miles, so you'd have to have some
          other way of eliminating duplicates, other than sorting.

           For example, we could have some sort of low-level HashMap collection to record the
           unique values, and have a hybrid algorithm which sorted on one distinct aggregate and
           kept hash collections of the other distinct aggregates (assuming all those hashes fit in memory).

          I think that, in practice, those databases which implement multiple DISTINCT aggregates
          do so by having higher-level query plans which take multiple passes over the data.
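
           The single-pass, hash-based hybrid described above can be sketched like this (the data is made up, and Derby itself rejects more than one DISTINCT aggregate per SELECT):

```python
# One-pass evaluation of AVG(DISTINCT flying_time) and
# SUM(DISTINCT miles) using in-memory hash sets: a single scan of the
# data, with one set per distinct aggregate to drop duplicates.
flights = [
    {"flying_time": 2.0, "miles": 500},
    {"flying_time": 2.0, "miles": 750},
    {"flying_time": 3.5, "miles": 500},
]

distinct_times, distinct_miles = set(), set()
for row in flights:                       # one scan of the data
    distinct_times.add(row["flying_time"])
    distinct_miles.add(row["miles"])

avg_distinct_time = sum(distinct_times) / len(distinct_times)  # 2.75
sum_distinct_miles = sum(distinct_miles)                       # 1250
```

           This sidesteps the sorting conflict entirely, at the cost of keeping every distinct value in memory, which is where the estimate-dependent fallback below comes in.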

          In all my time with Derby, I can't recall anyone ever complaining about the DISTINCT
          aggregate limitation. Moreover, in my use of databases I haven't found occasion to regularly
          use DISTINCT aggregates. For whatever that's worth.

          Thanks again for the review and feedback; I'll continue testing and working on the patch
          (I'm out of town for a week so it may be a little while).

          Dag H. Wanvik added a comment -

          Thanks for that info and your continued work on this, Bryan!

          As for the implementation strategy of such an improvement, hash lookup
          is interesting. Would BackingStoreHashTable be useful here perhaps?
          The optimizer could then use the hash approach when it thinks the
          distinct values will fit in memory but still work, albeit slowly, if
           it got the estimate wrong. The multiple sort/pass strategy would be used
           if the values were deemed not to fit. An optimizer override could
           be used if the optimizer gets it all wrong, as a last resort.

          In any case, it may not be worth the effort if nobody needs this.

          Mamta A. Satoor added a comment -

          Hi Bryan, thanks for your work on this jira.

          I haven't worked in code related to DISTINCT and GROUP BY, but your solution looks like the correct approach. It looks like Derby has made an incorrect assumption about the extra column when there are multiple passes through the inner table.

          I am working on DERBY-4631 and noticed that an extra column gets pulled in (which might not be part of the SELECT list) for ORDER BY columns which sounds similar to the extra column for this jira.

          Bryan Pendleton added a comment -

          Thanks for looking at the patch, Mamta. I agree, the extra column that may be
          added by ORDER BY is very similar to the GROUP BY case. I think I had that
          ORDER BY behavior in mind when I was working on the GROUP BY code in 2009.

          One difference is that, I think, ORDER BY can only occur at the top level of a
          query, not in a nested sub-query, whereas GROUP BY can occur deeply nested.

          So I don't think that ORDER BY result sets will get re-opened and re-read
          in the same way that this GROUP BY query runs.

          I will take a look at DERBY-4631 and let you know if I have any interesting observations.

          Dag H. Wanvik added a comment -

          Not any longer; ORDER BY can now happen in subqueries too, after DERBY-4397. So,
          could we have a similar issue there, I wonder?

          Dag H. Wanvik added a comment -

          I tested it on this data set with a similar nested query and it seems to work ok:

          SELECT * FROM (SELECT ps1.group_ref, ps1.profile_id FROM test_5 ps1 ORDER BY profile_id) a, (SELECT ps2.group_ref, ps2.profile_id FROM test_6 ps2 ORDER BY PROFILE_ID) b ;

          GROUP_REF |PROFILE_ID |GROUP_REF |PROFILE_ID
          -----------------------------------------------
          10000 |1 |10000 |1
          10000 |1 |10000 |1
          10000 |1 |10000 |2
          10000 |1 |10000 |2
          10000 |1 |10000 |2
          10000 |1 |10000 |3
          etc.

          Kathey Marsden added a comment -

          Triage for 10.9. Moved component to SQL and checked appropriate boxes.
          I wonder. Is this a regression?

          Thanks Bryan for working on this issue.

          Bryan Pendleton added a comment -

          Hi Kathey! Yes, I believe this is a regression. This query, and similar ones, worked correctly
          prior to my work on ROLLUP. I checked the "regression" box.

          Mike Matrigali added a comment -

          I ran the full set of tests with the patch and it passed on Windows XP, ibm16, with no errors.

          Bryan Pendleton added a comment -

          Thanks for running the full test suite, Mike; it's very helpful to know
          that the patch hasn't introduced any other regressions.

          The last thing I'm intending to do prior to commit is to add the extra
          ORDER BY subquery test that Dag mentioned in an earlier comment.

          I intend to commit this patch this week.

          Dag H. Wanvik added a comment -

          Note that a nested ORDER BY, as in my example, is rather meaningless on its own: there is no guarantee the final result will be sorted in any particular way without an ORDER BY on the outer SELECT. It would only make sense combined with FETCH FIRST n ROWS, which would limit the table rows going into the join. So perhaps it would be good to add that to the test case.
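The point above — that a subquery ORDER BY only carries meaning when paired with a row limit, because it then decides which rows feed the join — can be sketched with SQLite via Python's sqlite3 (SQLite spells the limit LIMIT n where Derby spells it FETCH FIRST n ROWS ONLY; the table and data mirror the repro, with one extra row added for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE test_5 (profile_id INTEGER NOT NULL, "
            "group_ref INTEGER NOT NULL, matched_count INTEGER NOT NULL)")
cur.executemany("INSERT INTO test_5 VALUES (?, ?, ?)",
                [(1, 10000, 1), (2, 10000, 2), (3, 10000, 3)])

# With ORDER BY + a row limit, the subquery's sort is meaningful: it decides
# WHICH row survives into the join, not merely the (unguaranteed) order of
# the outer result. DESC keeps the highest profile_id, ASC the lowest.
rows = cur.execute("""
    SELECT a.profile_id, b.profile_id
    FROM (SELECT profile_id FROM test_5 ORDER BY profile_id DESC LIMIT 1) a,
         (SELECT profile_id FROM test_5 ORDER BY profile_id ASC  LIMIT 1) b
""").fetchall()
print(rows)  # [(3, 1)]
```

Without the limits, an optimizer is free to drop or ignore the inner sorts entirely, which is why a sorted-subquery test without FETCH FIRST asserts nothing about ordering.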

          Bryan Pendleton added a comment -

          Updated the patch with an additional ORDER BY subquery test.

          Also cleaned up the whitespace in the diff and added
          a simple comment for the new helper method.

          Dag H. Wanvik added a comment -

          Thanks, Bryan! This looks like a good improvement; the changes are small and systematic, and make sense to me. +1

          Bryan Pendleton added a comment -

          Thanks, all, for the reviews, suggestions, and assistance in running tests
          and preparing the patch. It's been very helpful, especially as I don't have
          much time available to work on Derby nowadays.

          Committed the patch to the trunk as revision 1292134.

          Kathey Marsden added a comment -

          Merged to 10.8 with revision 1300625.

          Kathey Marsden added a comment -

          This issue probably affects 10.6 and 10.7, as the ROLLUP work went into 10.6. It would be a good candidate to backport to those releases, but it does not affect 10.5, so I am labeling it derby_backport_reject_10_5.

          Knut Anders Hatlen added a comment -

          [bulk update: close all resolved issues that haven't had any activity the last year]


            People

            • Assignee:
              Bryan Pendleton
              Reporter:
              Piotr Zgadzaj
            • Votes:
              0
              Watchers:
              1
