Derby
  1. Derby
  2. DERBY-6008

Allow ORDER BY and FETCH/OFFSET in set operands

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 10.10.1.1
    • Component/s: SQL
    • Labels:
      None
    • Bug behavior facts:
      Deviation from standard

      Description

      Currently, Derby doesn't allow ORDER BY nested in a set operand, e.g. in the following construct:

      (select i from t1 order by j offset 1 row) union
      (select i from t2 order by j desc offset 2 rows)

      This is allowed by the standard, as far as I can understand, cf. this quote from section 7.12 in SQL 2011:

      <query expression body> ::=
      <query term>

      <query expression body> UNION [ ALL DISTINCT ]
      [ <corresponding spec> ] <query term>
      <query expression body> EXCEPT [ ALL DISTINCT ]
      [ <corresponding spec> ] <query term>

      <query term> ::=
      <query primary>

      <query term> INTERSECT [ ALL DISTINCT ]
      [ <corresponding spec> ] <query primary>

      <query primary> ::=
      <simple table>

      <left paren> <query expression body>
      [ <order by clause> ] [ <result offset clause> ] [ <fetch first clause> ] <right paren>

      I.e. the left paren chooses the second alternative in the production for <query primary>.

      1. derby-6008-e.stat
        0.8 kB
        Dag H. Wanvik
      2. derby-6008-e.diff
        46 kB
        Dag H. Wanvik
      3. derby-6008-d.stat
        0.8 kB
        Dag H. Wanvik
      4. derby-6008-d.diff
        46 kB
        Dag H. Wanvik
      5. derby-6008-c.stat
        0.6 kB
        Dag H. Wanvik
      6. derby-6008-c.diff
        15 kB
        Dag H. Wanvik
      7. derby-6008-b.stat
        0.4 kB
        Dag H. Wanvik
      8. derby-6008-b.diff
        9 kB
        Dag H. Wanvik
      9. derby-6008-a.stat
        0.4 kB
        Dag H. Wanvik
      10. derby-6008-a.diff
        8 kB
        Dag H. Wanvik
      There are no Sub-Tasks for this issue.

        Activity

        Hide
        Dag H. Wanvik added a comment -

        Uploading a patch that makes this construction work, running regressions.

        Show
        Dag H. Wanvik added a comment - Uploading a patch that makes this construction work, running regressions.
        Hide
        Dag H. Wanvik added a comment -

        Regressions ran without errors.

        Show
        Dag H. Wanvik added a comment - Regressions ran without errors.
        Hide
        Dag H. Wanvik added a comment - - edited

        Uploaded a slightly improved version of the patch, "*-b", (reintroduced a sane assertion I first removed erroneously).

        Patch details:

        M java/engine/org/apache/derby/impl/sql/compile/sqlgrammar.jj

        Extends the production "nonJoinQueryPrimary" to allow "orderByClause" and "offsetFetchFirstClause".
        If we see one ore more of the above and the query expression is a SELECT (i.e. not a VALUES clause), we push the clauses into the SelectNode, else give an error message.

        M java/engine/org/apache/derby/impl/sql/compile/SelectNode.java

        Up till now, the clauses haven't been pushed down to SelectNode after binding; being collected by other node types, e.g. CursorNode and InsertNode, so we add logic to SelectNode to cater for the fact we now need to make binding happen of the clauses in that node type, too. Additionally, we do the shenanigans of hoisting up order by columns into the select if not already there, and remove duplicates after binding.

        M java/engine/org/apache/derby/impl/sql/compile/SetOperatorNode.java

        Any added hoisted order by column to the operand select node must be removed from the copied list when we build the result column list for thopse nodes.

        M java/engine/org/apache/derby/impl/sql/compile/ResultColumnList.java

        Extended two comments to reflect new reality.

        M java/testing/org/apache/derbyTesting/functionTests/tests/lang/OrderByAndOffsetFetchInSubqueries.java

        Added test fixture testNestingInsideSetOperation.

        Show
        Dag H. Wanvik added a comment - - edited Uploaded a slightly improved version of the patch, "*-b", (reintroduced a sane assertion I first removed erroneously). Patch details: M java/engine/org/apache/derby/impl/sql/compile/sqlgrammar.jj Extends the production "nonJoinQueryPrimary" to allow "orderByClause" and "offsetFetchFirstClause". If we see one ore more of the above and the query expression is a SELECT (i.e. not a VALUES clause), we push the clauses into the SelectNode, else give an error message. M java/engine/org/apache/derby/impl/sql/compile/SelectNode.java Up till now, the clauses haven't been pushed down to SelectNode after binding; being collected by other node types, e.g. CursorNode and InsertNode, so we add logic to SelectNode to cater for the fact we now need to make binding happen of the clauses in that node type, too. Additionally, we do the shenanigans of hoisting up order by columns into the select if not already there, and remove duplicates after binding. M java/engine/org/apache/derby/impl/sql/compile/SetOperatorNode.java Any added hoisted order by column to the operand select node must be removed from the copied list when we build the result column list for thopse nodes. M java/engine/org/apache/derby/impl/sql/compile/ResultColumnList.java Extended two comments to reflect new reality. M java/testing/org/apache/derbyTesting/functionTests/tests/lang/OrderByAndOffsetFetchInSubqueries.java Added test fixture testNestingInsideSetOperation.
        Hide
        Knut Anders Hatlen added a comment -

        I tested some queries in ij, and the patch seems to do the right thing.

        The syntax check in the set operations seems to be a little stricter than in a top-level query, as we accept

        VALUES 3, 2, 1 ORDER BY 1

        but not

        (VALUES 3, 2, 1 ORDER BY 1) UNION VALUES 4

        If the former is valid SQL, we should probably allow the latter too, but I haven't checked what the standard says.

        Apart from that, it looks fine to me.

        And a tiny nit, feel free to ignore: In the error checking added to nonJoinQueryPrimary(), the two inner if statement levels could probably be flattened to a single level, which might make it a little clearer.

        Show
        Knut Anders Hatlen added a comment - I tested some queries in ij, and the patch seems to do the right thing. The syntax check in the set operations seems to be a little stricter than in a top-level query, as we accept VALUES 3, 2, 1 ORDER BY 1 but not (VALUES 3, 2, 1 ORDER BY 1) UNION VALUES 4 If the former is valid SQL, we should probably allow the latter too, but I haven't checked what the standard says. Apart from that, it looks fine to me. And a tiny nit, feel free to ignore: In the error checking added to nonJoinQueryPrimary(), the two inner if statement levels could probably be flattened to a single level, which might make it a little clearer.
        Hide
        Dag H. Wanvik added a comment - - edited

        Thanks, Knut. I agree. Uploaded a version "c" of the patch which opens up for VALUES also. As far as I can see it's valid SQL. I kept a check on the node types in nonJoinQueryPrimary, albeit simplified in structure as you suggested, to guard against future extensions, although that's probably unnecessary now (checking for select, union or rowresultset nodes; the two latter used for VALUES).

        Added suitable bind-time operations for SetOperatorNode and RowResultSetNode. Note the boolean flag "dontRemoveOrderingColumns" which guards against removal when the node isn't really a UNION but represents the top level for a VALUES clause with an ORDER BY.

        Added more tests for the new cases. Rerunning regressions.

        Show
        Dag H. Wanvik added a comment - - edited Thanks, Knut. I agree. Uploaded a version "c" of the patch which opens up for VALUES also. As far as I can see it's valid SQL. I kept a check on the node types in nonJoinQueryPrimary, albeit simplified in structure as you suggested, to guard against future extensions, although that's probably unnecessary now (checking for select, union or rowresultset nodes; the two latter used for VALUES). Added suitable bind-time operations for SetOperatorNode and RowResultSetNode. Note the boolean flag "dontRemoveOrderingColumns" which guards against removal when the node isn't really a UNION but represents the top level for a VALUES clause with an ORDER BY. Added more tests for the new cases. Rerunning regressions.
        Hide
        Bryan Pendleton added a comment -

        This must be a (relatively) recent enhancement to SQL; my decades old memory of the overall
        structure of an SQL query was that ORDER BY was the "outermost" or final step in query processing.

        I have these vague memories of various relational "purists" even arguing that the theoretical reason
        for this was that ORDER BY was not part of the relational theory proper, but rather was just something
        that user interfaces would do for display purposes.

        That is, all processing was supposed to be done with mathematically pure "sets" of data, and so
        it was supposed to make no sense to ORDER BY an intermediate result.

        I guess what I'm trying to figure out is why you would ever want to ORDER BY a portion of your query
        results, rather than ordering the final result.

        Are there queries that will now return different results, as mathematical sets? Or is this just a way
        to sort some, but not all, of your data?

        Show
        Bryan Pendleton added a comment - This must be a (relatively) recent enhancement to SQL; my decades old memory of the overall structure of an SQL query was that ORDER BY was the "outermost" or final step in query processing. I have these vague memories of various relational "purists" even arguing that the theoretical reason for this was that ORDER BY was not part of the relational theory proper, but rather was just something that user interfaces would do for display purposes. That is, all processing was supposed to be done with mathematically pure "sets" of data, and so it was supposed to make no sense to ORDER BY an intermediate result. I guess what I'm trying to figure out is why you would ever want to ORDER BY a portion of your query results, rather than ordering the final result. Are there queries that will now return different results, as mathematical sets? Or is this just a way to sort some, but not all, of your data?
        Hide
        Knut Anders Hatlen added a comment -

        Hi Bryan,

        I think ORDER BY in sub-queries was added in SQL:2008, so you're right it's quite new. And you're also right that ORDER BY doesn't make much sense on its own in a sub-query of a set operation, as the top-level set operation doesn't guarantee that the ordering of the sub-queries is preserved in the final result. However, in combination with FETCH and OFFSET clauses, it starts making sense. If you for example want to see a list of all countries that are among the 10 largest countries in the world both in area and in population, you might be able to express that as:

        (SELECT NAME FROM COUNTRIES ORDER BY POPULATION DESC FETCH FIRST 10 ROWS ONLY)
        INTERSECT
        (SELECT NAME FROM COUNTRIES ORDER BY AREA DESC FETCH FIRST 10 ROWS ONLY)

        Here, the ORDER BY clauses do matter, as FETCH FIRST needs ordering in order to tell what FIRST actually means. Without the ORDER BY clauses, each sub-query would return no more than ten rows, but it would be undefined which ten rows.

        Show
        Knut Anders Hatlen added a comment - Hi Bryan, I think ORDER BY in sub-queries was added in SQL:2008, so you're right it's quite new. And you're also right that ORDER BY doesn't make much sense on its own in a sub-query of a set operation, as the top-level set operation doesn't guarantee that the ordering of the sub-queries is preserved in the final result. However, in combination with FETCH and OFFSET clauses, it starts making sense. If you for example want to see a list of all countries that are among the 10 largest countries in the world both in area and in population, you might be able to express that as: (SELECT NAME FROM COUNTRIES ORDER BY POPULATION DESC FETCH FIRST 10 ROWS ONLY) INTERSECT (SELECT NAME FROM COUNTRIES ORDER BY AREA DESC FETCH FIRST 10 ROWS ONLY) Here, the ORDER BY clauses do matter, as FETCH FIRST needs ordering in order to tell what FIRST actually means. Without the ORDER BY clauses, each sub-query would return no more than ten rows, but it would be undefined which ten rows.
        Hide
        Knut Anders Hatlen added a comment -

        Dag, I applied the "c" version of the patch and ran some tests in ij. This particular sequence didn't look quite right to me:

        ij version 10.10
        ij> connect 'jdbc:derby:memory:db;create=true';
        ij> create table countries(name varchar(20), population int, area int);
        0 rows inserted/updated/deleted
        ij> insert into countries values
        ('Norway', 5033675, 385252),
        ('Sweden', 9540065, 449964),
        ('Denmark', 5580413, 42894),
        ('Iceland', 320060, 103001),
        ('Liechtenstein', 36281, 160);
        5 rows inserted/updated/deleted
        ij> SELECT NAME FROM COUNTRIES ORDER BY POPULATION DESC FETCH FIRST 2 ROWS ONLY;
        NAME
        --------------------
        Sweden
        Denmark

        2 rows selected
        ij> SELECT NAME FROM COUNTRIES ORDER BY AREA DESC FETCH FIRST 2 ROWS ONLY;
        NAME
        --------------------
        Sweden
        Norway

        2 rows selected
        ij> (SELECT NAME FROM COUNTRIES ORDER BY POPULATION DESC FETCH FIRST 2 ROWS ONLY)
        INTERSECT
        (SELECT NAME FROM COUNTRIES ORDER BY AREA DESC FETCH FIRST 2 ROWS ONLY);
        NAME
        --------------------
        Denmark
        Iceland

        2 rows selected

        The intersection of ('Sweden', 'Denmark') and ('Sweden', 'Norway') should be ('Sweden') and not ('Denmark', 'Iceland').

        Show
        Knut Anders Hatlen added a comment - Dag, I applied the "c" version of the patch and ran some tests in ij. This particular sequence didn't look quite right to me: ij version 10.10 ij> connect 'jdbc:derby:memory:db;create=true'; ij> create table countries(name varchar(20), population int, area int); 0 rows inserted/updated/deleted ij> insert into countries values ('Norway', 5033675, 385252), ('Sweden', 9540065, 449964), ('Denmark', 5580413, 42894), ('Iceland', 320060, 103001), ('Liechtenstein', 36281, 160); 5 rows inserted/updated/deleted ij> SELECT NAME FROM COUNTRIES ORDER BY POPULATION DESC FETCH FIRST 2 ROWS ONLY; NAME -------------------- Sweden Denmark 2 rows selected ij> SELECT NAME FROM COUNTRIES ORDER BY AREA DESC FETCH FIRST 2 ROWS ONLY; NAME -------------------- Sweden Norway 2 rows selected ij> (SELECT NAME FROM COUNTRIES ORDER BY POPULATION DESC FETCH FIRST 2 ROWS ONLY) INTERSECT (SELECT NAME FROM COUNTRIES ORDER BY AREA DESC FETCH FIRST 2 ROWS ONLY); NAME -------------------- Denmark Iceland 2 rows selected The intersection of ('Sweden', 'Denmark') and ('Sweden', 'Norway') should be ('Sweden') and not ('Denmark', 'Iceland').
        Hide
        Knut Anders Hatlen added a comment -

        > Note the boolean flag "dontRemoveOrderingColumns" which guards against removal when the node isn't really a UNION but represents the top level for a VALUES clause with an ORDER BY.

        I was wondering about that... When a VALUES clause has an ORDER BY, it can only reference one of the columns that will actually be returned:

        ij> values 1,2 order by 1+2;
        ERROR 42878: The ORDER BY clause of a SELECT UNION statement only supports unqualified column references and column position numbers. Other expressions are not currently supported. (errorCode = 30000)

        So are there ever any generated ordering columns to remove in a VALUES clause? And if not, would it be harmful to call removeOrderByColumn() just to reduce the number of special cases?

        I found this example, though:

        ij> values 1 order by 1+2;
        1 |2
        -----------------------
        1 |3

        1 row selected

        Not sure why it doesn't raise the same syntax error as the first query. But at least it looks wrong that the ordering column is part of the result.

        I have logged DERBY-6009 for this existing bug.

        Show
        Knut Anders Hatlen added a comment - > Note the boolean flag "dontRemoveOrderingColumns" which guards against removal when the node isn't really a UNION but represents the top level for a VALUES clause with an ORDER BY. I was wondering about that... When a VALUES clause has an ORDER BY, it can only reference one of the columns that will actually be returned: ij> values 1,2 order by 1+2; ERROR 42878: The ORDER BY clause of a SELECT UNION statement only supports unqualified column references and column position numbers. Other expressions are not currently supported. (errorCode = 30000) So are there ever any generated ordering columns to remove in a VALUES clause? And if not, would it be harmful to call removeOrderByColumn() just to reduce the number of special cases? I found this example, though: ij> values 1 order by 1+2; 1 |2 ----------------------- 1 |3 1 row selected Not sure why it doesn't raise the same syntax error as the first query. But at least it looks wrong that the ordering column is part of the result. I have logged DERBY-6009 for this existing bug.
        Hide
        Dag H. Wanvik added a comment - - edited

        Thanks, Knut. I agree the intersect looks wrong indeed. As for the anomaly in values 1 order by 1+2, I guess this is because the logic to analyze the order by in this case resides in RowResultSetNode, not in UnionNode/SetOperatorNode as in the multi-valued case. I'd guess an omission since it's a corner case.
        And yes, I could remove the special case for removeOrderByColumns, probably, and just add a comment to remind the reader of the fact. I added it
        first because I thought I had an error requiring it, but that turned to be something else, but I kept it mostly as a documentary thing.

        Show
        Dag H. Wanvik added a comment - - edited Thanks, Knut. I agree the intersect looks wrong indeed. As for the anomaly in values 1 order by 1+2, I guess this is because the logic to analyze the order by in this case resides in RowResultSetNode, not in UnionNode/SetOperatorNode as in the multi-valued case. I'd guess an omission since it's a corner case. And yes, I could remove the special case for removeOrderByColumns, probably, and just add a comment to remind the reader of the fact. I added it first because I thought I had an error requiring it, but that turned to be something else, but I kept it mostly as a documentary thing.
        Hide
        Dag H. Wanvik added a comment - - edited

        The problem with the intersect is that, as part of its implementation, it pushes its own ORDER BY into the select, effectively clobbering the one specified by the user. I need to find a way to add another level in the result set tree so the two kinds of ordering can co-exist, ideally eliminating the implementation sort if the one specified gives the same ordering.

        Show
        Dag H. Wanvik added a comment - - edited The problem with the intersect is that, as part of its implementation, it pushes its own ORDER BY into the select, effectively clobbering the one specified by the user. I need to find a way to add another level in the result set tree so the two kinds of ordering can co-exist, ideally eliminating the implementation sort if the one specified gives the same ordering.
        Hide
        Dag H. Wanvik added a comment -

        Adding a new patch, which handles the presence of two orderByLists for SelectNode and UnionNode (qua VALUES top node) under an INTERSECT/EXCEPT: The user specifed one arising from this issue, and the one used by the implementation of the set operations. Currently we make no attempt to detect if the same sort is specified by the operand's ORDER BY as the sort of all columns used by the intersect/except comparison operation. The tree is stacked thus when we are ready to generate code:

        OrderByNode (op intersect/except)

        RowCountNode (offset/fetch first)

        OrderBynode (of operand)

        The optimizer is informed by the lower order by if present, if not, by the intersect/except sort. Added new test cases.

        Rerunning regressions.

        Show
        Dag H. Wanvik added a comment - Adding a new patch, which handles the presence of two orderByLists for SelectNode and UnionNode (qua VALUES top node) under an INTERSECT/EXCEPT: The user specifed one arising from this issue, and the one used by the implementation of the set operations. Currently we make no attempt to detect if the same sort is specified by the operand's ORDER BY as the sort of all columns used by the intersect/except comparison operation. The tree is stacked thus when we are ready to generate code: OrderByNode (op intersect/except) RowCountNode (offset/fetch first) OrderBynode (of operand) The optimizer is informed by the lower order by if present, if not, by the intersect/except sort. Added new test cases. Rerunning regressions.
        Hide
        Dag H. Wanvik added a comment -

        Regressions passed.

        Show
        Dag H. Wanvik added a comment - Regressions passed.
        Hide
        Dag H. Wanvik added a comment -

        I think the latest patch represent a net improvement to Derby, so I propose to commit it by the end of the week if I see no objections. Please review.

        Show
        Dag H. Wanvik added a comment - I think the latest patch represent a net improvement to Derby, so I propose to commit it by the end of the week if I see no objections. Please review.
        Hide
        Bryan Pendleton added a comment -

        I'm generally comfortable with the purpose of the patch, and with the approach you've taken to
        implement it. I don't see any reason not to proceed; it's a net improvement to Derby, and we
        can continue to test it in the trunk prior to releasing it in the next release.

        Two comments on the proposed final patch:
        1) It seems like there are tab-versus-space inconsistencies in the new code w.r.t. its placement
        in existing code. Not a big deal, but if it's easy to preserve the existing tabification ...
        2) There were parts of the patch where there were comments saying things like "presumably
        we've got a situation like this...", or "FIXME check if ..." Those sorts of comments in a patch
        seem like reminders to me of things I'd want to resolve before committing the patch, so I
        thought I'd check to see if they were things you wanted to fix prior to commit.

        Show
        Bryan Pendleton added a comment - I'm generally comfortable with the purpose of the patch, and with the approach you've taken to implement it. I don't see any reason not to proceed; it's a net improvement to Derby, and we can continue to test it in the trunk prior to releasing it in the next release. Two comments on the proposed final patch: 1) It seems like there are tab-versus-space inconsistencies in the new code w.r.t. its placement in existing code. Not a big deal, but if it's easy to preserve the existing tabification ... 2) There were parts of the patch where there were comments saying things like "presumably we've got a situation like this...", or "FIXME check if ..." Those sorts of comments in a patch seem like reminders to me of things I'd want to resolve before committing the patch, so I thought I'd check to see if they were things you wanted to fix prior to commit.
        Hide
        Dag H. Wanvik added a comment - - edited

        Thanks, Bryan! As for tabs, I gave up trying to retain tabs years ago, I try to make my patches only use blanks for all new code. I know it looks
        awkward when one reads the patch in a tab=8 context, but alas... I set my tab width to 4 in Emacs before I inspect patches to avoid the problem. I really wish we could get rid of the tabs once and for all soon, though.

        In the comment starting with "presumably" I have an assertion just below it to sanity check that the presumption holds. I'll make the comment point to it more directly so it's clearer that the two belong together.

        As for the FIXME, I think it's OK to leave it in since the likelihood of this happening is low and it would be ok to leave this optimization till someone really needed it, IMHO. Perhaps I should avoid the FIXME word, though, and just say "this can be optimized".?

        Show
        Dag H. Wanvik added a comment - - edited Thanks, Bryan! As for tabs, I gave up trying to retain tabs years ago, I try to make my patches only use blanks for all new code. I know it looks awkward when one reads the patch in a tab=8 context, but alas... I set my tab width to 4 in Emacs before I inspect patches to avoid the problem. I really wish we could get rid of the tabs once and for all soon, though. In the comment starting with "presumably" I have an assertion just below it to sanity check that the presumption holds. I'll make the comment point to it more directly so it's clearer that the two belong together. As for the FIXME, I think it's OK to leave it in since the likelihood of this happening is low and it would be ok to leave this optimization till someone really needed it, IMHO. Perhaps I should avoid the FIXME word, though, and just say "this can be optimized".?
        Hide
        Bryan Pendleton added a comment -

        > I should avoid the FIXME word, though, and just say "this can be optimized"

        I think that would be clearer. Thanks for the follow-up, time to commit!

        Show
        Bryan Pendleton added a comment - > I should avoid the FIXME word, though, and just say "this can be optimized" I think that would be clearer. Thanks for the follow-up, time to commit!
        Hide
        Dag H. Wanvik added a comment -

        Uploading version "e" of this patch, which only introduces some comment improvements and some minor simplifications.

        Show
        Dag H. Wanvik added a comment - Uploading version "e" of this patch, which only introduces some comment improvements and some minor simplifications.
        Hide
        Dag H. Wanvik added a comment -

        Committed version "e" as svn 1422164.

        Show
        Dag H. Wanvik added a comment - Committed version "e" as svn 1422164.
        Hide
        Dag H. Wanvik added a comment - - edited

        Marking resolved, but I leave it open for now if anyone wants to back-port it.

        [Update: Thinking again, that may not be advisable since it introduces a new functionality. But the documentation changes are pending still, so I keep this open.]

        Show
        Dag H. Wanvik added a comment - - edited Marking resolved, but I leave it open for now if anyone wants to back-port it. [Update: Thinking again, that may not be advisable since it introduces a new functionality. But the documentation changes are pending still, so I keep this open.]

          People

          • Assignee:
            Dag H. Wanvik
            Reporter:
            Dag H. Wanvik
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development