Derby
  1. Derby
  2. DERBY-4406

Wrong order when using ORDER BY on non-deterministic function

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Won't Fix
    • Affects Version/s: 10.5.3.0, 10.6.1.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:
      None
    • Issue & fix info:
      Repro attached
    • Bug behavior facts:
      Deviation from standard

      Description

      If I read the SQL standard correctly, a statement such as "select random() as r from t order by random()" should be treated as "select random() as r from t order by r". Derby does however generate a second, hidden random() column by which the rows are ordered.

      ij> create table t(x int);
      0 rows inserted/updated/deleted
      ij> insert into t values 1,2,3,4,5;
      5 rows inserted/updated/deleted
      ij> – wrong result, not ordered by r
      ij> select random() as r from t order by random();
      R
      ----------------------
      0.1285512465366495
      0.5116860880915798
      0.21060042130229073
      0.2506706923680875
      0.6378857329935494

      5 rows selected
      ij> – correct result, ordered by r
      ij> select random() as r from t order by r;
      R
      ----------------------
      0.0749025910679918
      0.07694931688380491
      0.1724114605785414
      0.2268758969382877
      0.31900450349277965

      5 rows selected

        Issue Links

          Activity

          Hide
          Knut Anders Hatlen added a comment -

          See SQL:2003, part 2, section 14.1 <declare cursor>. Under syntax rules:

          18) If an <order by clause> is specified, then:

          (...)

          d) Case:
          i) If <sort specification list> contains any <sort key> Ki that contains a
          column reference to a column that is not a column of T, then:

          (...)

          6) For each i between 1 (one) and NSK, if KTi has the same left normal
          form derivation as the <value expression> immediately contained in
          some <derived column> DC of SLT, then:
          (...)

          A) Case:

          I) If DC simply contains an <as clause>, then let CN be the
          <column name> contained in the <as clause>.

          II) Otherwise, let CN be an implementation-dependent <column name>
          that is not equivalent to the explicit or implicit <column
          name> of any other <derived column> contained in SLT. Let VE
          be the <value expression> simply contained in DC. DC is
          replaced in SLT by VE AS CN

          B) KTi is replaced in OBCT by CN

          Show
          Knut Anders Hatlen added a comment - See SQL:2003, part 2, section 14.1 <declare cursor>. Under syntax rules: 18) If an <order by clause> is specified, then: (...) d) Case: i) If <sort specification list> contains any <sort key> Ki that contains a column reference to a column that is not a column of T, then: (...) 6) For each i between 1 (one) and NSK, if KTi has the same left normal form derivation as the <value expression> immediately contained in some <derived column> DC of SLT, then: (...) A) Case: I) If DC simply contains an <as clause>, then let CN be the <column name> contained in the <as clause>. II) Otherwise, let CN be an implementation-dependent <column name> that is not equivalent to the explicit or implicit <column name> of any other <derived column> contained in SLT. Let VE be the <value expression> simply contained in DC. DC is replaced in SLT by VE AS CN B) KTi is replaced in OBCT by CN
          Hide
          Bryan Pendleton added a comment -

          Thanks for investigating this, Knut. This duplication of the expression implementation
          has always bothered me. I noticed it a while back, and I agree with both your assessment
          of the current behavior, and with your interpretation of the desired behavior.

          Show
          Bryan Pendleton added a comment - Thanks for investigating this, Knut. This duplication of the expression implementation has always bothered me. I noticed it a while back, and I agree with both your assessment of the current behavior, and with your interpretation of the desired behavior.
          Hide
          Knut Anders Hatlen added a comment -

          What I don't find as clear, is what should happen if there are two or more matches in the select list, like

          select random() as r1, random() as r2 from t order by random()

          PostgreSQL and MySQL order by r1 in this case, but I haven't found chapter and verse for this in the standard.

          Show
          Knut Anders Hatlen added a comment - What I don't find as clear, is what should happen if there are two or more matches in the select list, like select random() as r1, random() as r2 from t order by random() PostgreSQL and MySQL order by r1 in this case, but I haven't found chapter and verse for this in the standard.
          Hide
          Rick Hillegas added a comment -

          Hi Knut,

          I asked the SQL committee for advice on what is the correct ordering when an expression in the sort key matches two expressions in the SELECT list. E.g.:

          select random() * a, random() * a
          from t
          order by random() * a

          I received two responses. One simply confirmed that the spec didn't seem to cover that case. The other response was more detailed:

          "You're right, the rule does not specify which column from the SELECT list to use.
          Since this feature has been around since SQL:1999 (14.1 <declare cursor> SR 18)f)i)2)A))
          I would not support making the standard more definite. So we could say that
          it is implementation-dependent or implementation-defined, which column is chosen.
          I personally favor implementation-dependent, because this is such a screwball example
          anyway. You are welcome to write a comment and a paper to solve it.
          If we leave the standard wording untouched, I would say that it supports
          implementation-dependent because the use of "some" implies that when there is more
          than one, the implementation can decide which one based on its own whim."

          The terms "implementation-dependent" and "implementation-defined" are defined in part 1 of the Standard, section 3.1:

          3.1.1.8 implementation-defined: Possibly differing between SQL-implementations, but specified by the
          implementor for each particular SQL-implementation.
          3.1.1.9 implementation-dependent: Possibly differing between SQL-implementations, but not specified
          by ISO/IEC 9075, and not required to be specified by the implementor for any particular SQL-
          implementation.

          I think this means that we are free to do either of the following:

          1) Pick a matching column to sort by, using some well defined rule. E.g., sort by the first matching column.

          2) Pick a matching column at random and sort by it.

          I vote for (1) if it is easy to figure out the first matching column. If it is not easy to figure out the first matching column, then for this edge case I would be content with option (2).

          Thanks.

          Show
          Rick Hillegas added a comment - Hi Knut, I asked the SQL committee for advice on what is the correct ordering when an expression in the sort key matches two expressions in the SELECT list. E.g.: select random() * a, random() * a from t order by random() * a I received two responses. One simply confirmed that the spec didn't seem to cover that case. The other response was more detailed: "You're right, the rule does not specify which column from the SELECT list to use. Since this feature has been around since SQL:1999 (14.1 <declare cursor> SR 18)f)i)2)A)) I would not support making the standard more definite. So we could say that it is implementation-dependent or implementation-defined, which column is chosen. I personally favor implementation-dependent, because this is such a screwball example anyway. You are welcome to write a comment and a paper to solve it. If we leave the standard wording untouched, I would say that it supports implementation-dependent because the use of "some" implies that when there is more than one, the implementation can decide which one based on its own whim." The terms "implementation-dependent" and "implementation-defined" are defined in part 1 of the Standard, section 3.1: 3.1.1.8 implementation-defined: Possibly differing between SQL-implementations, but specified by the implementor for each particular SQL-implementation. 3.1.1.9 implementation-dependent: Possibly differing between SQL-implementations, but not specified by ISO/IEC 9075, and not required to be specified by the implementor for any particular SQL- implementation. I think this means that we are free to do either of the following: 1) Pick a matching column to sort by, using some well defined rule. E.g., sort by the first matching column. 2) Pick a matching column at random and sort by it. I vote for (1) if it is easy to figure out the first matching column. If it is not easy to figure out the first matching column, then for this edge case I would be content with option (2). Thanks.
          Hide
          Knut Anders Hatlen added a comment -

          Thanks for investigating this, Rick. I agree that (1) sounds like the best option.

          Show
          Knut Anders Hatlen added a comment - Thanks for investigating this, Rick. I agree that (1) sounds like the best option.
          Hide
          Nirmal Fernando added a comment - - edited

          Hi,

          I feel the results given by Derby is true. Following is my explanation:

          ij> – wrong result, not ordered by r
          ij> select random() as r from t order by random();
          R
          ----------------------
          0.1285512465366495
          0.5116860880915798
          0.21060042130229073
          0.2506706923680875
          0.6378857329935494

          -here the query first select 5 random numbers and named them as 'r'. Since random() is a particular function you can't use it to refer a specific data set (in this case r), therefore "order by random()" should reorganize those 5 random numbers, randomly (not in ascending/ descending order).

          So the same concept can be applied to following query too.

          select random() as r1, random() as r2 from t order by random()

          Say:

          r1 r2
          ------------------------------ -------------------------
          0.1285512465366495 0.3423578873009
          0.5116860880915798 0.5689209748765
          0.21060042130229073 0.4257678987510
          0.2506706923680875 0.1278737846784
          0.6378857329935494 0.8636983857498

          Result would be:

          r1 r2
          -------------------------------| -------------------------------
          0.6378857329935494 | 0.8636983857498
          0.5116860880915798 | 0.5689209748765
          0.1285512465366495 | 0.3423578873009
          0.2106004213022907 | 0.4257678987510
          0.2506706923680875 | 0.1278737846784

          Show
          Nirmal Fernando added a comment - - edited Hi, I feel the results given by Derby is true. Following is my explanation: ij> – wrong result, not ordered by r ij> select random() as r from t order by random(); R ---------------------- 0.1285512465366495 0.5116860880915798 0.21060042130229073 0.2506706923680875 0.6378857329935494 -here the query first select 5 random numbers and named them as 'r'. Since random() is a particular function you can't use it to refer a specific data set (in this case r), therefore "order by random()" should reorganize those 5 random numbers, randomly (not in ascending/ descending order). So the same concept can be applied to following query too. select random() as r1, random() as r2 from t order by random() Say: r1 r2 ------------------------------ ------------------------- 0.1285512465366495 0.3423578873009 0.5116860880915798 0.5689209748765 0.21060042130229073 0.4257678987510 0.2506706923680875 0.1278737846784 0.6378857329935494 0.8636983857498 Result would be: r1 r2 -------------------------------| ------------------------------- 0.6378857329935494 | 0.8636983857498 0.5116860880915798 | 0.5689209748765 0.1285512465366495 | 0.3423578873009 0.2106004213022907 | 0.4257678987510 0.2506706923680875 | 0.1278737846784
          Hide
          Nirmal Fernando added a comment -

          Hi all,
          Seems like I got it wrong !! After performing the query in PostgreSQL I found that the output provide by Derby is incorrect.
          By the way, I like look in to this bug if it possible try to fix this.

          My understanding is we should check for the occurrence of the first "random()" and sort the table according to that when "random()" comes after an "order by" query is issued.

          I would like to get some views.

          Thanks!!

          Show
          Nirmal Fernando added a comment - Hi all, Seems like I got it wrong !! After performing the query in PostgreSQL I found that the output provide by Derby is incorrect. By the way, I like look in to this bug if it possible try to fix this. My understanding is we should check for the occurrence of the first "random()" and sort the table according to that when "random()" comes after an "order by" query is issued. I would like to get some views. Thanks!!
          Hide
          Bryan Pendleton added a comment -

          Hi Nirmal, I think it would be great if you can look at this problem. You should mark the
          entry as 'assigned' to you while you are studying it to let others know (you can unassign
          it later if you so choose).

          I think that you'll find that it is helpful to study the code in OrderByColumn.java and OrderByList.java.

          In particular, you'll want to understand the concept of "pulled up" columns, which
          is a construct that Derby uses to handle cases like

          select a,b from t order by c, d

          Since columns c and d don't appear in the select list, but need to be retrieved from
          the database in order to use their values for sorting, they are 'pulled up' into the result
          set, but marked specially to show that they are present only for sorting purposes,
          but not for use in the actual results.

          In the query in this issue

          select random() from t order by random()

          I think that the problem is related to the fact that we don't recognize that the two
          random() expressions are equivalent, and so we treat this as a case where we are
          'pulling up' a second column into the result set, instead of just setting the order by
          column to refer to the column already present in the result set.

          Show
          Bryan Pendleton added a comment - Hi Nirmal, I think it would be great if you can look at this problem. You should mark the entry as 'assigned' to you while you are studying it to let others know (you can unassign it later if you so choose). I think that you'll find that it is helpful to study the code in OrderByColumn.java and OrderByList.java. In particular, you'll want to understand the concept of "pulled up" columns, which is a construct that Derby uses to handle cases like select a,b from t order by c, d Since columns c and d don't appear in the select list, but need to be retrieved from the database in order to use their values for sorting, they are 'pulled up' into the result set, but marked specially to show that they are present only for sorting purposes, but not for use in the actual results. In the query in this issue select random() from t order by random() I think that the problem is related to the fact that we don't recognize that the two random() expressions are equivalent, and so we treat this as a case where we are 'pulling up' a second column into the result set, instead of just setting the order by column to refer to the column already present in the result set.
          Hide
          Nirmal Fernando added a comment -

          Thanks for the information Bryan they'll be really useful !!

          Can someone please tell me how can I assign my self to this issue? In work log it says following,
          Worklog:
          You don't have permission to work on this issue.

          Show
          Nirmal Fernando added a comment - Thanks for the information Bryan they'll be really useful !! Can someone please tell me how can I assign my self to this issue? In work log it says following, Worklog: You don't have permission to work on this issue.
          Hide
          Knut Anders Hatlen added a comment -

          Hi Nirmal,
          I've added you to the derby-developers group in JIRA, so now you should be able to assign the issue to yourself.
          Thanks.

          Show
          Knut Anders Hatlen added a comment - Hi Nirmal, I've added you to the derby-developers group in JIRA, so now you should be able to assign the issue to yourself. Thanks.
          Hide
          Nirmal Fernando added a comment -

          Thanks Knut, I assigned my self to this issue.

          Show
          Nirmal Fernando added a comment - Thanks Knut, I assigned my self to this issue.
          Hide
          Knut Anders Hatlen added a comment -

          My initial analysis of the case "select random() as r from t order by
          random()" wasn't quite accurate. Here are some extra details I've
          found when studying the SQL standard more closely:

          "select random() as r from t order by random()" is not standard
          SQL. When I quoted syntax rule 18 above, I omitted these sub-rules:

          > 18) If an <order by clause> is specified, then:
          > a) Let OBC be the <order by clause>. Let NSK be the number of <sort
          > specification>s in OBC. For each i between 1 (one) and NSK, let Ki be
          > the <sort key> contained in the i-th <sort specification> in OBC.
          > b) Each Ki shall contain a <column reference> and shall not contain a
          > <subquery> or a <set function specification>.

          Since "random()" does not contain a <column reference>, this ORDER BY
          clause is not valid syntax. Adding a column reference to the sort key
          makes it valid, so I think these statements use valid syntax:

          i) SELECT RANDOM()*A AS R FROM MYTABLE ORDER BY RANDOM()*A
          ii) SELECT RANDOM()*A AS R, A FROM MYTABLE ORDER BY RANDOM()*A

          For statement , the sort key contains a column reference to a
          column that is not a column of T (column A), so the above quoted
          syntax rule 18)d)i) comes into play. The results of that statement
          should therefore be sorted on R.

          For statement (ii), the opposite is true. Since column A is included
          in the select list, the sort key does not contain a column reference
          to a column that is not a column of T. Rule 18)d)i) does therefore not
          apply, and the sort key should not be replaced by R. This, I think,
          means the results of this statement should not be sorted on R.

          At least, that's how I interpret the wording in the standard. I have
          no idea why it has this apparent asymmetry. Others may have a
          different interpretation, though.

          Show
          Knut Anders Hatlen added a comment - My initial analysis of the case "select random() as r from t order by random()" wasn't quite accurate. Here are some extra details I've found when studying the SQL standard more closely: "select random() as r from t order by random()" is not standard SQL. When I quoted syntax rule 18 above, I omitted these sub-rules: > 18) If an <order by clause> is specified, then: > a) Let OBC be the <order by clause>. Let NSK be the number of <sort > specification>s in OBC. For each i between 1 (one) and NSK, let Ki be > the <sort key> contained in the i-th <sort specification> in OBC. > b) Each Ki shall contain a <column reference> and shall not contain a > <subquery> or a <set function specification>. Since "random()" does not contain a <column reference>, this ORDER BY clause is not valid syntax. Adding a column reference to the sort key makes it valid, so I think these statements use valid syntax: i) SELECT RANDOM()*A AS R FROM MYTABLE ORDER BY RANDOM()*A ii) SELECT RANDOM()*A AS R, A FROM MYTABLE ORDER BY RANDOM()*A For statement , the sort key contains a column reference to a column that is not a column of T (column A), so the above quoted syntax rule 18)d)i) comes into play. The results of that statement should therefore be sorted on R. For statement (ii), the opposite is true. Since column A is included in the select list, the sort key does not contain a column reference to a column that is not a column of T. Rule 18)d)i) does therefore not apply, and the sort key should not be replaced by R. This, I think, means the results of this statement should not be sorted on R. At least, that's how I interpret the wording in the standard. I have no idea why it has this apparent asymmetry. Others may have a different interpretation, though.
          Hide
          Nirmal Fernando added a comment -

          Hi,

          After went through the source code and after debugging, I found some interesting points.

          ==================================================
          ij> select random() from t order by random();

          ascending; true
          addedColumnOffset: -1
          columnPosition: -1

          expression instanceof ColumnReference = false //random() is not a column
          expression.isConstantExpression() = false //random() is not a constant expression eg: x*2

          ascending; true
          addedColumnOffset: 0
          columnPosition: 2

          obc.getResultColumn().getExpression() instanceof ColumnReference = false //Ordering on an expression

          1
          ----------------------
          0.7460335991462618
          0.22942514814484716
          0.04022635294210675
          0.39009678758326427
          0.9438511615474983
          0.803614720479578
          0.7563735003656935

          7 rows selected
          ===========================================

          In OrderByColumn class at bindOrderByColumn(ResultSetNode target, OrderByList oblist) method, Derby doesn't check for an expression (eg: random() ) in an order by clause. Therefore it pulled up a column.

          I have few questions:

          1) How derby resolves random() function?
          2) Does derby consider random() as a expression or else other (it can't be a column reference)?

          Show
          Nirmal Fernando added a comment - Hi, After went through the source code and after debugging, I found some interesting points. ================================================== ij> select random() from t order by random(); ascending; true addedColumnOffset: -1 columnPosition: -1 expression instanceof ColumnReference = false //random() is not a column expression.isConstantExpression() = false //random() is not a constant expression eg: x*2 ascending; true addedColumnOffset: 0 columnPosition: 2 obc.getResultColumn().getExpression() instanceof ColumnReference = false //Ordering on an expression 1 ---------------------- 0.7460335991462618 0.22942514814484716 0.04022635294210675 0.39009678758326427 0.9438511615474983 0.803614720479578 0.7563735003656935 7 rows selected =========================================== In OrderByColumn class at bindOrderByColumn(ResultSetNode target, OrderByList oblist) method, Derby doesn't check for an expression (eg: random() ) in an order by clause. Therefore it pulled up a column. I have few questions: 1) How derby resolves random() function? 2) Does derby consider random() as a expression or else other (it can't be a column reference)?
          Hide
          Nirmal Fernando added a comment -

          Hi Knut, thanks for your clarification on 28th.

          I checked whether the columnNameMatches(String columnName) method in ResultColumn class is called after the above query, but it did not. I think the reason is Derby only check for matching column names, but in this case random() is not identified as a "ColumnReference".

          Like to get your views !!

          Show
          Nirmal Fernando added a comment - Hi Knut, thanks for your clarification on 28th. I checked whether the columnNameMatches(String columnName) method in ResultColumn class is called after the above query, but it did not. I think the reason is Derby only check for matching column names, but in this case random() is not identified as a "ColumnReference". Like to get your views !!
          Hide
          Bryan Pendleton added a comment -

          Derby definitely considers random() to be an expression, not a column reference. There is a method ValueNode.isEquivalent() that can be
          used to ask if two ValueNodes are equivalent expressions, so I think
          that it should be possible to compare the two occurrences of random()
          using isEquivalent. Random(), I believe, is a pre-defined built-in function
          that Derby provides, using a mechanism that you can see in
          DataDictionaryImpl.java

          Show
          Bryan Pendleton added a comment - Derby definitely considers random() to be an expression, not a column reference. There is a method ValueNode.isEquivalent() that can be used to ask if two ValueNodes are equivalent expressions, so I think that it should be possible to compare the two occurrences of random() using isEquivalent. Random(), I believe, is a pre-defined built-in function that Derby provides, using a mechanism that you can see in DataDictionaryImpl.java
          Hide
          Nirmal Fernando added a comment -

          Hi,

          In OrderByColumn class at bindOrderByColumn(ResultSetNode target, OrderByList oblist) method, the parameter oblist (OrderByList ) should contain the random() expression if we queried "select random() from t order by random();". So, can someone help me to check whether this oblist contains random()? I tried to print it in the console, by using oblist.elementAt.toString() and several different statements, but not successful.

          Thanks !!

          Show
          Nirmal Fernando added a comment - Hi, In OrderByColumn class at bindOrderByColumn(ResultSetNode target, OrderByList oblist) method, the parameter oblist (OrderByList ) should contain the random() expression if we queried "select random() from t order by random();". So, can someone help me to check whether this oblist contains random()? I tried to print it in the console, by using oblist.elementAt .toString() and several different statements, but not successful. Thanks !!
          Hide
          Nirmal Fernando added a comment -

          Hi,

          I found that the value of ValueNode: "expression" in OrderByColumn.java is "dataTypeServices: null" after executing above query (i.e. when order by contains random()).
          I checked with other different occurrences too. following are the details:

          select random() from t order by x --> expression = columnName: X
          select x from t order by x*-2 --> expression = operator: *

          Any suggestions regarding this would be great !!

          Thanks !!

          Show
          Nirmal Fernando added a comment - Hi, I found that the value of ValueNode: "expression" in OrderByColumn.java is "dataTypeServices: null" after executing above query (i.e. when order by contains random()). I checked with other different occurrences too. following are the details: select random() from t order by x --> expression = columnName: X select x from t order by x*-2 --> expression = operator: * Any suggestions regarding this would be great !! Thanks !!
          Hide
          Bryan Pendleton added a comment -

          Hi Nirmal,

          I think you will find that the handling of random() is fairly tricky,
          because I think it is a Derby-specific behavior in which the SQL
          statement gets compiled into something which is able to call
          java.lang.Math.random() at run time, via some special Derby-Java
          glue code.

          I think that the object in question may be a "UserType", which is
          a class that knows how to invoke user-defined Java code that
          has been defined as an extension data type, which is what random() is.

          I suspect it may be easier to get an understanding of the data
          structures if you first have a close look at simpler situations,
          as you have been doing so far.

          In particular, I think you could probably work up to the problem like this:

          1) select x from t order by x
          2) select x, x*2 from t order by x*2
          3) select random() from t order by random()

          The goal in all 3 cases, I believe, is to enhance Derby so that it
          can recognize that the OrderByColumn is referencing an expression
          which is equivalent to a ResultColumn that is already present in
          the SELECT, and so we don't need to duplicate the processing
          for the OrderByColumn at runtime, but can rather arrange to have
          the OrderByColumn directly reference the ResultColumn.

          Once we have arranged for the OrderByColumn to directly reference
          the ResultColumn, the original script (select random() order by random())
          will be solved, because at runtime there won't be two random()s,
          only one).

          However, note that sometimes we can't have the OrderByColumn
          directly reference the ResultColumn, because we will see legitimate
          statements like:

          select name from employee order by id

          In such a case, the two columns are distinct and have to be processed separately.

          Show
          Bryan Pendleton added a comment - Hi Nirmal, I think you will find that the handling of random() is fairly tricky, because I think it is a Derby-specific behavior in which the SQL statement gets compiled into something which is able to call java.lang.Math.random() at run time, via some special Derby-Java glue code. I think that the object in question may be a "UserType", which is a class that knows how to invoke user-defined Java code that has been defined as an extension data type, which is what random() is. I suspect it may be easier to get an understanding of the data structures if you first have a close look at simpler situations, as you have been doing so far. In particular, I think you could probably work up to the problem like this: 1) select x from t order by x 2) select x, x*2 from t order by x*2 3) select random() from t order by random() The goal in all 3 cases, I believe, is to enhance Derby so that it can recognize that the OrderByColumn is referencing an expression which is equivalent to a ResultColumn that is already present in the SELECT, and so we don't need to duplicate the processing for the OrderByColumn at runtime, but can rather arrange to have the OrderByColumn directly reference the ResultColumn. Once we have arranged for the OrderByColumn to directly reference the ResultColumn, the original script (select random() order by random()) will be solved, because at runtime there won't be two random()s, only one). However, note that sometimes we can't have the OrderByColumn directly reference the ResultColumn, because we will see legitimate statements like: select name from employee order by id In such a case, the two columns are distinct and have to be processed separately.
          Hide
          Nirmal Fernando added a comment -

          Hi,

          What about doing this procedure:

          We must avoid pulling up random() in an order by clause, without perform a check for a random(). In OrderByColumn.java class, inside method pullUpOrderByColumn(ResultSetNode target), currently it only checks for the ColumnReference; since random() is not a ColumnReference instance it fails there, and that leads to pull up the column straight away without checking its existence in RCL.

          So is it a good way to add another else if condition there and checks for a random() expression, then we can send this random() to perform a check for its existence in RCL?

          Thanks !!

          Show
          Nirmal Fernando added a comment - Hi, What about doing this procedure: We must avoid pulling up random() in an order by clause, without perform a check for a random(). In OrderByColumn.java class, inside method pullUpOrderByColumn(ResultSetNode target), currently it only checks for the ColumnReference; since random() is not a ColumnReference instance it fails there, and that leads to pull up the column straight away without checking its existence in RCL. So is it a good way to add another else if condition there and checks for a random() expression, then we can send this random() to perform a check for its existence in RCL? Thanks !!
          Hide
          Nirmal Fernando added a comment -

          Hi Knut & Bryan,

          I did small observation and found out that "(expression.toString().equals("dataTypeServices: null\n")) && (expression.getColumnName()==null) = true" is only occur when we query a random() in order by clause at the point of execution of "pullUpOrderByColumn" method in OrderByColumn class.
          So, can't we check for that inside that method and do necessary steps after that in order to make random() works fine, if it appeared in both select clause and order by clause?

          I highly appreciate your help and view thoughts!!!

          Thanks !!

          Show
          Nirmal Fernando added a comment - Hi Knut & Bryan, I did small observation and found out that "(expression.toString().equals("dataTypeServices: null\n")) && (expression.getColumnName()==null) = true" is only occur when we query a random() in order by clause at the point of execution of "pullUpOrderByColumn" method in OrderByColumn class. So, can't we check for that inside that method and do necessary steps after that in order to make random() works fine, if it appeared in both select clause and order by clause? I highly appreciate your help and view thoughts!!! Thanks !!
          Hide
          Bryan Pendleton added a comment -

          Hi Nirmal,

          It sounds like you have some ideas about some possible fixes.

          Why don't you try writing a potential fix, and attach it to this issue as a patch proposal,
          which will give us something more specific to discuss?

          Also, when you write your fix, it would be great if you can also:
          1) Run the current regression tests with your fix in place, and report the results
          2) Include as part of your patch some additional regression test changes which
          verify that this issue is fixed

          If you need help enhancing the tests, just let us know what problems you encounter.

          Show
          Bryan Pendleton added a comment - Hi Nirmal, It sounds like you have some ideas about some possible fixes. Why don't you try writing a potential fix, and attach it to this issue as a patch proposal, which will give us something more specific to discuss? Also, when you write your fix, it would be great if you can also: 1) Run the current regression tests with your fix in place, and report the results 2) Include as part of your patch some additional regression test changes which verify that this issue is fixed If you need help enhancing the tests, just let us know what problems you encounter.
          Hide
          Nirmal Fernando added a comment -

          Hi,

          Bryan, I think I almost fixed this bug, but I have a question like this:

          What is the purpose of following:

          if( SanityManager.DEBUG)

          { SanityManager.ASSERT( addedColumnOffset >= 0, "Order by expression was not pulled into the result column list"); }

          If I keep this in the code, a runtime exception fires saying that "ASSERT FAILED Order by expression was not pulled into the result column list...", But if I comment this and run, the end result is displayed correctly.

          What shall I do?

          Thanks !!!

          Show
          Nirmal Fernando added a comment - Hi, Bryan, I think I almost fixed this bug, but I have a question like this: What is the purpose of following: if( SanityManager.DEBUG) { SanityManager.ASSERT( addedColumnOffset >= 0, "Order by expression was not pulled into the result column list"); } If I keep this in the code, a runtime exception fires saying that "ASSERT FAILED Order by expression was not pulled into the result column list...", But if I comment this and run, the end result is displayed correctly. What shall I do? Thanks !!!
          Hide
          Bryan Pendleton added a comment -

          Sanity manager assertions are used to check situations that we believe
          should never occur. They are placed in the code by programmers to
          try to catch "impossible" situations. The assertions typically try to
          give a more helpful error message about a problem in the data structures,
          instead of the NullPointerAssertion or IndexOutOfRangeError that might
          occur without the assertion.

          If the code is changed, in such a way that the assertion no longer holds,
          then it is completely reasonable to remove the assertion.

          In this case, if your changes mean that it is no longer necessary to
          require that the addedColumnOffset is non-negative, then your patch
          should remove the assertion.

          Show
          Bryan Pendleton added a comment - Sanity manager assertions are used to check situations that we believe should never occur. They are placed in the code by programmers to try to catch "impossible" situations. The assertions typically try to give a more helpful error message about a problem in the data structures, instead of the NullPointerAssertion or IndexOutOfRangeError that might occur without the assertion. If the code is changed, in such a way that the assertion no longer holds, then it is completely reasonable to remove the assertion. In this case, if your changes mean that it is no longer necessary to require that the addedColumnOffset is non-negative, then your patch should remove the assertion.
          Hide
          Knut Anders Hatlen added a comment -

          Before we spend too much time implementing a solution for this issue,
          I think we need to take a step back and clarify what the solution
          should look like.

          Here are some questions I think we should try to answer before we go
          further:

          1) SELECT RANDOM() AS R FROM T ORDER BY RANDOM()

          For this statement (taken from the bug description), we don't get any
          guidance from the SQL standard, because the SQL standard doesn't allow
          sort keys that don't have a column reference. This means that changing
          how the result from the query is ordered, does not make it any more
          standards compliant. So does it make any sense to change it?

          2) SELECT A*RANDOM() AS R FROM T ORDER BY A*RANDOM()

          How should this query be ordered? Based on my interpretation in an
          earlier comment (28/Mar/10), I believe that the results should be
          sorted on R. Derby currently does not sort the results on R. So, do
          others agree with my previous interpretation that the correct way to
          handle this query, is to order the results by R? And if so, can/should
          we change it?

          3) SELECT A, A*RANDOM() AS R FROM T ORDER BY A*RANDOM()

          My interpretation (comment dated 28/Mar/10) is that this query should
          not be ordered by R, and Derby currently behaves in accordance with
          that interpretation. Do others agree with this interpretation? If so,
          we must make sure that if we change the behaviour of (1) and/or (2),
          we don't also change the behaviour of (3).

          Show
          Knut Anders Hatlen added a comment - Before we spend too much time implementing a solution for this issue, I think we need to take a step back and clarify what the solution should look like. Here are some questions I think we should try to answer before we go further: 1) SELECT RANDOM() AS R FROM T ORDER BY RANDOM() For this statement (taken from the bug description), we don't get any guidance from the SQL standard, because the SQL standard doesn't allow sort keys that don't have a column reference. This means that changing how the result from the query is ordered, does not make it any more standards compliant. So does it make any sense to change it? 2) SELECT A*RANDOM() AS R FROM T ORDER BY A*RANDOM() How should this query be ordered? Based on my interpretation in an earlier comment (28/Mar/10), I believe that the results should be sorted on R. Derby currently does not sort the results on R. So, do others agree with my previous interpretation that the correct way to handle this query, is to order the results by R? And if so, can/should we change it? 3) SELECT A, A*RANDOM() AS R FROM T ORDER BY A*RANDOM() My interpretation (comment dated 28/Mar/10) is that this query should not be ordered by R, and Derby currently behaves in accordance with that interpretation. Do others agree with this interpretation? If so, we must make sure that if we change the behaviour of (1) and/or (2), we don't also change the behaviour of (3).
          Hide
          Nirmal Fernando added a comment -

          Hi,

          Knut thanks for commenting on the issue, and I am waiting for the comments from the Derby community on the three scenarios that Knut pointed out.

          And to add this I would like to get your comments on the query "select random() from t order by random()", as I think we need a bug fix for this query as well.

          Thanks

          Show
          Nirmal Fernando added a comment - Hi, Knut thanks for commenting on the issue, and I am waiting for the comments from the Derby community on the three scenarios that Knut pointed out. And to add this I would like to get your comments on the query "select random() from t order by random()", as I think we need a bug fix for this query as well. Thanks
          Hide
          Knut Anders Hatlen added a comment -

          Hi Nirmal,

          The query "select random() from t order by random()" is in the same category as query (1) in my previous comment. That is, because the search key random() does not contain a column reference, it is not a valid SQL statement (per syntax rule 18b quoted in my comment of 28/Mar/10).

          Show
          Knut Anders Hatlen added a comment - Hi Nirmal, The query "select random() from t order by random()" is in the same category as query (1) in my previous comment. That is, because the search key random() does not contain a column reference, it is not a valid SQL statement (per syntax rule 18b quoted in my comment of 28/Mar/10).
          Hide
          Lily Wei added a comment -

          If the goal is for Derby to recognize OrderByColumn is referencing an expression which is equivalent to ResultColumn that is already present in the SELECT and yet follow SQL standard as much as we can. I agree (+1) we must make sure that if we change OrderByColumn behavior for use case (1) and/or (2) and we don't change the behavior for use case (3).

          Show
          Lily Wei added a comment - If the goal is for Derby to recognize OrderByColumn is referencing an expression which is equivalent to ResultColumn that is already present in the SELECT and yet follow SQL standard as much as we can. I agree (+1) we must make sure that if we change OrderByColumn behavior for use case (1) and/or (2) and we don't change the behavior for use case (3).
          Hide
          Nirmal Fernando added a comment -

          Hi,

          I found that,
          1) "select random() from t order by random()" == "select random() as r from t order by random()" is considered as a valid query in Postgre SQL.

          2) "SELECT i*RANDOM() AS R FROM T ORDER BY i*RANDOM()": In Postgre SQL they're ordering this by R.

          3)"SELECT i, i*RANDOM() AS R FROM T ORDER BY i*RANDOM() ": In Postgre SQL they're ordering the result set by i*RANDOM() that means by R.

          Show
          Nirmal Fernando added a comment - Hi, I found that, 1) "select random() from t order by random()" == "select random() as r from t order by random()" is considered as a valid query in Postgre SQL. 2) "SELECT i*RANDOM() AS R FROM T ORDER BY i*RANDOM()": In Postgre SQL they're ordering this by R. 3)"SELECT i, i*RANDOM() AS R FROM T ORDER BY i*RANDOM() ": In Postgre SQL they're ordering the result set by i*RANDOM() that means by R.
          Hide
          Bryan Pendleton added a comment -

          The more I think about it, the more I'm tempted to just leave the current Derby
          behavior alone in this area. The idea of ordering by RANDOM is rather hypothetical,
          and I'm having trouble coming up with a real-life situation in which I would use this
          function in this way. In practice, I think that if I was trying to do random data
          processing (e.g., for statistical simulations, or for generating test data), I would
          be content to generate the random data and store it into the database, then
          subsequently sort by the actual columns which held the randomly-generated data, thus
          avoiding this problem entirely.

          It seems like the main point of DERBY-4406 is that our current implementation
          has problems when using expressions in the ORDER BY clause which are either:

          • non-deterministic, or
          • have side effects
            In either of these situations the user might particularly care about invoking the
            function once-per-row in the query, not more-than-once.

          Are there functions other than RANDOM() which exhibit this behavior?

          Since the consensus appears to be that the original query was not standard SQL,
          the fact that we implement it differently than other databases is OK.

          So I'm leaning toward the conclusion that we should resolve this as "won't fix".

          Show
          Bryan Pendleton added a comment - The more I think about it, the more I'm tempted to just leave the current Derby behavior alone in this area. The idea of ordering by RANDOM is rather hypothetical, and I'm having trouble coming up with a real-life situation in which I would use this function in this way. In practice, I think that if I was trying to do random data processing (e.g., for statistical simulations, or for generating test data), I would be content to generate the random data and store it into the database, then subsequently sort by the actual columns which held the randomly-generated data, thus avoiding this problem entirely. It seems like the main point of DERBY-4406 is that our current implementation has problems when using expressions in the ORDER BY clause which are either: non-deterministic, or have side effects In either of these situations the user might particularly care about invoking the function once-per-row in the query, not more-than-once. Are there functions other than RANDOM() which exhibit this behavior? Since the consensus appears to be that the original query was not standard SQL, the fact that we implement it differently than other databases is OK. So I'm leaning toward the conclusion that we should resolve this as "won't fix".
          Hide
          Knut Anders Hatlen added a comment -

          I'm leaning toward the same conclusion as Bryan, that we should close the issue as "won't fix". Users can always specify explicitly which behaviour they want by using correlation names, which will also make the SQL statements clearer. So it's probably not worth the effort or the risk of breaking old applications that depend on the current behaviour.

          > Are there functions other than RANDOM() which exhibit this behavior?

          No built-in functions that I'm aware of, but you can create non-deterministic user-defined functions which behave similarly. (After DERBY-3570, it's even possible to tell CREATE FUNCTION whether or not the function is deterministic.)

          Show
          Knut Anders Hatlen added a comment - I'm leaning toward the same conclusion as Bryan, that we should close the issue as "won't fix". Users can always specify explicitly which behaviour they want by using correlation names, which will also make the SQL statements clearer. So it's probably not worth the effort or the risk of breaking old applications that depend on the current behaviour. > Are there functions other than RANDOM() which exhibit this behavior? No built-in functions that I'm aware of, but you can create non-deterministic user-defined functions which behave similarly. (After DERBY-3570 , it's even possible to tell CREATE FUNCTION whether or not the function is deterministic.)
          Hide
          Lily Wei added a comment -

          The detail explanation clearly explains why we are leading toward to not fixing it. Just out of curiosity, will performance be better if we ordering by RANDOM or it will pretty much be the same as using correlation names. I am not saying performance will be better. I just think this is a best time to evaluate the benefit if we can gain some.

          Show
          Lily Wei added a comment - The detail explanation clearly explains why we are leading toward to not fixing it. Just out of curiosity, will performance be better if we ordering by RANDOM or it will pretty much be the same as using correlation names. I am not saying performance will be better. I just think this is a best time to evaluate the benefit if we can gain some.
          Hide
          Knut Anders Hatlen added a comment -

          Hi Lily,

          If you have two equivalent queries that only differ in whether they use correlation names, the performance should be identical. I think the compiler also adds auto-generated correlation names internally if there is no explicit correlation name given, so there should be no difference at execution time.

          Show
          Knut Anders Hatlen added a comment - Hi Lily, If you have two equivalent queries that only differ in whether they use correlation names, the performance should be identical. I think the compiler also adds auto-generated correlation names internally if there is no explicit correlation name given, so there should be no difference at execution time.
          Hide
          Lily Wei added a comment -

          Hi Knut:
          Thank you for exploring my curiosity. If there is no difference in compiler or execution time on using correlation names or not, there should be no performance concern. There is very little point to change anything from the current behavior. Lily

          Show
          Lily Wei added a comment - Hi Knut: Thank you for exploring my curiosity. If there is no difference in compiler or execution time on using correlation names or not, there should be no performance concern. There is very little point to change anything from the current behavior. Lily
          Hide
          Nirmal Fernando added a comment -

          I am closing this issue, as "won't fix".

          Thanks.

          Show
          Nirmal Fernando added a comment - I am closing this issue, as "won't fix". Thanks.

            People

            • Assignee:
              Nirmal Fernando
              Reporter:
              Knut Anders Hatlen
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development