Derby
  1. Derby
  2. DERBY-6017

IN lists with mixed types may return wrong results

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 10.9.1.0
    • Fix Version/s: 10.8.3.3, 10.9.2.2, 10.10.1.1
    • Component/s: SQL
    • Labels:
      None
    • Issue & fix info:
      Repro attached
    • Bug behavior facts:
      Wrong query result

      Description

      Given this table:

      ij> connect 'jdbc:derby:memory:db;create=true';
      ij> create table t(x bigint);
      0 rows inserted/updated/deleted
      ij> insert into t values 9223372036854775805, 9223372036854775806, 9223372036854775807;
      3 rows inserted/updated/deleted

      A query that uses an IN list that contains all the three values actually stored in the table, returns all three rows as expected:

      ij> select * from t where x in (9223372036854775805, 9223372036854775806, 9223372036854775807);
      X
      --------------------
      9223372036854775805
      9223372036854775806
      9223372036854775807

      3 rows selected

      However, if we add a value whose type precedence is higher, like a DOUBLE value, and that value happens to be equal to the approximation of the other values in the IN list when they are cast from BIGINT to DOUBLE, only one row is returned:

      ij> select * from t where x in (9223372036854775805, 9223372036854775806, 9223372036854775807, 9.223372036854776E18);
      X
      --------------------
      9223372036854775805

      1 row selected

      I believe this query should return all three rows too.

      1. d6017-1a-duplicates.diff
        8 kB
        Knut Anders Hatlen
      2. d6017-2a-cast-if-needed.diff
        12 kB
        Knut Anders Hatlen
      3. d6017-3a-weme-test-failure.diff
        2 kB
        Knut Anders Hatlen

        Issue Links

          Activity

          Hide
          Kathey Marsden added a comment -

          Assign back to Knut after backport.

          Show
          Kathey Marsden added a comment - Assign back to Knut after backport.
          Hide
          ASF subversion and git services added a comment -

          Commit 1511288 from Kathey Marsden in branch 'code/branches/10.8'
          [ https://svn.apache.org/r1511288 ]

          DERBY-6017 IN lists with mixed types may return wrong results

          merged from trunk 1424889, 1448025, 1450695
          Contrubuted by Knut Anders Hatlen

          Show
          ASF subversion and git services added a comment - Commit 1511288 from Kathey Marsden in branch 'code/branches/10.8' [ https://svn.apache.org/r1511288 ] DERBY-6017 IN lists with mixed types may return wrong results merged from trunk 1424889, 1448025, 1450695 Contrubuted by Knut Anders Hatlen
          Hide
          ASF subversion and git services added a comment -

          Commit 1510609 from Kathey Marsden in branch 'code/branches/10.9'
          [ https://svn.apache.org/r1510609 ]

          DERBY-6017 IN lists with mixed types may return wrong results

          merge from trunk revisions 1424889, 1448025, 1450695
          Contributed by Knut Anders Hatlen

          Show
          ASF subversion and git services added a comment - Commit 1510609 from Kathey Marsden in branch 'code/branches/10.9' [ https://svn.apache.org/r1510609 ] DERBY-6017 IN lists with mixed types may return wrong results merge from trunk revisions 1424889, 1448025, 1450695 Contributed by Knut Anders Hatlen
          Hide
          Kathey Marsden added a comment -

          Assign to myself for backport

          Show
          Kathey Marsden added a comment - Assign to myself for backport
          Hide
          Knut Anders Hatlen added a comment -

          Thanks, Myrna. Closing the issue again.

          Show
          Knut Anders Hatlen added a comment - Thanks, Myrna. Closing the issue again.
          Hide
          Myrna van Lunteren added a comment -

          Looks like the run for weme was clean last night, using a build of trunk sync-ed up to revision 1451076. Thanks for noticing this failure, and fixing it.

          Show
          Myrna van Lunteren added a comment - Looks like the run for weme was clean last night, using a build of trunk sync-ed up to revision 1451076. Thanks for noticing this failure, and fixing it.
          Hide
          Knut Anders Hatlen added a comment -

          Attaching d6017-3a-weme-test-failure.diff which makes the test case use a different assert method that checks the actual numeric values returned by the result set, rather than just checking the string representation of the values. Hopefully, this is enough to get the test to pass on weme.

          Committed revision 1450695.

          I'll keep this bug open until there has been a clean run on weme in the nightly tests. I don't have a copy of weme to test it on myself.

          Show
          Knut Anders Hatlen added a comment - Attaching d6017-3a-weme-test-failure.diff which makes the test case use a different assert method that checks the actual numeric values returned by the result set, rather than just checking the string representation of the values. Hopefully, this is enough to get the test to pass on weme. Committed revision 1450695. I'll keep this bug open until there has been a clean run on weme in the nightly tests. I don't have a copy of weme to test it on myself.
          Hide
          Knut Anders Hatlen added a comment -

          The test for this issue is failing on weme. For example here: http://people.apache.org/~myrnavl/derby_test_results/main/windows/testlog/weme6.2/1450000-suites.All_diff.txt

          There was 1 failure:
          1) testMixedTypes(org.apache.derbyTesting.functionTests.tests.lang.InPredicateTest)junit.framework.AssertionFailedError: Column value mismatch @ column 'R1', row 1:
          Expected: >2.14748365E9<
          Found: >2.1474836E9<

          It looks like doubles are formatted differently on that platform. I'll see if I can make the test more robust against formatting differences.

          Show
          Knut Anders Hatlen added a comment - The test for this issue is failing on weme. For example here: http://people.apache.org/~myrnavl/derby_test_results/main/windows/testlog/weme6.2/1450000-suites.All_diff.txt There was 1 failure: 1) testMixedTypes(org.apache.derbyTesting.functionTests.tests.lang.InPredicateTest)junit.framework.AssertionFailedError: Column value mismatch @ column 'R1', row 1: Expected: >2.14748365E9< Found: >2.1474836E9< It looks like doubles are formatted differently on that platform. I'll see if I can make the test more robust against formatting differences.
          Hide
          Knut Anders Hatlen added a comment -

          Committed revision 1448025.

          I've logged DERBY-6080 for the IN subquery anomaly. Resolving this issue.

          Show
          Knut Anders Hatlen added a comment - Committed revision 1448025. I've logged DERBY-6080 for the IN subquery anomaly. Resolving this issue.
          Hide
          Knut Anders Hatlen added a comment -

          Attaching d6017-2a-cast-if-needed.diff which solves the remaining IN
          list problems discussed in this issue (except the second anomaly
          mentioned in my 18/Dec/12 21:58 comment, as that's an IN subquery, not
          an IN list).

          The 2a patch replaces the fix in the 1a patch with a more general one
          that also addresses the problems with non-constants.

          The problems are caused by the comparison operations being performed
          using the semantics of the dominant type of the two values being
          compared. It should be performed using the dominant type of all the
          values in the IN predicate.

          To fix this, the patch makes InListOperatorNode.preprocess() cast the
          left operand to the dominant type if some of the comparisons otherwise
          would have been performed using a non-dominant type.

          The fix works because the left operand is part of every single
          comparison operation, and as long as one of the values being compared
          is of the dominant type, the comparison will be performed using the
          semantics of that type.

          The cast is not added if the left operand is already of the dominant
          type, in which case the cast would be redundant. Also, it is not added
          if all the values in the right operand (the IN list) are of the
          dominant type. In that case, too, every single comparison operation
          would include one value of the dominant type, and the correct
          semantics would be used.

          In all other cases, a cast is added to ensure that the comparisons are
          performed using the right type.

          All the regression tests ran cleanly with the patch.

          Description of the changes:

          • engine/org/apache/derby/impl/sql/compile/InListOperatorNode.java
          • Made preprocess() cast the left operand to the dominant type if
            necessary.
          • Removed the cast added by the 1a patch for the case where all values
            in the IN list represented the same constant value, as that code was
            made redundant by the new fix.
          • engine/org/apache/derby/iapi/types/DataType.java
          • Improved comment about how types are handled when evaluating an IN
            predicate.
          • engine/org/apache/derby/iapi/types/DataTypeDescriptor.java
          • engine/org/apache/derby/impl/sql/compile/ValueNodeList.java
          • Removed unnecessary throws clauses. Did this so that I didn't have
            to add another unnecessary throws clause to a helper method I added
            to InListOperatorNode.
          • testing/org/apache/derbyTesting/functionTests/tests/lang/InPredicateTest.java
          • Added test cases from previous comments on this issue. They work as
            expected with the patch.
          Show
          Knut Anders Hatlen added a comment - Attaching d6017-2a-cast-if-needed.diff which solves the remaining IN list problems discussed in this issue (except the second anomaly mentioned in my 18/Dec/12 21:58 comment, as that's an IN subquery, not an IN list). The 2a patch replaces the fix in the 1a patch with a more general one that also addresses the problems with non-constants. The problems are caused by the comparison operations being performed using the semantics of the dominant type of the two values being compared. It should be performed using the dominant type of all the values in the IN predicate. To fix this, the patch makes InListOperatorNode.preprocess() cast the left operand to the dominant type if some of the comparisons otherwise would have been performed using a non-dominant type. The fix works because the left operand is part of every single comparison operation, and as long as one of the values being compared is of the dominant type, the comparison will be performed using the semantics of that type. The cast is not added if the left operand is already of the dominant type, in which case the cast would be redundant. Also, it is not added if all the values in the right operand (the IN list) are of the dominant type. In that case, too, every single comparison operation would include one value of the dominant type, and the correct semantics would be used. In all other cases, a cast is added to ensure that the comparisons are performed using the right type. All the regression tests ran cleanly with the patch. Description of the changes: engine/org/apache/derby/impl/sql/compile/InListOperatorNode.java Made preprocess() cast the left operand to the dominant type if necessary. Removed the cast added by the 1a patch for the case where all values in the IN list represented the same constant value, as that code was made redundant by the new fix. engine/org/apache/derby/iapi/types/DataType.java Improved comment about how types are handled when evaluating an IN predicate. engine/org/apache/derby/iapi/types/DataTypeDescriptor.java engine/org/apache/derby/impl/sql/compile/ValueNodeList.java Removed unnecessary throws clauses. Did this so that I didn't have to add another unnecessary throws clause to a helper method I added to InListOperatorNode. testing/org/apache/derbyTesting/functionTests/tests/lang/InPredicateTest.java Added test cases from previous comments on this issue. They work as expected with the patch.
          Hide
          Knut Anders Hatlen added a comment -

          Committed the 1a patch to trunk, revision 1424889.

          Show
          Knut Anders Hatlen added a comment - Committed the 1a patch to trunk, revision 1424889.
          Hide
          Knut Anders Hatlen added a comment -

          For the problem fixed by the 1a patch, I didn't find any problematic combinations except BIGINT/DOUBLE. As far as I can see REAL/DOUBLE doesn't have the problem because the conversion from REAL to DOUBLE never loses precision. And mixing INT and REAL is not a problem either, as the problem only affects IN lists that consists entirely of literals, and Derby's grammar doesn't have any way to express REAL literals (all approximate number literals get the type DOUBLE). You can of course cast a DOUBLE literal to a REAL, but then it's no longer a ConstantNode, and the problematic optimization is not applied.

          However, mixing REAL and INT seems to have similar problems as DOUBLE and BIGINT when using non-constant expressions in the IN list. Here's one example:

          ij> select * from tt where r1 in (i1, r2);
          I1 |R1 |R2
          ---------------------------------------
          2147483645 |2.14748365E9 |2.14748365E9
          2147483645 |2.14748365E9 |0.0

          2 rows selected
          ij> select * from tt where r1 in (values i1, r2);
          I1 |R1 |R2
          ---------------------------------------
          2147483645 |2.14748365E9 |2.14748365E9

          1 row selected

          The two queries are supposed to be equivalent, but they return different results.

          I think the former query performs both of the comparisons (r1=i1, r1=r2) using REAL semantics, whereas the latter query uses DOUBLE semantics because the (values i1,r2) sub-query has DOUBLE as its row type.

          Show
          Knut Anders Hatlen added a comment - For the problem fixed by the 1a patch, I didn't find any problematic combinations except BIGINT/DOUBLE. As far as I can see REAL/DOUBLE doesn't have the problem because the conversion from REAL to DOUBLE never loses precision. And mixing INT and REAL is not a problem either, as the problem only affects IN lists that consists entirely of literals, and Derby's grammar doesn't have any way to express REAL literals (all approximate number literals get the type DOUBLE). You can of course cast a DOUBLE literal to a REAL, but then it's no longer a ConstantNode, and the problematic optimization is not applied. However, mixing REAL and INT seems to have similar problems as DOUBLE and BIGINT when using non-constant expressions in the IN list. Here's one example: ij> select * from tt where r1 in (i1, r2); I1 |R1 |R2 --------------------------------------- 2147483645 |2.14748365E9 |2.14748365E9 2147483645 |2.14748365E9 |0.0 2 rows selected ij> select * from tt where r1 in (values i1, r2); I1 |R1 |R2 --------------------------------------- 2147483645 |2.14748365E9 |2.14748365E9 1 row selected The two queries are supposed to be equivalent, but they return different results. I think the former query performs both of the comparisons (r1=i1, r1=r2) using REAL semantics, whereas the latter query uses DOUBLE semantics because the (values i1,r2) sub-query has DOUBLE as its row type.
          Hide
          Knut Anders Hatlen added a comment -

          Thanks for looking at the patch, Bryan.

          I'll look into the other data types and see if I find similar problems. If promoting a REAL to a DOUBLE could cause loss of precision (can it?), they might show the same problem. Promoting an INT to a REAL could lose precision too, but I think we always promote to DOUBLE when mixing INTs and REALs (per [1]). It's worth testing, though.

          [1] http://db.apache.org/derby/docs/10.9/ref/rrefsqlj27767.html

          Show
          Knut Anders Hatlen added a comment - Thanks for looking at the patch, Bryan. I'll look into the other data types and see if I find similar problems. If promoting a REAL to a DOUBLE could cause loss of precision (can it?), they might show the same problem. Promoting an INT to a REAL could lose precision too, but I think we always promote to DOUBLE when mixing INTs and REALs (per [1] ). It's worth testing, though. [1] http://db.apache.org/derby/docs/10.9/ref/rrefsqlj27767.html
          Hide
          Bryan Pendleton added a comment -

          +1 to the duplicates patch.

          I notice that you've mostly been experimenting with values that become
          approximate when converted from LONG to DOUBLE.

          Are there similar problems when mixing the 4-byte-floating-point REAL
          with the 8-byte-floating-point DOUBLE?

          Or is it generally integer-to-floating-point where the most problems arise?

          Show
          Bryan Pendleton added a comment - +1 to the duplicates patch. I notice that you've mostly been experimenting with values that become approximate when converted from LONG to DOUBLE. Are there similar problems when mixing the 4-byte-floating-point REAL with the 8-byte-floating-point DOUBLE? Or is it generally integer-to-floating-point where the most problems arise?
          Hide
          Knut Anders Hatlen added a comment -

          Attaching a patch (d6017-1a-duplicates.diff) for the problem we see when duplicates are eliminated from the IN list.

          The original code would detect that an IN list consisting of the values (9223372036854775805, 9223372036854775806, 9223372036854775807, 9.223372036854776E18) only contained one distinct value after conversion to the dominant type. It would arbitrarily pick the first value and replace the IN predicate with leftOperand =
          9223372036854775805.

          The patch changes this optimization so that it casts the right operand in the comparison to the dominant type if neither of the two operands already is of the dominant type.

          It also adds a test case to verify that the expected results are produced.

          All the regression tests ran cleanly with the patch.

          The patch doesn't address any of the other problems discussed in this issue.

          Show
          Knut Anders Hatlen added a comment - Attaching a patch (d6017-1a-duplicates.diff) for the problem we see when duplicates are eliminated from the IN list. The original code would detect that an IN list consisting of the values (9223372036854775805, 9223372036854775806, 9223372036854775807, 9.223372036854776E18) only contained one distinct value after conversion to the dominant type. It would arbitrarily pick the first value and replace the IN predicate with leftOperand = 9223372036854775805. The patch changes this optimization so that it casts the right operand in the comparison to the dominant type if neither of the two operands already is of the dominant type. It also adds a test case to verify that the expected results are produced. All the regression tests ran cleanly with the patch. The patch doesn't address any of the other problems discussed in this issue.
          Hide
          Knut Anders Hatlen added a comment -

          I changed the title of the bug since the problem is not restricted to constants.

          Here's another example that we may add to our tests once the bug is fixed:

          ij> create table t3(b1 bigint, b2 bigint, d double);
          0 rows inserted/updated/deleted
          ij> insert into t3 values (9223372036854775805, 9223372036854775806, 1);
          1 row inserted/updated/deleted
          ij> select * from t3 where b1 in (b2, d);
          B1 |B2 |D
          ----------------------------------------------------------------

          0 rows selected
          ij> select * from t3 where b1 in (values b2, d);
          B1 |B2 |D
          ----------------------------------------------------------------
          9223372036854775805 |9223372036854775806 |1.0

          1 row selected

          The two select queries return different results, although the standard says they should be equivalent.

          Show
          Knut Anders Hatlen added a comment - I changed the title of the bug since the problem is not restricted to constants. Here's another example that we may add to our tests once the bug is fixed: ij> create table t3(b1 bigint, b2 bigint, d double); 0 rows inserted/updated/deleted ij> insert into t3 values (9223372036854775805, 9223372036854775806, 1); 1 row inserted/updated/deleted ij> select * from t3 where b1 in (b2, d); B1 |B2 |D ---------------------------------------------------------------- 0 rows selected ij> select * from t3 where b1 in (values b2, d); B1 |B2 |D ---------------------------------------------------------------- 9223372036854775805 |9223372036854775806 |1.0 1 row selected The two select queries return different results, although the standard says they should be equivalent.
          Hide
          Knut Anders Hatlen added a comment -

          Thanks, Bryan.

          > Am I right in conceptualizing this as "we need to be implicitly casting the values to
          > the correct type in certain situations, and we're currently not doing so."?

          For the values in the IN list, I think that's the right way to conceptualize it.

          For the simple comparison operations (for example a predicate such as 9223372036854775805 = 9.223372036854776E18) I'm not so sure. But that's the part of the problem that I suggested we didn't focus on in this issue.

          I think it's a good idea to make our docs say that equality comparisons involving floating point values may have surprising results.

          Show
          Knut Anders Hatlen added a comment - Thanks, Bryan. > Am I right in conceptualizing this as "we need to be implicitly casting the values to > the correct type in certain situations, and we're currently not doing so."? For the values in the IN list, I think that's the right way to conceptualize it. For the simple comparison operations (for example a predicate such as 9223372036854775805 = 9.223372036854776E18) I'm not so sure. But that's the part of the problem that I suggested we didn't focus on in this issue. I think it's a good idea to make our docs say that equality comparisons involving floating point values may have surprising results.
          Hide
          Bryan Pendleton added a comment -

          A little bit of quick searching confirms my suspicion that most databases have this problem,
          and that comparison of floating point values can lead to confusing results:

          http://stackoverflow.com/questions/2567434/mysql-floating-point-comparison-issues

          http://yongjun-jiao.blogspot.com/2011/11/floating-point-number-equality.html

          https://kb.askmonty.org/en/numeric-operations/

          It might be worth, as a related issue or sub-task, considering ways to improve our
          documentation in this area so that we guide users toward safer ways of comparing
          floating point values, e.g., some form of:

          WHERE ABS(float_column - other_value) < small_epsilon

          with an appropriate suggestion for what small_epsilon should be.

          Show
          Bryan Pendleton added a comment - A little bit of quick searching confirms my suspicion that most databases have this problem, and that comparison of floating point values can lead to confusing results: http://stackoverflow.com/questions/2567434/mysql-floating-point-comparison-issues http://yongjun-jiao.blogspot.com/2011/11/floating-point-number-equality.html https://kb.askmonty.org/en/numeric-operations/ It might be worth, as a related issue or sub-task, considering ways to improve our documentation in this area so that we guide users toward safer ways of comparing floating point values, e.g., some form of: WHERE ABS(float_column - other_value) < small_epsilon with an appropriate suggestion for what small_epsilon should be.
          Hide
          Bryan Pendleton added a comment -

          Thank you Knut Anders for digging deep into the standard to explore these topics.

          I see no holes in your logic; it seems unavoidable that (a) these queries are intended to be
          clearly-defined by the standard, and (b) Derby is doing it wrong. I can wish that the
          standard were written differently, but it ain't so...

          I think your proposed approach is excellent.

          Hopefully the fact that we currently seem to behave correctly in the ANY queries and
          in the table value constructor (VALUES ... ) give some clues about what's needed to
          be included in the other queries to give them the right form.

          Am I right in conceptualizing this as "we need to be implicitly casting the values to
          the correct type in certain situations, and we're currently not doing so."?

          Show
          Bryan Pendleton added a comment - Thank you Knut Anders for digging deep into the standard to explore these topics. I see no holes in your logic; it seems unavoidable that (a) these queries are intended to be clearly-defined by the standard, and (b) Derby is doing it wrong. I can wish that the standard were written differently, but it ain't so... I think your proposed approach is excellent. Hopefully the fact that we currently seem to behave correctly in the ANY queries and in the table value constructor (VALUES ... ) give some clues about what's needed to be included in the other queries to give them the right form. Am I right in conceptualizing this as "we need to be implicitly casting the values to the correct type in certain situations, and we're currently not doing so."?
          Hide
          Knut Anders Hatlen added a comment -

          I've tried to interpret what the standard says. Here are the relevant parts I've found:

          > 8.4 <in predicate> - Syntax Rules
          >
          > 2) Let IVL be an <in value list>.
          > ( IVL )
          > is equivalent to the <table value constructor>:
          > ( VALUES IVL )

          So, according to this rule, the following two queries should be equivalent (which they are not currently):

          ij> select * from t where x in (9223372036854775805, 9223372036854775806, 9223372036854775807, 9.223372036854776E18);
          X
          --------------------
          9223372036854775805

          1 row selected
          ij> select * from t where x in (values 9223372036854775805, 9223372036854775806, 9223372036854775807, 9.223372036854776E18);
          X
          --------------------
          9223372036854775805
          9223372036854775806
          9223372036854775807

          3 rows selected

          Furthermore, it says:

          > 8.4 <in predicate> - Syntax Rules
          >
          > 5) The expression
          > RVC IN IPV
          > is equivalent to
          > RVC = ANY IPV

          So to find the correct semantics for IN, we need to rewrite the query to ANY. That is,

          select * from t where x = any (values 9223372036854775805, 9223372036854775806, 9223372036854775807, 9.223372036854776E18);

          and see what the standard says about that. (This particular ANY query returns three rows in Derby, which is the same as the IN (VALUES ...) query above.)

          This leads us to:

          > 8.8 <quantified comparison predicate> - Syntax Rules
          >
          > 1) Let RV1 and RV2 be <row value predicand>s whose declared types are respectively that of the <row value
          > predicand> and the row type of the <table subquery>. The Syntax Rules of Subclause 8.2, “<comparison
          > predicate>”, are applied to:
          > RV1 <comp op> RV2

          That is, for the comparisons, the value on the right hand side should have the row type of the sub-query.

          And the row type of our VALUES sub-query is DOUBLE (or at least some approximate numeric type) as 7.3 <table value constructor> says row type is determined by applying Subclause 9.3, “Data types of results of aggregations”, whose syntax rule 3d says:

          > If any data type in DTS is approximate numeric, then each data type in DTS shall be numeric and the
          > result data type is approximate numeric with implementation-defined precision.

          Derby does produce the right type for the <table value constructor>:

          ij> values 9223372036854775805, 9223372036854775806, 9223372036854775807, 9.223372036854776E18;
          1
          ----------------------
          9.223372036854776E18
          9.223372036854776E18
          9.223372036854776E18
          9.223372036854776E18

          4 rows selected

          The ANY query should therefore end up like:

          select * from t where x = 9.223372036854776E18 or x = 9.223372036854776E18 or x = 9.223372036854776E18 or x = 9.223372036854776E18;

          Or even simpler, because the DOUBLE representation of all four values happens to be the same:

          select * from t where x = 9.223372036854776E18;

          Now, 8.2 <comparison predicate> - General Rules, says this:

          > 2) Numbers are compared with respect to their algebraic value.

          No more details than that, I'm afraid. And no mentioning about converting the operands to the dominant type, so far as I can see.

          Derby currently returns these three rows for the query:

          ij> select * from t where x = 9.223372036854776E18;
          X
          --------------------
          9223372036854775805
          9223372036854775806
          9223372036854775807

          3 rows selected

          I'm not completely convinced that all those three values have the same algebraic value as 9.223372036854776E18. But in any case I think changing how Derby performs numeric comparisons is outside the scope of this issue.

          So how's this for a plan? In this issue, let's assume Derby's equality comparison operator does the right thing. The goal for now should be to make an <in value list> behave the same way as the ANY query the SQL standard says it should be equivalent to. We should have tests that use the results from the equivalent ANY queries as canons, and those tests would also alert us if we later make changes to the comparison operator in a way that makes ANY and IN behave inconsistently.

          Show
          Knut Anders Hatlen added a comment - I've tried to interpret what the standard says. Here are the relevant parts I've found: > 8.4 <in predicate> - Syntax Rules > > 2) Let IVL be an <in value list>. > ( IVL ) > is equivalent to the <table value constructor>: > ( VALUES IVL ) So, according to this rule, the following two queries should be equivalent (which they are not currently): ij> select * from t where x in (9223372036854775805, 9223372036854775806, 9223372036854775807, 9.223372036854776E18); X -------------------- 9223372036854775805 1 row selected ij> select * from t where x in (values 9223372036854775805, 9223372036854775806, 9223372036854775807, 9.223372036854776E18); X -------------------- 9223372036854775805 9223372036854775806 9223372036854775807 3 rows selected Furthermore, it says: > 8.4 <in predicate> - Syntax Rules > > 5) The expression > RVC IN IPV > is equivalent to > RVC = ANY IPV So to find the correct semantics for IN, we need to rewrite the query to ANY. That is, select * from t where x = any (values 9223372036854775805, 9223372036854775806, 9223372036854775807, 9.223372036854776E18); and see what the standard says about that. (This particular ANY query returns three rows in Derby, which is the same as the IN (VALUES ...) query above.) This leads us to: > 8.8 <quantified comparison predicate> - Syntax Rules > > 1) Let RV1 and RV2 be <row value predicand>s whose declared types are respectively that of the <row value > predicand> and the row type of the <table subquery>. The Syntax Rules of Subclause 8.2, “<comparison > predicate>”, are applied to: > RV1 <comp op> RV2 That is, for the comparisons, the value on the right hand side should have the row type of the sub-query. And the row type of our VALUES sub-query is DOUBLE (or at least some approximate numeric type) as 7.3 <table value constructor> says row type is determined by applying Subclause 9.3, “Data types of results of aggregations”, whose syntax rule 3d says: > If any data type in DTS is approximate numeric, then each data type in DTS shall be numeric and the > result data type is approximate numeric with implementation-defined precision. Derby does produce the right type for the <table value constructor>: ij> values 9223372036854775805, 9223372036854775806, 9223372036854775807, 9.223372036854776E18; 1 ---------------------- 9.223372036854776E18 9.223372036854776E18 9.223372036854776E18 9.223372036854776E18 4 rows selected The ANY query should therefore end up like: select * from t where x = 9.223372036854776E18 or x = 9.223372036854776E18 or x = 9.223372036854776E18 or x = 9.223372036854776E18; Or even simpler, because the DOUBLE representation of all four values happens to be the same: select * from t where x = 9.223372036854776E18; Now, 8.2 <comparison predicate> - General Rules, says this: > 2) Numbers are compared with respect to their algebraic value. No more details than that, I'm afraid. And no mentioning about converting the operands to the dominant type, so far as I can see. Derby currently returns these three rows for the query: ij> select * from t where x = 9.223372036854776E18; X -------------------- 9223372036854775805 9223372036854775806 9223372036854775807 3 rows selected I'm not completely convinced that all those three values have the same algebraic value as 9.223372036854776E18. But in any case I think changing how Derby performs numeric comparisons is outside the scope of this issue. So how's this for a plan? In this issue, let's assume Derby's equality comparison operator does the right thing. The goal for now should be to make an <in value list> behave the same way as the ANY query the SQL standard says it should be equivalent to. We should have tests that use the results from the equivalent ANY queries as canons, and those tests would also alert us if we later make changes to the comparison operator in a way that makes ANY and IN behave inconsistently.
          Hide
          Knut Anders Hatlen added a comment -

          I think the query is allowed by the SQL standard. The first paragraph in SQL:2003, part 2, section 4.4.1 (Introduction to numbers) says: "A number is either an exact numeric value or an approximate numeric value. Any two numbers are comparable." So no such luck.

          We probably need to study the standard more closely to find out what the exact semantics are, though. In particular this: Is it the sorting or the binary search that uses the right kind of comparison. If it's the sorting (which uses the same kind of comparison for all the values, based on the dominant type), I suspect the problem also affects IN lists that don't have constants. For example:

          ij> create table t3(b1 bigint, b2 bigint, d double);
          0 rows inserted/updated/deleted
          ij> insert into t3 values (9223372036854775805, 9223372036854775806, 1);
          1 row inserted/updated/deleted
          ij> select * from t3 where b1 in (b2, d);
          B1 |B2 |D
          ----------------------------------------------------------------

          0 rows selected

          If it is correct that the dominant type should be used, I would have expected the above query to return one row, as there is a DOUBLE value in the IN list, and b1=b2 when they are converted to DOUBLE.

          Another puzzling result that doesn't involve constants, is this:

          ij> create table t4 (b bigint);
          0 rows inserted/updated/deleted
          ij> insert into t4 values 9223372036854775806, 9223372036854775807;
          2 rows inserted/updated/deleted
          ij> create table t5 (d double);
          0 rows inserted/updated/deleted
          ij> insert into t5 values 9.223372036854776E18;
          1 row inserted/updated/deleted
          ij> select * from t4 where b in (select d from t5);
          B
          --------------------
          9223372036854775807

          1 row selected
          ij> select * from t4 where b in (select cast(d as double) from t5);
          B
          --------------------
          9223372036854775806
          9223372036854775807

          2 rows selected

          Is it correct that the two queries should return different results? The only difference is that the first query accesses the D column with no cast, and the second one casts D to DOUBLE. But since D already is a DOUBLE column, I wouldn't expect the cast to make any difference.

          Show
          Knut Anders Hatlen added a comment - I think the query is allowed by the SQL standard. The first paragraph in SQL:2003, part 2, section 4.4.1 (Introduction to numbers) says: "A number is either an exact numeric value or an approximate numeric value. Any two numbers are comparable." So no such luck. We probably need to study the standard more closely to find out what the exact semantics are, though. In particular this: Is it the sorting or the binary search that uses the right kind of comparison. If it's the sorting (which uses the same kind of comparison for all the values, based on the dominant type), I suspect the problem also affects IN lists that don't have constants. For example: ij> create table t3(b1 bigint, b2 bigint, d double); 0 rows inserted/updated/deleted ij> insert into t3 values (9223372036854775805, 9223372036854775806, 1); 1 row inserted/updated/deleted ij> select * from t3 where b1 in (b2, d); B1 |B2 |D ---------------------------------------------------------------- 0 rows selected If it is correct that the dominant type should be used, I would have expected the above query to return one row, as there is a DOUBLE value in the IN list, and b1=b2 when they are converted to DOUBLE. Another puzzling result that doesn't involve constants, is this: ij> create table t4 (b bigint); 0 rows inserted/updated/deleted ij> insert into t4 values 9223372036854775806, 9223372036854775807; 2 rows inserted/updated/deleted ij> create table t5 (d double); 0 rows inserted/updated/deleted ij> insert into t5 values 9.223372036854776E18; 1 row inserted/updated/deleted ij> select * from t4 where b in (select d from t5); B -------------------- 9223372036854775807 1 row selected ij> select * from t4 where b in (select cast(d as double) from t5); B -------------------- 9223372036854775806 9223372036854775807 2 rows selected Is it correct that the two queries should return different results? The only difference is that the first query accesses the D column with no cast, and the second one casts D to DOUBLE. But since D already is a DOUBLE column, I wouldn't expect the cast to make any difference.
          Hide
          Bryan Pendleton added a comment -

          A very clear description, and a very clear analysis; thank you very much!

          The fact that floating point comparisons are approximate has always caught me up;
          I wish that SQL had made it illegal to perform an exact comparison ( "=", "IN", etc.)
          on a floating point type.

          Then we could have just declared this query illegal, and forced the user to think
          more clearly about what computation they were trying to express.

          Is there any hope for such a resolution in the SQL standard?

          Show
          Bryan Pendleton added a comment - A very clear description, and a very clear analysis; thank you very much! The fact that floating point comparisons are approximate has always caught me up; I wish that SQL had made it illegal to perform an exact comparison ( "=", "IN", etc.) on a floating point type. Then we could have just declared this query illegal, and forced the user to think more clearly about what computation they were trying to express. Is there any hope for such a resolution in the SQL standard?
          Hide
          Knut Anders Hatlen added a comment -

          I believe this happens because of optimizations that are performed if the IN list consists of constants only.

          Such IN lists are sorted at compile time so that binary search can be used to find if there's a match at run time. That's all good. However, the sorting and the binary search use different ordering. The sorting (in ValueNodeList.sortInAscendingOrder()) uses the ordering of the type with the highest precedence of the target and all the operands. The binary search (in DataType.in()) uses the ordering of the type with the highest precedence of each pair of values that it compares.

          In the query above, this means:

          The sorting happens using the type with the highest precedence of all the values. That is, DOUBLE. All the four values in the IN list have the same DOUBLE value, so the list is already sorted, regardless of how we order the actual values. But when binary search is performed at run time, BIGINT semantics are used for some of the comparisons (those that involve BIGINTs only) and DOUBLE comparison for others (those that involve a DOUBLE value). So the binary search does not see the list as one that contain values that are all equal.

          Additionally, during preprocessing, there is code to simplify the predicate if it's an IN list where all values are equal. This check also uses the dominant type, DOUBLE, and finds that the list indeed contains only one distinct value. It therefore eliminates the IN list and replaces it with a simple equality check using just one of the values in the IN list. That is, it rewrites the query from

          select * from t where x in (9223372036854775805, 9223372036854775806, 9223372036854775807, 9.223372036854776E18)

          to

          select * from t where x = 9223372036854775805

          Those two queries are equivalent if the equality operator uses DOUBLE semantics. Unfortunately, the information about what's the dominant type is lost when the IN list is eliminated, and the equality check is performed using BIGINT semantics instead. The result is that only a single row matches.

          So I think there are two things that need to be fixed:

          1) The sorting and the binary search must be made consistent.

          2) The duplicate elimination must preserve type information.

          Show
          Knut Anders Hatlen added a comment - I believe this happens because of optimizations that are performed if the IN list consists of constants only. Such IN lists are sorted at compile time so that binary search can be used to find if there's a match at run time. That's all good. However, the sorting and the binary search use different ordering. The sorting (in ValueNodeList.sortInAscendingOrder()) uses the ordering of the type with the highest precedence of the target and all the operands. The binary search (in DataType.in()) uses the ordering of the type with the highest precedence of each pair of values that it compares. In the query above, this means: The sorting happens using the type with the highest precedence of all the values. That is, DOUBLE. All the four values in the IN list have the same DOUBLE value, so the list is already sorted, regardless of how we order the actual values. But when binary search is performed at run time, BIGINT semantics are used for some of the comparisons (those that involve BIGINTs only) and DOUBLE comparison for others (those that involve a DOUBLE value). So the binary search does not see the list as one that contain values that are all equal. Additionally, during preprocessing, there is code to simplify the predicate if it's an IN list where all values are equal. This check also uses the dominant type, DOUBLE, and finds that the list indeed contains only one distinct value. It therefore eliminates the IN list and replaces it with a simple equality check using just one of the values in the IN list. That is, it rewrites the query from select * from t where x in (9223372036854775805, 9223372036854775806, 9223372036854775807, 9.223372036854776E18) to select * from t where x = 9223372036854775805 Those two queries are equivalent if the equality operator uses DOUBLE semantics. Unfortunately, the information about what's the dominant type is lost when the IN list is eliminated, and the equality check is performed using BIGINT semantics instead. The result is that only a single row matches. So I think there are two things that need to be fixed: 1) The sorting and the binary search must be made consistent. 2) The duplicate elimination must preserve type information.

            People

            • Assignee:
              Knut Anders Hatlen
              Reporter:
              Knut Anders Hatlen
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development