I've tried to interpret what the standard says. Here are the relevant parts I've found:
> 8.4 <in predicate> - Syntax Rules
>
> 2) Let IVL be an <in value list>.
> ( IVL )
> is equivalent to the <table value constructor>:
> ( VALUES IVL )
So, according to this rule, the following two queries should be equivalent (which they are not currently):
ij> select * from t where x in (9223372036854775805, 9223372036854775806, 9223372036854775807, 9.223372036854776E18);
X
--------------------
9223372036854775805
1 row selected
ij> select * from t where x in (values 9223372036854775805, 9223372036854775806, 9223372036854775807, 9.223372036854776E18);
X
--------------------
9223372036854775805
9223372036854775806
9223372036854775807
3 rows selected
Furthermore, it says:
> 8.4 <in predicate> - Syntax Rules
>
> 5) The expression
> RVC IN IPV
> is equivalent to
> RVC = ANY IPV
So to find the correct semantics for IN, we need to rewrite the query to ANY. That is,
select * from t where x = any (values 9223372036854775805, 9223372036854775806, 9223372036854775807, 9.223372036854776E18);
and see what the standard says about that. (This particular ANY query returns three rows in Derby, which is the same as the IN (VALUES ...) query above.)
This leads us to:
> 8.8 <quantified comparison predicate> - Syntax Rules
>
> 1) Let RV1 and RV2 be <row value predicand>s whose declared types are respectively that of the <row value
> predicand> and the row type of the <table subquery>. The Syntax Rules of Subclause 8.2, “<comparison
> predicate>”, are applied to:
> RV1 <comp op> RV2
That is, for the comparisons, the value on the right hand side should have the row type of the sub-query.
And the row type of our VALUES sub-query is DOUBLE (or at least some approximate numeric type) as 7.3 <table value constructor> says row type is determined by applying Subclause 9.3, “Data types of results of aggregations”, whose syntax rule 3d says:
> If any data type in DTS is approximate numeric, then each data type in DTS shall be numeric and the
> result data type is approximate numeric with implementation-defined precision.
Derby does produce the right type for the <table value constructor>:
ij> values 9223372036854775805, 9223372036854775806, 9223372036854775807, 9.223372036854776E18;
1
----------------------
9.223372036854776E18
9.223372036854776E18
9.223372036854776E18
9.223372036854776E18
4 rows selected
The ANY query should therefore end up like:
select * from t where x = 9.223372036854776E18 or x = 9.223372036854776E18 or x = 9.223372036854776E18 or x = 9.223372036854776E18;
Or even simpler, because the DOUBLE representation of all four values happens to be the same:
select * from t where x = 9.223372036854776E18;
Now, 8.2 <comparison predicate> - General Rules, says this:
> 2) Numbers are compared with respect to their algebraic value.
No more details than that, I'm afraid. And no mentioning about converting the operands to the dominant type, so far as I can see.
Derby currently returns these three rows for the query:
ij> select * from t where x = 9.223372036854776E18;
X
--------------------
9223372036854775805
9223372036854775806
9223372036854775807
3 rows selected
I'm not completely convinced that all those three values have the same algebraic value as 9.223372036854776E18. But in any case I think changing how Derby performs numeric comparisons is outside the scope of this issue.
So how's this for a plan? In this issue, let's assume Derby's equality comparison operator does the right thing. The goal for now should be to make an <in value list> behave the same way as the ANY query the SQL standard says it should be equivalent to. We should have tests that use the results from the equivalent ANY queries as canons, and those tests would also alert us if we later make changes to the comparison operator in a way that makes ANY and IN behave inconsistently.
I believe this happens because of optimizations that are performed if the IN list consists of constants only.
Such IN lists are sorted at compile time so that binary search can be used to find if there's a match at run time. That's all good. However, the sorting and the binary search use different ordering. The sorting (in ValueNodeList.sortInAscendingOrder()) uses the ordering of the type with the highest precedence of the target and all the operands. The binary search (in DataType.in()) uses the ordering of the type with the highest precedence of each pair of values that it compares.
In the query above, this means:
The sorting happens using the type with the highest precedence of all the values. That is, DOUBLE. All the four values in the IN list have the same DOUBLE value, so the list is already sorted, regardless of how we order the actual values. But when binary search is performed at run time, BIGINT semantics are used for some of the comparisons (those that involve BIGINTs only) and DOUBLE comparison for others (those that involve a DOUBLE value). So the binary search does not see the list as one that contain values that are all equal.
Additionally, during preprocessing, there is code to simplify the predicate if it's an IN list where all values are equal. This check also uses the dominant type, DOUBLE, and finds that the list indeed contains only one distinct value. It therefore eliminates the IN list and replaces it with a simple equality check using just one of the values in the IN list. That is, it rewrites the query from
select * from t where x in (9223372036854775805, 9223372036854775806, 9223372036854775807, 9.223372036854776E18)
to
select * from t where x = 9223372036854775805
Those two queries are equivalent if the equality operator uses DOUBLE semantics. Unfortunately, the information about what's the dominant type is lost when the IN list is eliminated, and the equality check is performed using BIGINT semantics instead. The result is that only a single row matches.
So I think there are two things that need to be fixed:
1) The sorting and the binary search must be made consistent.
2) The duplicate elimination must preserve type information.