Issue Details (XML | Word | Printable)

Key: DERBY-3373
Type: Bug Bug
Status: Resolved Resolved
Resolution: Fixed
Priority: Blocker Blocker
Assignee: Bryan Pendleton
Reporter: Thomas Vatter
Votes: 0
Watchers: 1
Operations

If you were logged in you would be able to see more operations.
Derby

SQL "distinct" and "order by" needed together

Created: 01/Feb/08 11:04 AM   Updated: 30/Jun/09 03:55 PM
Return to search
Component/s: SQL
Affects Version/s: 10.3.2.1
Fix Version/s: 10.3.3.0, 10.4.1.3, 10.5.1.1

Time Tracking:
Not Specified

File Attachments:
  Size
File Licensed for inclusion in ASF works allowExpressions.diff 2008-03-08 08:38 PM Bryan Pendleton 6 kB
File Licensed for inclusion in ASF works mergeWith2351.diff 2008-03-13 03:28 AM Bryan Pendleton 6 kB
Environment: Solaris Dev Express, Java 5
Issue Links:
Reference
 

Urgency: Blocker
Bug behavior facts: Regression
Resolution Date: 22/Mar/08 01:39 AM


 Description  « Hide
I am pasting here the communication from the mailinglist. I am having a blocking and large problem with it because I have to make a release that needs the specified SQL query.


tom_ wrote:

> The errormessage is
>
> The ORDER BY clause may not specify an expression, since the query specifies
> DISTINCT
> [Error Code: 20000]
> [SQL State: 4287A]
>
> The statement is
>
> select distinct
> t1.t1_id, t2.t2value1, t2.t2value2, t2.t2value3
> from
> t1, t2, t3
> where
> ...
> order by lower(t2.t2value2) , lower(t2.t2value1) , lower(t2.t2value3)
>
>
>
>
> Dyre.Tjeldvoll wrote:
>
>> tom_ <tom12167@...> writes:
>>
>>
>>> I am using "disctinct" because of some self-joins and also needed to add
>>> an
>>> "order by" clause. An error is shown. Is it not possible to use distinct
>>> and
>>> order by together?
>>>
>> I think it is allowed. Executing
>>
>> select distinct * from sys.systables order by tablename;
>>
>> in ij works just fine. Could you show the error message you get, and
>> perhaps what the table looks like?
>>
>> --
>> dt
>>
>>
>>
« [hide part of quote]

Hi Tom -

I see what you mean using the demo DB toursDB:

ij> select * from airlines order by lower(airline_full);
A&|AIRLINE_FULL |BASIC_RATE |DISTANCE_DISCOUNT
|BUSINESS_LEVEL_FACTOR
|FIRSTCLASS_LEVEL_FACT&|ECONOMY_SE&|BUSINESS_S&|FIRSTCLASS&
-----------------------------------------------------------------------------------------------------------------------------------------------------------
AA|Amazonian Airways |0.18 |0.03
|0.5 |1.5 |20 |10 |5
US|Union Standard Airlines |0.19 |0.05
|0.4 |1.6 |20 |10 |5

2 rows selected
ij> select distinct * from airlines order by lower(airline_full);
ERROR 4287A: The ORDER BY clause may not specify an expression, since
the query specifies DISTINCT.
ij> select distinct airline_full from airlines order by lower(airline_full);
ERROR 4287A: The ORDER BY clause may not specify an expression, since
the query specifies DISTINCT.
ij>

I didn't find a JIRA enhancement to remove this restriction. I suggest
you file an Enhancement request to remove the restriction reported by
ERROR 4287A.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Dyre Tjeldvoll added a comment - 06/Mar/08 03:13 PM
I'm marking this as a regression since it worked before the following fix was checked in:
 
------------------------------------------------------------------------
r555096 | bpendleton | 2007-07-11 00:06:02 +0200 (Wed, 11 Jul 2007) | 19 lines

DERBY-2351: Certain ORDER BY clauses should be rejected as invalid

This change modifies the ORDER BY clause so that it rejects certain
queries as invalid: specifically, queries which:
a) specify the set quantifier DISTINCT,
b) and also contain an ORDER BY clause which refers to a column
or expression which is not in the query result.

The problem with such queries is that we are told to return only
a single instance of the DISTINCT columns, but since the ORDER BY
clause refers to columns which are not in the DISTINCT set, if there
should be multiple candidate rows from which we choose the DISTINCT
result, we don't know which of those rows to use for the ORDER BY
processing.

When the DISTINCT and ORDER BY clauses are in conflict, Derby should
reject the query. This change modifies Derby to do so.


------------------------------------------------------------------------

Bryan Pendleton added a comment - 06/Mar/08 04:45 PM
I agree, this was definitely caused by the DERBY-2351 changes.

A possible workaround would be to include the order by expressions
into the select list. That is, modify the query from:

  select distinct airline_full from airlines order by lower(airline_full);

to

  select distinct airline_full, lower(airline_full) from airlines order by lower(airline_full);



Dyre Tjeldvoll added a comment - 06/Mar/08 08:25 PM
Changing the issue type to bug, since it is a regression

Bryan Pendleton added a comment - 08/Mar/08 02:41 PM
I think that these lines in OrderByColumn, which were added by DERBY-2351,
are the focus of this problem:

                        if (addedColumnOffset >= 0 &&
                                        target instanceof SelectNode &&
                                        ( (SelectNode)target ).hasDistinct())
                                throw StandardException.newException(SQLState.LANG_DISTINCT_ORDER_BY_EXPRESSION);

The problem is that these lines should have an additional
condition, along the lines of:

  AND, there exists at least 1 column reference among the
  column references in this ORDER BY expression, which
  does not appear as one of the columns which are being
  selected DISTINCT.

The problem is, I'm not sure how to code that AND test I'll have
to study the SelectNode data structure to see if I can easily see
how to tell "the columns which are being selected DISTINCT".

Bryan Pendleton added a comment - 08/Mar/08 08:38 PM
Attached is 'allowExpressions.diff', a patch proposal.

This proposal backs out the line of code added by DERBY-2351
which was rejecting DISTINCT queries with ORDER BY expressions
which did not appear explicitly in the SELECT list.

With this patch, Derby once again allows a range of legitimate
queries, such as:
   select distinct name from person order by lower(name)
   select distinct * from person order by age + 10

Unfortunately, with this patch, Derby now also allows the invalid query:
  select distinct name from person order by age*2

I haven't been able to quickly find an easy way to distinguish the
valid queries from the invalid ones, and so as a short term measure
I think it would be good to restore the previous (10.2 and prior)
behavior for DISTINCT queries involving ORDER BY expressions.
It was good that the DERBY-2351 patch rejected some invalid
queries, but it was far worse that the patch rejected some valid ones.

I added a number of new test cases to illustrate the behavior.

Please have a look at the patch and let us know what you think!

Thomas Nielsen added a comment - 10/Mar/08 10:11 AM
I agree it would probably be good to allow the legal queries once again, even if the expense is allowing an illegal one.
That said we should ultimately reject the bad illegal ones.

If that's our understanding, the patch looks ok. The changed test run successfully.

If I understand correctly what you're looking for in the removed if statement is something like "if none of the columns in the orderby column list are in the target.rcl, and the target is a distinct query, then throw exception" ?

Didn't the removed if statement really said the inverse?
"if there is a column in the orderby columns that isn't in the target.rcl, and target is a distinct, then throw exception"

Bryan Pendleton added a comment - 13/Mar/08 03:28 AM
Attached is 'mergeWith2351.diff', an update of the previous patch proposal
which resolves the file conflicts with the DERBY-2351 patch.

Bryan Pendleton added a comment - 14/Mar/08 03:56 AM
Hi Thomas, thanks for looking at the patch. I'm sorry I didn't get back to you sooner.

The "if" statement that I wish I could write is something like:

   if the query specified DISTINCT, and if this ORDER BY expression
   references any column which is not one of the DISTINCT
   columns, reject the query, for there may be multiple possible values of
   that column and we don't know which one to use for ordering the results.

Here's a great explanation of the problem:
https://issues.apache.org/jira/browse/DERBY-2351?focusedCommentId=12473871#action_12473871
The only difference in this case is that we're concerned with ORDER BY *expressions*,
not simple column references, and so an example query might be:

  SELECT DISTINCT name FROM person ORDER BY age * 2

and we wouldn't know whether to order the results as:
  John (age 10*2), then Mary (age 20*2)
or
  Mary (age 20*2), then John (age 30*2)

What Derby *actually* does, with this patch in place, is to implicitly include (age*2)
into the DISTINCT list, so it sees *both* John records and produces the results:
  John (age 10*2), Mary (age 20*2), John (age 30*2)

That is, even though the user specified "DISTINCT name", Derby produces two
rows with the same name.

Bryan Pendleton added a comment - 16/Mar/08 04:00 AM
Committed the change to the trunk as revision 637529.
I'll investigate merging this change back to the 10.4 branch.

Dyre Tjeldvoll added a comment - 19/Mar/08 02:19 PM
I added 10.4 as a fix version. 10.3.2.2 is already listed, so if possible, I think it would be good to merge the fix to 10.3 as well.

Dyre Tjeldvoll added a comment - 19/Mar/08 03:47 PM
Sorry. I changed the wrong issue. Changing the assignee back to Bryan.

Bryan Pendleton added a comment - 19/Mar/08 08:57 PM
Merged the trunk change to the 10.4 branch and committed as revision 639013.

Bryan Pendleton added a comment - 22/Mar/08 01:39 AM
Committed the patch to the 10.3 branch as revision 639935.