Issue Details (XML | Word | Printable)

Key: DERBY-2351
Type: Bug Bug
Status: Resolved Resolved
Resolution: Fixed
Priority: Major Major
Assignee: Bryan Pendleton
Reporter: Yip Ng
Votes: 1
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Derby

ORDER BY with expression with distinct in the select list returns incorrect result

Created: 17/Feb/07 12:36 AM   Updated: 30/Jun/09 04:12 PM
Return to search
Component/s: SQL
Affects Version/s: 10.2.1.6, 10.2.2.0, 10.3.1.4
Fix Version/s: 10.3.3.0, 10.4.1.3, 10.5.1.1

Time Tracking:
Not Specified

File Attachments:
  Size
File Licensed for inclusion in ASF works d2351_aliasing.diff 2008-03-04 04:34 AM Bryan Pendleton 9 kB
File Licensed for inclusion in ASF works d2351_aliasing.diff 2008-03-02 10:42 PM Bryan Pendleton 7 kB
File Licensed for inclusion in ASF works d2351_aliasing_checkQualifiedName.diff 2008-03-06 04:29 AM Bryan Pendleton 11 kB
File Licensed for inclusion in ASF works derby_2351.diff 2007-06-17 03:30 AM Bryan Pendleton 9 kB
File Licensed for inclusion in ASF works derby_2351_v2.diff 2007-06-17 07:42 PM Bryan Pendleton 10 kB
File Licensed for inclusion in ASF works modifySynonymResults.diff 2008-03-08 08:45 PM Bryan Pendleton 12 kB
HTML File Licensed for inclusion in ASF works releaseNote.html 2008-04-01 01:29 PM Rick Hillegas 6 kB
HTML File Licensed for inclusion in ASF works releaseNote.html 2008-03-20 02:42 PM Bryan Pendleton 6 kB
HTML File Licensed for inclusion in ASF works releaseNote.html 2008-03-20 05:20 AM Bryan Pendleton 4 kB
File Licensed for inclusion in ASF works reproTests.diff 2007-06-10 09:45 PM Bryan Pendleton 5 kB
Environment: Any
Issue Links:
Reference

Issue & fix info: Release Note Needed
Resolution Date: 16/Mar/08 03:51 AM


 Description  « Hide
When distinct is in the select list and the query has order by with expression, the resultset produced contains an additional column.

ij> create table t1 (c1 int, c2 varchar(10))
0 rows inserted/updated/deleted
ij> insert into t1 values (1,'a'),(2,'b'),(3,'c');
3 rows inserted/updated/deleted
select distinct c1, c2 from t1 order by c1;
C1 |C2
----------------------
1 |a
2 |b
3 |c

3 rows selected
ij> select distinct c1, c2 from t1 order by c1+1;
C1 |C2 |3 <=====returns 3 columns, incorrect result returned
----------------------------------
1 |a |2
2 |b |3
3 |c |4

3 rows selected


 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Bryan Pendleton added a comment - 17/Feb/07 01:29 AM
Hi Yip. I haven't spent any time looking at this yet, just wanted to mention a hunch that this might be related to DERBY-1861. If you apply the patch for DERBY-1861, does the behavior change at all?

Yip Ng added a comment - 17/Feb/07 02:01 AM
Hi Bryan. Let me try to apply the DERBY-1861 patch and see if it is related. Meanwhile, it looks like the error is not limited to order by with expression, I get the same behavior with column reference.

ij> select distinct c2 from t1 order by c1;
C2 |C1
----------------------
c |1
b |2
a |3

Yip Ng added a comment - 17/Feb/07 02:24 AM
I applied the DERBY-1861 patch but the result is the same, so it may or may not be related.

I remembered that in some other databases(DB2 and Oracle), this construction is not allowed but I can't recall for now that if this is an implementation or a standards restriction.
 

Bryan Pendleton added a comment - 17/Feb/07 02:29 AM
Thanks for checking, Yip! It was worth a try... :)

Yip Ng added a comment - 17/Feb/07 03:07 AM
Ah, I think I see why this particular case should throw an exception. The result is not predictable (and not portable) if the ORDER BY sort key is not in the SELECT list. Consider the following example:

create table person (name varchar(10), age int);
insert into person values ('John', 10);
insert into person values ('John', 30);
insert into person values ('Mary', 20);

SELECT DISTINCT name FROM person ORDER BY age;

So which row to display for John? The John (age 10) or the other John (age 30)?
Result may be:
John
Mary

or

Mary
John

Derby should not allow this and return an error.

Bryan Pendleton added a comment - 10/Jun/07 09:45 PM
The symptoms of this problem seem to have changed in the last 6 months. However, there is definitely still something wrong. With the latest trunk, I don't see the extra mystery columns in the IJ output. However, I do appear to be getting wrong results.

And I agree with Yip that in the wrong results case, the query is ambiguous and should be rejected with an error. Interestingly, in the wrong results case, Derby currently chooses neither of the answers that Yip suggested, but instead chooses to display all three rows! Thus Derby currently chooses to violate the DISTINCT part of the query, rather than violating the ORDER BY part of the query. Given that the two parts of the query are in conflict, I'm not sure that it really matters very much which part of the query Derby disobeys, as either way is wrong and the only thing that can be done is to reject the query as ambiguous.

I'm attaching reproTests.diff, a patch proposal with some new test cases for the ORDER BY test suite. These test cases demonstrate the wrong results.

Although I don't currently have a fix for this problem, or even a clear idea of how to fix it, I thought that it would be useful to post the test cases anyway.

Bryan Pendleton added a comment - 15/Jun/07 04:44 AM
DERBY-2805 also describes problems when DISTINCT and ORDER BY are combined in a single query. However, in the DERBY-2805 case, there is no ambiguity, and the query is valid as phrased; the issue involves how the compiler propagates the information about row ordering and duplicate elimination through the query tree. So the two issues are not closely related. Still, it seems useful to link them.

Bryan Pendleton added a comment - 16/Jun/07 03:02 PM
The SQL Standard appears to agree with the conclusion that queries such as those Yip has constructed should be rejected as invalid. Here's a brief synopsis of why I think the standard says this:

- Section 14.1 (2003 standard) covers DECLARE CURSOR.
- Syntax Rule 18 covers the ORDER BY clause
- Syntax Rule 18.d.i covers the case in which the ORDER BY clause contains a sort key that contains a column reference to a column that is not a column of the result of the query expression
- Syntax Rule 18.d.i.9.B.II states:

  QS shall not specify the <set quantifier> DISTINCT or directly contain one or more <set function specification>s.

So I propose to modify the ORDER BY column checking so that it rejects such queries.

Bryan Pendleton added a comment - 17/Jun/07 03:30 AM
Attached is derby_2351.diff, which is not for commit. This patch
proposal is close, but not quite correct.

The problem with this patch proposal is that it rejects

  SELECT distinct emp.* FROM employee emp ORDER BY emp.name

At the moment of "pull-up" processing, "emp.name" does not appear
in the result set of the query expression ("emp.*"), and so the change
rejects it, even though later in bind processing it would be the case
that "emp.name" *does* appear, and hence should be allowed, once
the all-columns wildcard is expanded.

I'll take another shot at this change later, but wanted to capture this
first attempt in the JIRA record anyway, as it represents a step along
my thinking about the problem.

Bryan Pendleton added a comment - 17/Jun/07 07:42 PM
Attached is derby_2351_v2.diff, a patch proposal.

This proposal relocates the "order by column must be in the
distinct result set" check to later in bind processing, after the
asterisk(s), if present, have been expanded to result column lists,
which makes the patch behave properly for statements like

  SELECT DISTINCT * FROM t1 ORDER BY c1

The patch also includes some additional tests.

I believe the patch is now ready for review and suggestions.

The patch passes all of derbyall and suites.All.

Please let me know your comments.

Bryan Pendleton added a comment - 10/Jul/07 10:09 PM
Committed derby_2351_v2.diff to the trunk as revision 555096.

I'm leaving this issue open for a while, because I think this change
is probably worth merging to the 10.3 branch.

Bryan Pendleton added a comment - 12/Aug/07 10:25 PM
Committed the change to the 10.3 branch as revision 565184.

Need to update the fix versions, as JIRA isn't showing me an appropriate fix version right now. The appropriate fix version is probably 10.3.2.0 or something like that.

A B added a comment - 18/Feb/08 05:05 PM
Per the user thread found here:

  http://article.gmane.org/gmane.comp.apache.db.derby.user/8393

the following query appears to have worked in 10.3.1.4 but fails in 10.3.2.1 and later:

  create table t(c1 int, c2 int);
  select distinct c1 from t order by c1; -- works
  select distinct c1 as a1 from t order by c1; -- used to work, now fails

The failure shows the exception that was added for this issue, namely:

  ERROR 42879: The ORDER BY clause may not contain column 'C1', since the query specifies
  DISTINCT and that column does not appear in the query result.

I skimmed over the comments for this issue and, from what I can tell, the above query (with the column alias) is *not* ambiguous--at least, not in the way that this issue describes. I.e. if the alias "A1" can be identified as pointing to "C1" (which should be possible...I think?) then the query satisfies the requirement that ORDER BY columns be a subset of the DISTINCT columns.

Was it the intent of this issue to deliberately block queries such as this one, or was that an accident? An easy enough workaround exists--just specify "A1" in the order by clause instead of "C1"--but I think the question remains: is that supposed to be necessary?

Note: I haven't actually done a pre- and post- commit check for this specific issue, I'm just assuming (perhaps incorrectly) that this issue is the one that changed the behavior, given the discussion and the new error code. Apologies if that assumption is wrong...

Bryan Pendleton added a comment - 18/Feb/08 09:15 PM
It was not the intent of this change to reject the alias form of the query; this change
was specifically intended to address the issue that Yip raised in Feb 2007:
https://issues.apache.org/jira/browse/DERBY-2351?focusedCommentId=12473871#action_12473871
in which there was a conflict between the DISTINCT and ORDER BY
specifications in the query.

I agree with Army that "select distinct c1 as a1 from t order by c1" does not
contain such a conflict, and so this patch did not intend to reject that query.

However, it is not clear to me whether that query ought to be allowed or not,
since I don't understand whether it is supposed to be legal to refer to an
underlying base table column in the ORDER BY clause, or whether the column
alias is *required* to be used. Note, in particular, this comment by Jack Klebanoff
regarding the topic:
https://issues.apache.org/jira/browse/DERBY-84?focusedCommentId=64077#action_64077

I'm worried that we may have a number of related problems in this whole area of
handling expressions, aliasing, column number references, and simple column
references in ORDER BY, GROUP BY, and HAVING clauses. Note, for example,
DERBY-2457, DERBY-2085, DERBY-84, DERBY-280, DERBY-1861, DERBY-127,
and DERBY-3094 for some other examples, both past and present.

I think we need to spend some time understanding how the SQL Standard intends
that the ORDER BY, GROUP BY, and HAVING clauses should behave with respect
to column references, columns identified by column number, and value expressions,
and then we need to ensure that Derby is behaving correctly.



A B added a comment - 18/Feb/08 10:45 PM
> It was not the intent of this change to reject the alias form of the query [...]

Okay, thank you for the reply, Bryan. And thanks for the investigation that you've done to reply so thoroughly.

> However, it is not clear to me whether that query ought to be allowed or not [...]

From the various responses to the user thread referenced in my previous comment, it sounds like different databases treat this kind of thing differently. So I don't know what the "best" behavior would be here, either...

For the record, I was merely asking the question because a query which worked in one release stopped working in the next, with no clear explanation as to why. The response on the user list was generally "well that doesn't work in some other databases, so don't do it; use the alias instead"--which is fine as a workaround, but it left me wondering why the behavior changed between releases to begin with. Your reply above filled in the missing information for me--so thanks!

> I think we need to spend some time understanding how the SQL Standard intends
> that the ORDER BY, GROUP BY, and HAVING clauses should behave [...]

If the standard lays it all out, then I agree, that certainly seems like a good way to go...

Bryan Pendleton added a comment - 19/Feb/08 12:30 AM
Re-opening the issue to reflect the observation that queries that previously
worked are now being rejected.

Bryan Pendleton added a comment - 19/Feb/08 02:41 AM
Here's an interesting example of a query combining both expressions
and column aliases in the ORDER BY clause:

 create table t1 (a varchar(10));
 insert into t1 values ('M'), ('k'), ('c'), ('K'), ('m');
 SELECT CONCAT('test', a) AS str FROM t1 ORDER BY UPPER(str);

Bryan Pendleton added a comment - 19/Feb/08 02:58 AM
And another rather similar query. For this one, the poster commented that
the query worked if the ORDER BY specified NAME1, but not if it specified FIRSTNAME.

SELECT DISTINCT ISNULL(FirstName, '') AS Name1 FROM Person.Contact ORDER BY FirstName

Dyre Tjeldvoll added a comment - 29/Feb/08 09:12 PM
I just changed the fix version from 10.4 and 10.3.2.1 to 10.4 and 10.3.2.2 since 10.3.2.1 has already been released without this fix.

Bryan Pendleton added a comment - 02/Mar/08 06:29 PM
Currently, the code which associates ORDER BY column specs with the ResultColumn
instances in the SELECT's RCL examines the SELECT list using only the "exposedName".
The exposedName is the same as the column name for simple column references,
but is the alias name when a column alias is used, and is an internally generated name
for expressions. So in:

  select age, name as first_name, age / 7 as age_in_dog_years, upper(name) from person;

we have 4 columns, with the exposedName 's of:
 - age
 - first_name
 - age_in_dog_years
 - SQLCol1

Since ORDER BY looks only at the exposed name, when it sees a statement like:

  select name as first_name from person order by name;

it doesn't find the column 'name' in the SELECT RCL, so it decides thta it needs
to "pull up" the name column from the underlying table.

Since pulled up columns are invalid for DISTINCT queries, we get the error in question.

It seems relatively straightforward to enhance ResultColumnList.java's
findResultColumnForOrderBy() and getOrderByColumnToLink() methods so that
they search for columns using both the exposedName *and* the underlying
column reference's name, thus making the order by column *not* be a pulled-up
column, and thus valid in DISTINCT queries, and that does indeed solve the repro
script from the 18-feb-2008 comment on this issue.

I'm investigating working that change up as a proper patch, with additional test
cases, and running the current regression tests to try to figure out what else may break
as a result of introducing this new column resolution algorithm.

Bryan Pendleton added a comment - 02/Mar/08 10:42 PM
Matching first on the exposedName, then on the underlying name,
seems like a promising technique for this issue. Attached is
d2351_aliasing.diff, a patch proposal with a code change as
described in the previous comment and some additional tests.

This patch passes the existing regression tests. Comments are
most welcome.

Thomas Nielsen added a comment - 03/Mar/08 09:30 AM
I had a look at the patch, and IMHO the code and approach taken look good :) Good combination of queries tested.

However, the master diff (orderby.out) is missing for orderby.sql - could you please attach that too?

A B added a comment - 03/Mar/08 06:17 PM
Hi Bryan, thanks for the patch. Two minor comments:

1. Slight whitespace inconsistency (w.r.t. existing code) in ResultColumn.java
2. The following query (pulled from your modified orderby.out) currently fails with error 42X79:

      +ij> select distinct name as age from person order by person.age;
      +ERROR 42X79: Column name 'AGE' appears more than once in the result of the query expression.

It's good that the query fails, but I wonder if error 42X79 is the best error here? From the user's perspective there is only a single column in the query's result list, so the error seems slightly misleading. Instead, I think error 42879 (added as part of this issue) is more appropriate, since the query attempts to order by a column that is not in the DISTINCT list. Any idea how hard it would be to catch that scenario and throw 42879 instead of 42X89? Would this just be a matter of checking to see if the column was "pulled" into the select list and, if so, throwing 42879 instead of 42X79? Or is it more complicated than that? I don't think this is a big deal at all--as long as there is an error I think it's fine--but I thought I'd ask the question to see if you have already considered it.

Bryan Pendleton added a comment - 04/Mar/08 04:34 AM
Thanks Thomas and Army for having a look at the patch.

Attached is a revised version of d2351_aliasing.diff, with
the whitespace corrected in ResultColumn.java (I think),
and a few additional test cases.

Thomas, please let me know if you aren't able to see
orderby.out in this patch.

Army, with respect to the error message, I agree that it is
confusing, as there only appears to be one column in
the result of the query expression. However, I think this
behavior predates this patch, and moreover it is also
independent of whether DISTINCT is present. Even without
this patch applied:

ij> select name as age from person order by person.age;
ERROR 42X79: Column name 'AGE' appears more than once in the result of the query expression.

While looking at this, however, I discovered a possibly more
disturbing problem. The following query *works* before the
patch, but is *rejected* after the patch:

   select person.name as name, pets.name as pet_name from person,pets order by name;

Before this patch, the ORDER BY analysis looks only at the exposed
names 'name' and 'pet_name'. But with the patch in place, the ORDER BY
processing sees that the column with the exposed name pet_name
also has the underlying source column name 'name', and so it sees
two possible columns to which 'ORDER BY NAME' could refer, and
refuses the query as ambiguous.

I'm not quite sure what to do, but I'm attaching the updated patch
anyway, to try to keep the discussion going.


Bryan Pendleton added a comment - 04/Mar/08 04:51 AM
In Derby 10.2, all of the following queries work:

    -- orders the rows by the first column (person.name):
    select person.name as name, pets.name as pet_name from person, pets order by name;
    -- orders the rows by the first column (person.name, aliased to 'age'):
    select person.name as age from person order by person.age;
    -- again appears to order the rows by the aliased column?
    select distinct person.name as age from person order by person.age;

The first query is only ambiguous if we interpret an unqualified ORDER BY
reference to refer to either an exposedName or to an underlying column name.
The third query, of course, is inspired by the query that Yip noted as problematic
in the comment on 16-Feb-2007. I think that the results from
the second query in 10.2 are particularly disturbing, since even though the
user specifically said to sort by 'person.age', the 10.2 code apparently sorted
by the exposedName column 'age', since if we had really sorted the rows by
person.age we'd have produced the result

AGE
----------
john
mary
john

So, how bad is it to break these queries? They did not throw errors before,
but were they giving the correct results?

The complete 10.2 script session is pasted below.
============================================================
ij version 10.2
ij> connect 'jdbc:derby:ten2db;create=true';
ij> create table person (name varchar(10), age int);
0 rows inserted/updated/deleted
ij> create table pets (name varchar(10), age int);
0 rows inserted/updated/deleted
ij> insert into person values ('john', 30), ('mary', 50), ('john', 10);
3 rows inserted/updated/deleted
ij> insert into pets values ('Buster', 1), ('Fido', 3);
2 rows inserted/updated/deleted
ij> select person.name as name, pets.name as pet_name from person, pets order by name;
NAME |PET_NAME
---------------------
john |Fido
john |Fido
john |Buster
john |Buster
mary |Fido
mary |Buster

6 rows selected
ij> select person.name as age from person order by person.age;
AGE
----------
john
john
mary

3 rows selected
ij> select distinct person.name as age from person order by person.age;
AGE
----------
john
mary

Thomas Nielsen added a comment - 04/Mar/08 08:32 AM
Bryan>Attached is a revised version of d2351_aliasing.diff, with
Bryan>the whitespace corrected in ResultColumn.java (I think),

No whitespace in the currnet diff :)

>Thomas, please let me know if you aren't able to see orderby.out in this patch.

The .out was there all along, it just doesn't apply for some reason. Same problem with the current patch. It was made from the head of trunk, right?

I agree that the second query from 10.2 produces the wrong results - it's using an explicit table.columnname reference for ordering. In this particular case an ambigous error would be better that wrong results.

Could we use the fact that the user specified tableName.columnName, and not just columnName to distinguish between the aliased and original column names somehow? If using t.c notation you could actually exclude aliased columns from the check.

This would mean
   
   select distinct person.name as age from person order by person.age;
   => explicit check on column named 'age' in table 'person'

   select distinct person.name as age from person order by age;
   => alias 'age' exists, check alias
   
   select distinct person.name as their_age from person order by age;
   => alias 'age' does not exists, check 'person' for 'age'

  select person.name as name, pets.name as pet_name from person,pets order by name;
  => alias 'name' exists, check alias

   select person.name as person_name, pets.name as pet_name from person,pets order by person.name;
   => explicit check on column named 'name' in table 'person'

but

   select person.name as person_name, pets.name as pet_name from person,pets order by name;
   => ambiguous, no alias 'name', but both 'person' and 'pets' have column 'name'

We may actually lack information on whether the user did explicitly use t.c or only column/alias name at this stage, so it might not be possible at all for all I know. I also see potential for breaking existing applications with such changes. But if keeping the old behavior produces wrong results, your current patch with throwing an ambiguous exception is still a lot better than returning wrong results IMHO!

A B added a comment - 04/Mar/08 05:20 PM
> I think this behavior predates this patch, and moreover it is also
> independent of whether DISTINCT is present.

Ah, okay, thanks for pointing that out.

> if we had really sorted the rows by person.age we'd have produced the result
>
> AGE
> ----------
> john
> mary
> john

Just to be clear, note the following:

ij> select * from person order by age;
NAME |AGE
----------------------
john |10
john |30
mary |50

If we sort the rows by person.age the order is, in fact, "john, john, mary". So from that it's hard to tell whether 10.2 sorted the query on the alias "AGE" (meaning column NAME) or the actual column AGE. But when I added another row, ('zack', 5), we see the problem you're talking about:

ij> insert into person values ('zack', 5);
1 row inserted/updated/deleted

ij> select person.name as age from person order by person.age;
AGE
----------
john
john
mary
zack

Since we're supposed to be sorting by person.age, zack should be first, not last. So you're right, 10.2 seems wrong.

> how bad is it to break these queries? They did not throw errors before, but were they
> giving the correct results?

I agree with Thomas in that it seems reasonable to throw an error instead of returning wrong results--as long as we indicate to users that such a change happens. Which is, incidentally, exactly what the original fix for this issue did: a query that used to work is now rejected as invalid.

And that in turn begs the question: should this issue have been marked "Existing Application Impact" since the solution affects existing applications (queries that used to run without error will now fail)? Seems like 10.3.2.1 was released with the initial fix, which changed the behavior, but no indication of existing application impact was made. I assume that was unintentional?

Bryan Pendleton added a comment - 04/Mar/08 05:47 PM
The fix to this issue may impact existing applications because some
queries which were previously accepted and executed will now
return errors. The intent of the fix is that the only queries that fall
into that category are those which are ambiguous due to the fact
that they specify a DISTINCT set of results, but the ORDER BY clause
mentions one or more columns which are not present in the DISTINCT
clause, introducing an ambiguity about how to perform the ordering
when there exist multiple rows with the same values for the DISTINCT
column(s) but different values for the ORDER BY columns.

A B added a comment - 05/Mar/08 07:33 PM
Army> it seems reasonable to throw an error instead of returning wrong results--as long
Army> as we indicate to users that such a change happens.

Perhaps this is obvious, but it occurs to me that the difference between the query being discussed and the query that was originally disallowed by changes for this Jira is that the latter was disallowed because the semantics are not defined, while the former has a well-defined behavior that Derby fails to accomplish. I don't think this changes the fact that it's (probably) better to throw an error than to allow wrong results, but I *do* think that if a commit is made which causes the second query (from Bryan's Mar 03 comments) to start throwing an error (instead of returning results in the wrong order), it would be good to file another Jira defect to indicate that such an error is due to a known (current) Derby limitation.

Bryan Pendleton added a comment - 06/Mar/08 04:29 AM
Thomas, I thought that your suggestion about using different
name resolution rules depending on whether the ORDER BY
column reference was qualified or unqualified was excellent,
and I decided to give it a try.

Attached is d2351_aliasing_checkQualifiedName.diff,
which uses different resolution rules depending on whether
the ORDER BY column reference is table.column, or just column:
 - if the table name is provided, we match against the
   underlying table name, and don't consider any aliases
 - if the table name is NOT provided, we first match against
   the alias name, if present, and if no alias name matches
   then we match against the underlying source column name.

It seems to resolve the problematic queries that we've been
discussing over the past several days, and doesn't obviously
break any of the other queries in the orderby.sql suite.

I'm running a more complete set of regression tests overnight.
Please have a look at this latest patch and let me know what you think.

Thomas Nielsen added a comment - 06/Mar/08 10:14 AM
That's good news Bryan :)

One thing you could perhaps fix is the mix of spaces and tabs in the diff? Other than that it looks good.

The updated orberby.sql test seems to cover the different cases nicely, and runs those queries successfully.
I did however get one new error in the lang._Suite that looks related to the change:
----
1) synonym(org.apache.derbyTesting.functionTests.tests.lang.LangScripts)junit.framework.ComparisonFailure: Output at line 526 expected:<[IDCOLUMN1 |IDCOLUMN2 ]> but was:<[ERROR 42X04: Column 'T1.IDCOLUMN1' is either not in any table in the FROM list or appears within a join specification and is outside the scope of the join specification or appears in a HAVING clause and is not in the GROUP BY list. If this is a CREATE or ALTER TABLE statement then 'T1.IDCOLUMN1' is not a column in the target table.]>
        at org.apache.derbyTesting.functionTests.util.CanonTestCase.compareCanon(CanonTestCase.java:100)
        at org.apache.derbyTesting.functionTests.util.ScriptTestCase.runTest(ScriptTestCase.java:124)
        at org.apache.derbyTesting.junit.BaseTestCase.runBare(BaseTestCase.java:101)
        at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
        at junit.extensions.TestSetup$1.protect(TestSetup.java:21)
        at junit.extensions.TestSetup.run(TestSetup.java:25)
        at org.apache.derbyTesting.junit.BaseTestSetup.run(BaseTestSetup.java:57)
---

Bryan Pendleton added a comment - 06/Mar/08 02:49 PM
Hi Thomas, thanks for catching this. I see the same failure in my environment.

Here's the statement in question:

    select t1.id as idcolumn1, t1.id as idcolumn2 from t1 order by t1.idcolumn1, t1.idcolumn2;

Without my patch, it binds "t1.idcolumn1" to the alias "idcolumn1",
but with my patch it rejects the query because there is no column
"idcolumn1" in table "t1".

The behavior of the patch seems "correct", according to the rules we discussed.

But it does cause a query which was formerly accepted, to be rejected.

My feeling is that the query above deserves to be rejected, but I wonder if anyone
knows what the SQL Standard says about the construct "tablename.aliasname"?


Bryan Pendleton added a comment - 06/Mar/08 02:54 PM
I suppose one last try would be:
 - unqualified ORDER BY reference: bind first to alias, then to column in (any) underlying table
 - qualified ORDER BY reference: bind first to column in corresponding underlying table, then to alias

That would make the nasty query from synonym.sql work, I suppose,
but it doesn't feel like very pleasant code to need to write. If I could
start from scratch I'd prefer to make all these ambiguous cases be
errors, as I suspect that users don't really want to relabel one column
with an identical name to another column; they'd prefer to always choose
unique alias names and have no ambiguity about their queries.


A B added a comment - 06/Mar/08 04:38 PM
I haven't reviewed the latest patch yet, but regarding:

> select t1.id as idcolumn1, t1.id as idcolumn2 from t1 order by t1.idcolumn1, t1.idcolumn2;
>
> The behavior of the patch seems "correct", according to the rules we discussed.
> My feeling is that the query above deserves to be rejected

I agree with Bryan on this one, i.e. it seems reasonable to throw an error in this case. I did try something similar on DB2 v8 and it throws an error:

  select t1.i as a, t1.j as b from t1 order by t1.a, t1.b
  SQL0206N "T1.A" is not valid in the context where it is used.

  select t1.i as a, t1.j as b from t1 order by t1.i, t1.b
  SQL0206N "T1.B" is not valid in the context where it is used.

  select t1.i as a, t1.j as b from t1 order by t1.i, t1.j
  <works as expected>

So there is a precedent, for what it's worth...

Thomas Nielsen added a comment - 06/Mar/08 05:03 PM
Just to add to the bahavioural patterns - running the queries Army posted though another OSDB give the same results as Army see in DB2:

select t1.c1 as i, t1.c2 as j from t1 order by t1.i, t1.j ;
ERROR: column t1.i does not exist
SQL state: 42703
Character: 48

select t1.c1 as a, t1.c2 as b from t1 order by t1.i, t1.c2
ERROR: column t1.i does not exist
SQL state: 42703
Character: 49

select t1.c1 as a, t1.c2 as b from t1 order by t1.c1, t1.c2
<works as expected>

Bryan Pendleton added a comment - 07/Mar/08 04:23 AM
Thanks Thomas and Army for investigating the behavior of other DB's.
This is encouraging, for it seems to substantiate our understanding
of how the binding rules ought to behave.

Another unresolved question involves the queries in DERBY-3373; it
would be nice to construct a patch which solved those problems as well.
However, the problem in DERBY-3373 is rather different than the ones
we've been discussing here recently, as it doesn't involve aliasing, but
rather involves the difference between an expression which refers to a
column, and a simple direct column reference.

Bryan Pendleton added a comment - 08/Mar/08 04:12 PM
I'm considering modifying synonym.out to reflect that

    select t1.id as idcolumn1, t1.id as idcolumn2 from t1 order by t1.idcolumn1, t1.idcolumn2;

should get an error, and then proceeding with a commit of this patch.

I think that the patch improves substantially upon the current behavior
of the trunk, accepting many more legitimate queries and refusing only
more questionable queries.

I think the DERBY-3373 issue can be addressed separately.

Bryan Pendleton added a comment - 08/Mar/08 08:45 PM
Patch 'modifySynonymResult.diff' includes the modifies results for the synonym test.

I think this patch is worthy of commit; please have a
look and let me know what you think.

Thomas Nielsen added a comment - 10/Mar/08 08:48 AM
I agree, 'modifySynonymResult.diff' is worthy of commit.

A B added a comment - 10/Mar/08 04:27 PM
modifySynonymResult.diff looks good to me. Might be nice if the whitespace inconsistencies could be patched up a bit, but functionally I think the patch is good.

Two minor suggestions:

1) Could the logic in ColumnReference.columnNameMatches() use the new "getSourceColumnName()" method, instead of duplicating the "instanceof ColumnReference" check?

2) Do you think it would be worth it add some sanity queries for table aliasing, as well? Ex, I ran the following:

  -- All succeed.
  select distinct i a from t1 hmm order by i;
  select distinct i a from t1 hmm order by a;
  select distinct i a from t1 hmm order by hmm.i;

  -- All throw an error.
  select distinct i a from t1 hmm order by t1.i;
  select distinct i a from t1 hmm order by t1.a;
  select distinct i a from t1 hmm order by hmm.a;
 
From what I can tell treatment of all of the above queries is correct, so this would just be a matter of adding them to the test script. That could come as a separate patch, though.

Thanks for your persistence on this one, Bryan!

Bryan Pendleton added a comment - 13/Mar/08 03:02 AM
Thanks Army and Thomas for the continued help! I re-worked modifySynonymResult.diff
slightly to incorporate Army's suggestions and committed it to the trunk as revision 636608.

If no additional problems arise with this patch in the trunk, I'll investigate merging
back to the 10.4 branch.

Bryan Pendleton added a comment - 16/Mar/08 03:51 AM
I merged the patch to the 10.4 branch and re-ran the tests without incident.
I committed the change to the 10.4 branch as revision 637526.

Dyre Tjeldvoll added a comment - 18/Mar/08 10:13 PM
This issue has fix version 10.4 and is marked with either 'Release note needed' or 'Existing application impact', but does not have a releaseNote.html attached to it. Should it?

Bryan Pendleton added a comment - 19/Mar/08 04:07 PM
I agree, it should have a release note. I'll try to get one written soon.

Bryan Pendleton added a comment - 20/Mar/08 05:20 AM
A draft release note. Please review.

Bryan Pendleton added a comment - 20/Mar/08 02:42 PM
I updated the release note this morning after realizing that this
issue has two application-visible impacts.

Please review the revised releaseNote.html.

Bryan Pendleton added a comment - 21/Mar/08 04:19 PM
Merged the modifySynonymResult.diff patch to the 10.3 branch as revision 639696.

Thomas Nielsen added a comment - 25/Mar/08 08:16 AM
Updated releaseNote.html looks good IMHO :)

Rick Hillegas added a comment - 01/Apr/08 01:29 PM
Scrubbed the release note so that it will survive the syntax checks enforced by ReleaseNoteReader.

Dyre Tjeldvoll added a comment - 02/Apr/08 08:29 AM
Checking 'release note needed' so the release note filter can pick it up.