Issue Details (XML | Word | Printable)

Key: DERBY-3613
Type: Bug Bug
Status: Resolved Resolved
Resolution: Fixed
Priority: Major Major
Assignee: Bryan Pendleton
Reporter: Artur Kuś
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Derby

SELECT DISTINCT field FROM TABLE_NAME GROUP BY field, field2

Created: 11/Apr/08 01:28 PM   Updated: 04/May/09 06:22 PM
Return to search
Component/s: SQL
Affects Version/s: 10.3.2.1
Fix Version/s: 10.4.2.0, 10.5.1.1

Time Tracking:
Not Specified

File Attachments:
  Size
File Licensed for inclusion in ASF works addCommentAndAnotherTest.diff 2008-04-23 02:17 AM Bryan Pendleton 6 kB
File Licensed for inclusion in ASF works dontGenerateForDistinct.diff 2008-04-22 02:37 AM Bryan Pendleton 5 kB
File Licensed for inclusion in ASF works tenThree.diff 2008-04-24 10:20 PM Bryan Pendleton 5 kB
Environment: Windows XP

Resolution Date: 25/Apr/08 02:09 PM


 Description  « Hide
Query 'SELECT DISTINCT field FROM TABLE_NAME GROUP BY field, field2' not work ok.
 Distinct is ignored !!!

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Bryan Pendleton added a comment - 11/Apr/08 02:20 PM
Can you provide an example SQL script, perhaps a captured IJ session,
which shows the specific tables, statements, and data you are using,
the results you get, and the results you expect?

Artur Kuś added a comment - 11/Apr/08 03:31 PM
Simple :

CREATE TABLE tabla_name(field varchar(10), field2 varchar(10 ));
INSERT INTO tabla_name VALUES('first', 'first');
INSERT INTO tabla_name VALUES('first', 'second');
SELECT DISTINCT field FROM tabla_name GROUP BY field, field2.

I expect only one 1 row, but I get two rows.
Why??

Bryan Pendleton added a comment - 11/Apr/08 03:37 PM
I believe you get two rows because the results have been partitioned
into two groups, as you specified in the GROUP BY.

Within each group you are getting only one distinct value of 'field',
but there are two groups total.

Artur Kuś added a comment - 11/Apr/08 03:42 PM
Thanks.
It is very funny.
Oracle, MSDE and Postgres returns only one rows. :)
JavaDB is worse ??


Bryan Pendleton added a comment - 11/Apr/08 03:54 PM
Derby might be interpreting the standards incorrectly here, I don't know.

Do you have access to the SQL standard to help us understand what
the specified behavior should be?

Artur Kuś added a comment - 14/Apr/08 07:55 AM
I starting from the page
http://en.wikipedia.org/wiki/SQL
and next http://en.wikipedia.org/wiki/SQL-92
and next open the page http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt.
In the specyfication I found that
"If DISTINCT is specified, then let TXA be the result of elimi-
            nating redundant duplicate values from TX. Otherwise, let TXA be
            TX."

So, in JavaDB is not ok :)


 

Knut Anders Hatlen added a comment - 14/Apr/08 12:58 PM
I tried the query with different versions of Derby:

10.0.1.2: two rows
10.1.1.0: two rows
10.1.2.1: one row
10.2.1.6: one row
10.2.2.0: one row
10.3.1.4: two rows
10.3.2.1: two rows
10.4.1.1 (RC): two rows

So, given that one row is the correct result, the bug was fixed at some point, but it regressed back later.

Knut Anders Hatlen added a comment - 14/Apr/08 03:00 PM
The first change (two rows --> one row) happened in this commit:

------------------------------------------------------------------------
r267239 | bandaram | 2005-09-02 20:07:39 +0200 (Fri, 02 Sep 2005) | 6 lines

DERBY-504: Disable pushing down DISTINCT when number of columns from the parent query and the subquery don't match. Can lead to wrong results.

Thanks to Rick for persistant reviews and running the tests.

Submitted by Knut Anders Hatlen (Knut.Hatlen@Sun.COM)

------------------------------------------------------------------------


The old behaviour (one row --> two rows) was reintroduced in this commit:

------------------------------------------------------------------------
r516454 | abrown | 2007-03-09 17:37:20 +0100 (Fri, 09 Mar 2007) | 8 lines

DERBY-681: Remove the "wrap group by's in a subselect" rewrite in the parser.
This patch preserves the having clause through bind and optimize phases and
then, during the final rewrite for aggregates in the GroupByNode, it transforms
the having clause to a valid restriction. See text file attached to the Jira
for more information.

Contributed by Manish Khettry (manish_khettry@yahoo.com)

------------------------------------------------------------------------

Bryan Pendleton added a comment - 20/Apr/08 02:45 AM
This page suggests that certain versions of Sybase may have similar behavior to Derby:

http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.help.ase_15.0.sqlug/html/sqlug/sqlug126.htm

It seems like Derby could recognize that:

  select distinct a from t group by a, b

is identical to

  select distinct a from t group by a

I can think of 3 possible ways to give Derby this behavior:
1) When processing GROUP by columns, and deciding whether to pull them up
into the result column list as generated columns, detect this particular case and
don't pull up the generated GROUP BY column "b" into the result list.
2) In the compilation process, prior to generating the distinct scan result set,
detect that un-necessary column "b" has been included into the result column list
and skip it when generating the data structures for execution.
3) At execution time, when setting up the sorter, recognize that column "b" isn't
needed in the sort key, and only sort/collapse the records on column "a".

It seems best to detect this situation earlier rather than later, so I'm partial
to solutions which can solve the problem at compilation time, not execution time.

Bryan Pendleton added a comment - 22/Apr/08 02:37 AM
Attached is 'dontGenerateForDistinct.diff', a patch proposal.

This patch modifies the processing of GROUP BY columns
which are "generated" into the select's result column list.
Specifically, the patch causes columns to be generated
into the select RCL only if the select does *not* specify DISTINCT.

If the select specifies DISTINCT, we don't want to include
any additional columns into the RCL because we want to
be sure that we only perform DISTINCT processing on
the columns that were specified by the user.

The patch also includes some additional test cases, based
on the repro case in the issue description.

I ran a complete set of regression tests with the modified code
and it passed all the existing tests, as well as the new tests
added by this patch.

Please have a look and let me know what you think.

Thomas Nielsen added a comment - 22/Apr/08 06:52 AM
The approach taken in the proposed patch looks good. Test changes seem to cover the failing scenarios well.
Running a few tests on the patch now, but unless they show any problems I'm +1 to commit.


Thomas Nielsen added a comment - 22/Apr/08 11:38 AM
No errors seen in my testing - so +1 for commit.

Bryan Pendleton added a comment - 23/Apr/08 02:17 AM
Thanks for the review and testing Thomas!

I updated the patch with (a) a short comment on the code change,
and (b) a few other test cases which I created by looking at some
of the other test cases in GroupByTest and adding DISTINCT to them.

The revised patch is attached.

Bryan Pendleton added a comment - 23/Apr/08 02:19 AM
Committed addCommentAndAnotherTest.diff to the trunk as revision 650728.

I'll investigate merging this patch back to the 10.4 and 10.3 lines.

Bryan Pendleton added a comment - 24/Apr/08 02:27 PM
Merged the trunk change to the 10.4 branch as revision 651275.

Bryan Pendleton added a comment - 24/Apr/08 10:20 PM
GroupByTest.java is quite different in 10.3, making it hard to do an automated merge.

Attached file tenThree.diff is the patch I intend to submit to 10.3, if testing goes well.

Bryan Pendleton added a comment - 25/Apr/08 02:09 PM
Committed tenThree.diff to the 10.3 branch as revision 651612.

Marking the issue as resolved. Please confirm that the fix is working,
and close the issue if the problem is gone. Thanks for helping us
find this problem!

Rick Hillegas added a comment - 06/Aug/08 06:54 PM
Does this issue need a release note? With this fix, a family of queries will now return different results.