|
Simple :
CREATE TABLE tabla_name(field varchar(10), field2 varchar(10 )); INSERT INTO tabla_name VALUES('first', 'first'); INSERT INTO tabla_name VALUES('first', 'second'); SELECT DISTINCT field FROM tabla_name GROUP BY field, field2. I expect only one 1 row, but I get two rows. Why?? I believe you get two rows because the results have been partitioned
into two groups, as you specified in the GROUP BY. Within each group you are getting only one distinct value of 'field', but there are two groups total. Derby might be interpreting the standards incorrectly here, I don't know.
Do you have access to the SQL standard to help us understand what the specified behavior should be? I starting from the page
http://en.wikipedia.org/wiki/SQL and next http://en.wikipedia.org/wiki/SQL-92 and next open the page http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt. In the specyfication I found that "If DISTINCT is specified, then let TXA be the result of elimi- nating redundant duplicate values from TX. Otherwise, let TXA be TX." So, in JavaDB is not ok :) I tried the query with different versions of Derby:
10.0.1.2: two rows 10.1.1.0: two rows 10.1.2.1: one row 10.2.1.6: one row 10.2.2.0: one row 10.3.1.4: two rows 10.3.2.1: two rows 10.4.1.1 (RC): two rows So, given that one row is the correct result, the bug was fixed at some point, but it regressed back later. The first change (two rows --> one row) happened in this commit:
------------------------------------------------------------------------ r267239 | bandaram | 2005-09-02 20:07:39 +0200 (Fri, 02 Sep 2005) | 6 lines Thanks to Rick for persistant reviews and running the tests. Submitted by Knut Anders Hatlen (Knut.Hatlen@Sun.COM) ------------------------------------------------------------------------ The old behaviour (one row --> two rows) was reintroduced in this commit: ------------------------------------------------------------------------ r516454 | abrown | 2007-03-09 17:37:20 +0100 (Fri, 09 Mar 2007) | 8 lines This patch preserves the having clause through bind and optimize phases and then, during the final rewrite for aggregates in the GroupByNode, it transforms the having clause to a valid restriction. See text file attached to the Jira for more information. Contributed by Manish Khettry (manish_khettry@yahoo.com) ------------------------------------------------------------------------ This page suggests that certain versions of Sybase may have similar behavior to Derby:
http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.help.ase_15.0.sqlug/html/sqlug/sqlug126.htm It seems like Derby could recognize that: select distinct a from t group by a, b is identical to select distinct a from t group by a I can think of 3 possible ways to give Derby this behavior: 1) When processing GROUP by columns, and deciding whether to pull them up into the result column list as generated columns, detect this particular case and don't pull up the generated GROUP BY column "b" into the result list. 2) In the compilation process, prior to generating the distinct scan result set, detect that un-necessary column "b" has been included into the result column list and skip it when generating the data structures for execution. 3) At execution time, when setting up the sorter, recognize that column "b" isn't needed in the sort key, and only sort/collapse the records on column "a". It seems best to detect this situation earlier rather than later, so I'm partial to solutions which can solve the problem at compilation time, not execution time. Attached is 'dontGenerateForDistinct.diff', a patch proposal.
This patch modifies the processing of GROUP BY columns which are "generated" into the select's result column list. Specifically, the patch causes columns to be generated into the select RCL only if the select does *not* specify DISTINCT. If the select specifies DISTINCT, we don't want to include any additional columns into the RCL because we want to be sure that we only perform DISTINCT processing on the columns that were specified by the user. The patch also includes some additional test cases, based on the repro case in the issue description. I ran a complete set of regression tests with the modified code and it passed all the existing tests, as well as the new tests added by this patch. Please have a look and let me know what you think. The approach taken in the proposed patch looks good. Test changes seem to cover the failing scenarios well.
Running a few tests on the patch now, but unless they show any problems I'm +1 to commit. No errors seen in my testing - so +1 for commit.
Thanks for the review and testing Thomas!
I updated the patch with (a) a short comment on the code change, and (b) a few other test cases which I created by looking at some of the other test cases in GroupByTest and adding DISTINCT to them. The revised patch is attached. Committed addCommentAndAnotherTest.diff to the trunk as revision 650728.
I'll investigate merging this patch back to the 10.4 and 10.3 lines. Merged the trunk change to the 10.4 branch as revision 651275.
GroupByTest.java is quite different in 10.3, making it hard to do an automated merge.
Attached file tenThree.diff is the patch I intend to submit to 10.3, if testing goes well. Committed tenThree.diff to the 10.3 branch as revision 651612.
Marking the issue as resolved. Please confirm that the fix is working, and close the issue if the problem is gone. Thanks for helping us find this problem! Does this issue need a release note? With this fix, a family of queries will now return different results.
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
which shows the specific tables, statements, and data you are using,
the results you get, and the results you expect?