Issue Details (XML | Word | Printable)

Key: DERBY-883
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Manish Khettry
Reporter: Lluis Turro
Votes: 1
Watchers: 1
Operations

If you were logged in you would be able to see more operations.
Derby

Enhance GROUP BY clause to support expressions instead of just column references.

Created: 28/Jan/06 12:35 AM   Updated: 01/Jul/09 12:34 AM
Return to search
Component/s: Documentation, SQL
Affects Version/s: 10.1.2.1, 10.2.1.6
Fix Version/s: 10.2.1.6

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works 883.patch.txt 2006-07-17 04:53 AM Manish Khettry 79 kB
Text File Licensed for inclusion in ASF works 883.patch3.txt 2006-07-25 10:59 PM Manish Khettry 78 kB
Text File Licensed for inclusion in ASF works 883.patch4.txt 2006-07-26 08:38 PM Manish Khettry 73 kB
Text File Licensed for inclusion in ASF works 883.patch5.txt 2006-07-28 09:26 PM Manish Khettry 81 kB
Text File Licensed for inclusion in ASF works 883.patch6.txt 2006-08-11 12:47 AM Manish Khettry 80 kB
Text File Licensed for inclusion in ASF works 883.patch6.txt 2006-08-11 12:45 AM Manish Khettry 81 kB
Text File Licensed for inclusion in ASF works 883.patch7.txt 2006-08-25 11:37 PM Manish Khettry 81 kB
Environment: JDK 1.5.0_05
Issue Links:
Reference
 

Resolution Date: 13/Dec/06 08:18 PM


 Description  « Hide
This select would return an error syntax on finding "(" after month if group by clause:

select idissue, month(creation), year(creation), count(distinct idissue)
where
  ....
group by idissue, month(creation), year(creation)

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Lluis Turro added a comment - 28/Jan/06 01:25 AM
Correct SQL statement should be:

select status, month(creation), year(creation), count(distinct idissue)
where
  ....
group by status, month(creation), year(creation)

The fact remains the same, but previous was mistaking.

Mike Perham added a comment - 17/Apr/06 10:59 PM
This bug makes unit testing our data warehousing queries essentially impossible without modifying our actual schema. This needs to be bumped from major to critical IMO.

Matthias Ohlemeyer added a comment - 18/Apr/06 04:24 PM
If I read the documentation correctly (Reference Manual) then your SQL statement is simply not allowed in Derby; it says

GROUP BY column-Name [ , column-Name ] *

where column-Name is a simple column name or a correlation; in my experiences correlations only work when they refer to a simple column.

But IMO you are still right: Derby is missing an essential feature here; instead of the above something like

GROUP BY SelectItem [ , SelectItem ] *

would be more in line with what other DBs have to offer; it should be possible to use arbitrary select items from the SELECT clause in the GROUP BY clause. This "shortcoming" has made it close to impossible to port a complex Oracle based application to Derby.

But since the documentation seems to be clear, I think this "bug" should be reclassified as a "New Feature" or "Improvement"; for me the priority cannot be high enough, it is "Major" at least and of course "Critical" for my application, but I'm not sure this can be generalized.

Daniel John Debrunner added a comment - 06/May/06 07:11 AM
Seems like supporting expressions in the GROUP BY clause is similar to the improvement made to support expressions in the ORDER BY clause that Tomohito did a while ago.

Can we use the knowledge/implementation details from the ORDER BY work to do the GROUP BY improvement?

Manish Khettry added a comment - 09/May/06 12:22 AM
I can look into the fix for 134 and see if the work done there can be used for this one; unless someone else (Tomohito perhaps) wants to look into this first.

Satheesh Bandaram added a comment - 10/May/06 01:52 AM
Thanks for picking this up, Manish. I also have itch to see this completed soon, so let me know if and how I can help.

Manish Khettry added a comment - 16/May/06 07:58 AM
I have a few questions on this, largely functional.

1. In terms of syntax, do we allow expressions in the group by list or positional parameters, or both?

select tomonth(creationdt), toyear(creationdt), count(*)
from bugs
group by 1, 2;

or
group by tomonth(creationdt), toyear(creationdt);

An implementation question on this note-- does the language code have a way of looking at two expressions (ValueNode?) and checking to see if they are equivalent? We'll need some way of doing this to match an expression in the group by list to an expression in the select list right?

2. I assume that an expression in a group by list must appear in the select list without aggregation right? Is this an error?

select x+1, x+2, sum(y)
from test
group by x
;

3. What do we do with duplicates? i.e.

select x+1, x+1, sum(y)
from test
group by x+1, x+1;

Is this an error? The current implementation throws an error if the same column occurs more than once in the group by list.

Is there a standard somewhere which I should consult before trying to nail down the functionality?

Lluis Turro added a comment - 16/May/06 03:17 PM
Manish, it would be great supporting positional parameters. So I vote for both ;-)

Duplicates in group by clause make no sense, is an error.

Lluis Turro added a comment - 16/May/06 03:28 PM
About question 2. Results would not change but would be a shortcut. If you ask me I would tell you that reads better 'group by x'. Remember you can also group fields in where clause and they don't need to appear in the select list.

select sum(quantity)
from client
group by clientId

I got a list of 'quantity' where to perform operations, but don't care which client provides each element.


Satheesh Bandaram added a comment - 18/May/06 01:13 AM
Thanks for making progress on this important new functionality, Manish.

> 1. In terms of syntax, do we allow expressions in the group by list or positional parameters, or both?
>
> select tomonth(creationdt), toyear(creationdt), count(*)
> from bugs
> group by 1, 2;

I have seen positional parameters for ORDER BY expressions, not typically used in GROUP BY. Looking at both DB2 and Oracle documentation, it seems neither support positional parameters.

> An implementation question on this note-- does the language code have a way of looking
> at two expressions (ValueNode?) and checking to see if they are equivalent? We'll need
> some way of doing this to match an expression in the group by list to an expression
> in the select list right?

Correct. Don't think there is any existing expression matching to compare two expressions. DB2 docs discuss how group by expressions are matched in SQL reference manual. (Page 484: ftp://ftp.software.ibm.com/ps/products/db2/info/vr82/pdf/en_US/db2s1e81.pdf)

> 2. I assume that an expression in a group by list must appear in the select list without
> aggregation right? Is this an error?
>
> select x+1, x+2, sum(y)
> from test
> group by x

NO... This is a valid query. See the reference I provided above.

> 3. What do we do with duplicates? i.e.
>
> select x+1, x+1, sum(y)
> from test
> group by x+1, x+1;
>
> Is this an error? The current implementation throws an error if the same column
> occurs more than once in the group by list.

I am not sure why Derby currently considers this an error... Looking at the code, it seems it may be looking for ambiguous column references (like 'x' being part of two different tables in from_list), which makes sense, but not sure why duplicate references should be prevented.

> Is there a standard somewhere which I should consult before trying to nail down the functionality?

Unfortunately, NO.... SQL 2003 seems to allow only column references in GROUP BY clause. But both DB2 and Oracle allow expressions in GROUP BY list and likely allowed by other database vendors too. You could use either DB2 or Oracle docs to understand how this functionality is defined there. Much easier to read these docs than confusing SQL 2003 spec.

Manish Khettry added a comment - 17/Jul/06 04:52 AM
Functionally, this patch supports expressions in the group by list. An expression
in the group by list can also be used in the select list. For example if the grouping
expression is c1+c2, then a valid expression in the select list would be
c1+c2+1. The expression matching facility is not smart enough to allow c1+1+c2.

This patch does not allow the use of grouping expressions in the having clause
or order by clause. These are restrictions for now which should be removed
eventually. For more information on group by expressions:

http://www.pdc.kth.se/doc/SP/manuals/db2-5.0/html/db2s0/db2s0176.htm#HDRGRPBY

Another change has been to allow the use of duplicate grouping expressions: i.e
select a, sum(b) from test group by a,a;

Error messages have been changed. The most noticeable change is the use
of the more general sql exception: LANG_INVALID_GROUPED_SELECT_LIST.

VerifyAggregateExpressionsVisitor

o disallows java nodes. This should take care of functions in java.
o the skipChildren method doesn't traverse a subtree that contains
  any grouping expression.
o the error messages are more appropriate-- LANG_INVALID_GROUPED_SELECT_LIST.

SelectNode

o calls preprocess on the group by list. This is needed because expressions
can get rewritten in the select list but not in the grouping list causing
problems when the result set is generated.

GroupByNode

o changes to init() take care of the case where gropuing expressions is
not a column reference.

o the rewrite logic is a bit different now. Earlier, we would change
each unaggregate ColumnReference in the select list and point it to the
GroupByColumn. Now we replace each group by expression that we find
in the projection list with a virtual column node that effectively points
to a result column in the result set doing the group by. In addition
the original routine which did the rewrite is now split up into two separate
and smaller routines: addUnAggColumns and addAggregateColumns.

GroupByColumn

o now uses a ValueNode instead of a ColumnReference. Unused methods zapped.

GroupByDiff

o verifyUniqueGroupingColumn discarded. This was a restriction earlier and
it does not make sense.

o findGroupingColumn now does the hard work of finding a group by column
for a given expression.

sqlgrammer

o use additiveExpression instead of columnReference.

GroupByExpressionTest.java

o JUnit test case for this functionality.






Manish Khettry added a comment - 25/Jul/06 10:59 PM
incorporate feedback from code review.

Manish Khettry added a comment - 26/Jul/06 08:38 PM
I found a problem in one of my changes. Also got rid of a lot of white space only diffs.

A B added a comment - 26/Jul/06 11:49 PM
Review comments incorporated into version 3 and 4 patches can be found here:

  http://article.gmane.org/gmane.comp.apache.db.derby.devel/24480

Review comments on the phase 4 patch can be found here:

  http://article.gmane.org/gmane.comp.apache.db.derby.devel/24708

Yip Ng added a comment - 28/Jul/06 06:28 AM
Hi Manish:

   I am also reviewing your DERBY-883 patch. Since the group by clause has been relaxed to allow expressions now, we'll need to return a proper SQLSTATE for invalid aggregate functions use in group by clause. i.e.: The following test is throwing a NPE:
 
ij> create table t1 (c1 int, c2 varchar(10));
0 rows inserted/updated/deleted
ij> insert into t1 values (1, 'data1'), (2, 'data1'), (3, 'data2');
3 rows inserted/updated/deleted
ij> select sum(c1), 1 from t1 group by sum(c1);
ERROR XJ001: Java exception: ': java.lang.NullPointerException'.
ij> select * from t1;
ERROR 40XT0: An internal error was identified by RawStore module.

   I'll post additional comments after I have review all the changes.

Manish Khettry added a comment - 28/Jul/06 09:26 PM
This patch addresses issues raised by Army and Yip.

1. Flag error for aggregate in group by list. Changes to sqlgrammar.jj, SQLState.java and messages_en.properties. Other language files need to be changed too.

2. Remove unused SQL state (42Y19).

3. Since the junit file (GroupByExpressionTest.java) is new, use spaces instead of tabs to indent. Fix indentation for long lines.

4. Remove commented out code.

5. Add a few comments suggested by Army.

6. Test ternary operator node#isEquivalent in test case.


Yip Ng added a comment - 09/Aug/06 10:37 PM
Thanks for continuing to improve the patch, Manish. The changes look good to me and I think this patch is ready for commit. +1

Daniel John Debrunner added a comment - 09/Aug/06 10:54 PM
Darn - this patch doesn't apply cleanly for me, about 8-10 files have conflicts, otherwise I would have committed it.

Manish Khettry added a comment - 11/Aug/06 12:45 AM
svn update to revision 430199. Hopefully this patch will apply cleanly.


Manish Khettry added a comment - 11/Aug/06 12:47 AM
Minor correction to previous patch.

Manish Khettry added a comment - 11/Aug/06 12:48 AM
Dan could you try the last patch file attached with this bug?

Daniel John Debrunner added a comment - 17/Aug/06 07:02 PM
Sorry I missed this was available, I will look at committing this.

Daniel John Debrunner added a comment - 19/Aug/06 07:07 PM
Manish, can you fix up the patch so it works with the current state of the trunk. BaseJDBCTestCase has moved and the changes to sqlgrammar.jaa and CoalesceFunctionNode do not apply cleanly. If you can get this done this weekend then I will apply it.

Mike Matrigali added a comment - 25/Aug/06 10:19 PM
This looks like a good thing to get into 10.2,.

Daniel John Debrunner added a comment - 25/Aug/06 11:27 PM
I spent some time on this patch and have it merged to trunk - will run tests and then commit.

Manish Khettry added a comment - 25/Aug/06 11:37 PM
Synced to rev 434408. Use the new JUnit test case/test setup classes. Passed derbylang.

Daniel John Debrunner added a comment - 25/Aug/06 11:41 PM
Were there any function changes between patch 6 and 7?

Manish Khettry added a comment - 25/Aug/06 11:56 PM
I'm not sure what you mean by function changes-- one new function has been added to BaseJDBCTestCase (although it was added to the "old" BaseJDBCTestCase as well). The suite method in the GroupByExpressionTest.junit as well as the setUp/tearDown methods have changed slightly, but no new functions anywhere and no methods modified anywhere else.


Daniel John Debrunner added a comment - 26/Aug/06 12:49 AM
Sorry I was on the way to the bus when I asked the question so it was brief.
I meant is the functionality provided by patchs 6 and 7 the same, did you fix any bugs etc. between 6 and 7 or was it just adjusting to the codeline changes. I've already started tests with my merge of 6 so I'll keep with that unless there's added benefit to patch 7.

Manish Khettry added a comment - 26/Aug/06 01:35 AM
no functional changes, so if you've fixed up patch 6 to work, then go with it.

Daniel John Debrunner added a comment - 26/Aug/06 03:56 AM
Patch 883.patch6.txt applied to trunk with changes below as revision 437070.


0) Merged by hand to latest codeline, only one real clash due to other added methods on ColumnReference
1) Licence header changed for new files to match new policy
2) Adapt patch for new location of BaseJDBCTestCase
3) Cleanup of new GroupByExpressionTest to use new single connection model provided by BaseJDBCTestCase and not have a single static Connection field. Static fields will cause a problem if multiple tests are run concurrently.
4) Minor cleanup of new assert in BaseJDBCTestCase


Thanks Manish, especially for sticking with it. :-)

Daniel John Debrunner added a comment - 28/Aug/06 07:45 PM
Change was commited to trunk (10.3)

Kristian Waagan added a comment - 30/Aug/06 02:48 PM
Are there any plans to merge 883.patch6.txt (revision 437070) to 10.2?
I found out that it fixes the NPE with COALESCE (DERBY-1774). I don't know why.

I tried to merge by applying the diff from 'svn diff -r 437069:437070', and that worked fine.

Daniel John Debrunner added a comment - 30/Aug/06 03:02 PM
"worked fine" as in merged without errors, or "worked fine" as in derbyall passed on the branch wih the merge changes?

Kristian Waagan added a comment - 30/Aug/06 04:27 PM
The former; merged without errors.
Was this a general question, or is there reason to believe a merge will cause problems?
I'll start a derbyall tonight, and report back tomorrow.

Kristian Waagan added a comment - 31/Aug/06 09:59 PM
derbyall with the patch 6 merged ran 673 tests on Solaris10, Sun JDK 1.5.
1 failure, which I think might be caused by my test enviroment; Upgrade_10_1_10_2.java.

Daniel John Debrunner added a comment - 01/Sep/06 11:45 PM
Is this issue resolved now?

Manish Khettry added a comment - 02/Sep/06 12:15 AM
I think the issue can be closed.

I'm not sure if we want to allow the use of grouping expressions in the having clause; i.e.

select c+1, count(*)
from t
group by c+1
having c+1 > 1;

I only have access to mysql, which doesn't allow this but does let you do

select c, count(*)
from t
group by c
having c > 1;

If we do, then we need another bug or a subtask of this issue.

Øystein Grøvlen added a comment - 26/Oct/06 12:25 PM
According to the release notes, this is part of 10.2. However, it is not documented in the 10.2 manuals, and the Fix Version for this issue says 10.3. What is correct?

Yip Ng added a comment - 31/Oct/06 05:43 PM
This feature is integrated into 10.2, so the fix version should be 10.2.1.6 and not 10.3.0.0.

Including the Documentation component since it needs to be documented in 10.2 reference manual - GROUP BY clause.

http://db.apache.org/derby/docs/10.2/ref/rrefsqlj32654.html


Yip Ng added a comment - 31/Oct/06 06:30 PM
Some discussions regarding GROUP BY with expression in dev-list:

http://www.nabble.com/Functions-in-GROUP-BY-expressions--%28related-to-DERBY-883%29-tf2517296.html

Andrew McIntyre added a comment - 13/Dec/06 08:18 PM
Marking this resolved. Opened DERBY-2170 to cover the documentation update.

Dyre Tjeldvoll added a comment - 24/Jan/08 01:10 PM
This issue is resolved and has not been updated in the last 12 months (since 24/Jan/07).