|
This bug makes unit testing our data warehousing queries essentially impossible without modifying our actual schema. This needs to be bumped from major to critical IMO.
If I read the documentation correctly (Reference Manual) then your SQL statement is simply not allowed in Derby; it says
GROUP BY column-Name [ , column-Name ] * where column-Name is a simple column name or a correlation; in my experiences correlations only work when they refer to a simple column. But IMO you are still right: Derby is missing an essential feature here; instead of the above something like GROUP BY SelectItem [ , SelectItem ] * would be more in line with what other DBs have to offer; it should be possible to use arbitrary select items from the SELECT clause in the GROUP BY clause. This "shortcoming" has made it close to impossible to port a complex Oracle based application to Derby. But since the documentation seems to be clear, I think this "bug" should be reclassified as a "New Feature" or "Improvement"; for me the priority cannot be high enough, it is "Major" at least and of course "Critical" for my application, but I'm not sure this can be generalized. Seems like supporting expressions in the GROUP BY clause is similar to the improvement made to support expressions in the ORDER BY clause that Tomohito did a while ago.
Can we use the knowledge/implementation details from the ORDER BY work to do the GROUP BY improvement? I can look into the fix for 134 and see if the work done there can be used for this one; unless someone else (Tomohito perhaps) wants to look into this first.
Thanks for picking this up, Manish. I also have itch to see this completed soon, so let me know if and how I can help.
I have a few questions on this, largely functional.
1. In terms of syntax, do we allow expressions in the group by list or positional parameters, or both? select tomonth(creationdt), toyear(creationdt), count(*) from bugs group by 1, 2; or group by tomonth(creationdt), toyear(creationdt); An implementation question on this note-- does the language code have a way of looking at two expressions (ValueNode?) and checking to see if they are equivalent? We'll need some way of doing this to match an expression in the group by list to an expression in the select list right? 2. I assume that an expression in a group by list must appear in the select list without aggregation right? Is this an error? select x+1, x+2, sum(y) from test group by x ; 3. What do we do with duplicates? i.e. select x+1, x+1, sum(y) from test group by x+1, x+1; Is this an error? The current implementation throws an error if the same column occurs more than once in the group by list. Is there a standard somewhere which I should consult before trying to nail down the functionality? Manish, it would be great supporting positional parameters. So I vote for both ;-)
Duplicates in group by clause make no sense, is an error. About question 2. Results would not change but would be a shortcut. If you ask me I would tell you that reads better 'group by x'. Remember you can also group fields in where clause and they don't need to appear in the select list.
select sum(quantity) from client group by clientId I got a list of 'quantity' where to perform operations, but don't care which client provides each element. Thanks for making progress on this important new functionality, Manish.
> 1. In terms of syntax, do we allow expressions in the group by list or positional parameters, or both? > > select tomonth(creationdt), toyear(creationdt), count(*) > from bugs > group by 1, 2; I have seen positional parameters for ORDER BY expressions, not typically used in GROUP BY. Looking at both DB2 and Oracle documentation, it seems neither support positional parameters. > An implementation question on this note-- does the language code have a way of looking > at two expressions (ValueNode?) and checking to see if they are equivalent? We'll need > some way of doing this to match an expression in the group by list to an expression > in the select list right? Correct. Don't think there is any existing expression matching to compare two expressions. DB2 docs discuss how group by expressions are matched in SQL reference manual. (Page 484: ftp://ftp.software.ibm.com/ps/products/db2/info/vr82/pdf/en_US/db2s1e81.pdf) > 2. I assume that an expression in a group by list must appear in the select list without > aggregation right? Is this an error? > > select x+1, x+2, sum(y) > from test > group by x NO... This is a valid query. See the reference I provided above. > 3. What do we do with duplicates? i.e. > > select x+1, x+1, sum(y) > from test > group by x+1, x+1; > > Is this an error? The current implementation throws an error if the same column > occurs more than once in the group by list. I am not sure why Derby currently considers this an error... Looking at the code, it seems it may be looking for ambiguous column references (like 'x' being part of two different tables in from_list), which makes sense, but not sure why duplicate references should be prevented. > Is there a standard somewhere which I should consult before trying to nail down the functionality? Unfortunately, NO.... SQL 2003 seems to allow only column references in GROUP BY clause. But both DB2 and Oracle allow expressions in GROUP BY list and likely allowed by other database vendors too. You could use either DB2 or Oracle docs to understand how this functionality is defined there. Much easier to read these docs than confusing SQL 2003 spec. Functionally, this patch supports expressions in the group by list. An expression
in the group by list can also be used in the select list. For example if the grouping expression is c1+c2, then a valid expression in the select list would be c1+c2+1. The expression matching facility is not smart enough to allow c1+1+c2. This patch does not allow the use of grouping expressions in the having clause or order by clause. These are restrictions for now which should be removed eventually. For more information on group by expressions: http://www.pdc.kth.se/doc/SP/manuals/db2-5.0/html/db2s0/db2s0176.htm#HDRGRPBY Another change has been to allow the use of duplicate grouping expressions: i.e select a, sum(b) from test group by a,a; Error messages have been changed. The most noticeable change is the use of the more general sql exception: LANG_INVALID_GROUPED_SELECT_LIST. VerifyAggregateExpressionsVisitor o disallows java nodes. This should take care of functions in java. o the skipChildren method doesn't traverse a subtree that contains any grouping expression. o the error messages are more appropriate-- LANG_INVALID_GROUPED_SELECT_LIST. SelectNode o calls preprocess on the group by list. This is needed because expressions can get rewritten in the select list but not in the grouping list causing problems when the result set is generated. GroupByNode o changes to init() take care of the case where gropuing expressions is not a column reference. o the rewrite logic is a bit different now. Earlier, we would change each unaggregate ColumnReference in the select list and point it to the GroupByColumn. Now we replace each group by expression that we find in the projection list with a virtual column node that effectively points to a result column in the result set doing the group by. In addition the original routine which did the rewrite is now split up into two separate and smaller routines: addUnAggColumns and addAggregateColumns. GroupByColumn o now uses a ValueNode instead of a ColumnReference. Unused methods zapped. GroupByDiff o verifyUniqueGroupingColumn discarded. This was a restriction earlier and it does not make sense. o findGroupingColumn now does the hard work of finding a group by column for a given expression. sqlgrammer o use additiveExpression instead of columnReference. GroupByExpressionTest.java o JUnit test case for this functionality. incorporate feedback from code review.
I found a problem in one of my changes. Also got rid of a lot of white space only diffs.
Review comments incorporated into version 3 and 4 patches can be found here:
http://article.gmane.org/gmane.comp.apache.db.derby.devel/24480 Review comments on the phase 4 patch can be found here: http://article.gmane.org/gmane.comp.apache.db.derby.devel/24708 Hi Manish:
I am also reviewing your ij> create table t1 (c1 int, c2 varchar(10)); 0 rows inserted/updated/deleted ij> insert into t1 values (1, 'data1'), (2, 'data1'), (3, 'data2'); 3 rows inserted/updated/deleted ij> select sum(c1), 1 from t1 group by sum(c1); ERROR XJ001: Java exception: ': java.lang.NullPointerException'. ij> select * from t1; ERROR 40XT0: An internal error was identified by RawStore module. I'll post additional comments after I have review all the changes. This patch addresses issues raised by Army and Yip.
1. Flag error for aggregate in group by list. Changes to sqlgrammar.jj, SQLState.java and messages_en.properties. Other language files need to be changed too. 2. Remove unused SQL state (42Y19). 3. Since the junit file (GroupByExpressionTest.java) is new, use spaces instead of tabs to indent. Fix indentation for long lines. 4. Remove commented out code. 5. Add a few comments suggested by Army. 6. Test ternary operator node#isEquivalent in test case. Darn - this patch doesn't apply cleanly for me, about 8-10 files have conflicts, otherwise I would have committed it.
svn update to revision 430199. Hopefully this patch will apply cleanly.
Minor correction to previous patch.
Dan could you try the last patch file attached with this bug?
Sorry I missed this was available, I will look at committing this.
Manish, can you fix up the patch so it works with the current state of the trunk. BaseJDBCTestCase has moved and the changes to sqlgrammar.jaa and CoalesceFunctionNode do not apply cleanly. If you can get this done this weekend then I will apply it.
This looks like a good thing to get into 10.2,.
I spent some time on this patch and have it merged to trunk - will run tests and then commit.
Synced to rev 434408. Use the new JUnit test case/test setup classes. Passed derbylang.
Were there any function changes between patch 6 and 7?
I'm not sure what you mean by function changes-- one new function has been added to BaseJDBCTestCase (although it was added to the "old" BaseJDBCTestCase as well). The suite method in the GroupByExpressionTest.junit as well as the setUp/tearDown methods have changed slightly, but no new functions anywhere and no methods modified anywhere else.
Sorry I was on the way to the bus when I asked the question so it was brief.
I meant is the functionality provided by patchs 6 and 7 the same, did you fix any bugs etc. between 6 and 7 or was it just adjusting to the codeline changes. I've already started tests with my merge of 6 so I'll keep with that unless there's added benefit to patch 7. no functional changes, so if you've fixed up patch 6 to work, then go with it.
Patch 883.patch6.txt applied to trunk with changes below as revision 437070.
0) Merged by hand to latest codeline, only one real clash due to other added methods on ColumnReference 1) Licence header changed for new files to match new policy 2) Adapt patch for new location of BaseJDBCTestCase 3) Cleanup of new GroupByExpressionTest to use new single connection model provided by BaseJDBCTestCase and not have a single static Connection field. Static fields will cause a problem if multiple tests are run concurrently. 4) Minor cleanup of new assert in BaseJDBCTestCase Thanks Manish, especially for sticking with it. :-) Change was commited to trunk (10.3)
Are there any plans to merge 883.patch6.txt (revision 437070) to 10.2?
I found out that it fixes the NPE with COALESCE ( I tried to merge by applying the diff from 'svn diff -r 437069:437070', and that worked fine. "worked fine" as in merged without errors, or "worked fine" as in derbyall passed on the branch wih the merge changes?
The former; merged without errors.
Was this a general question, or is there reason to believe a merge will cause problems? I'll start a derbyall tonight, and report back tomorrow. derbyall with the patch 6 merged ran 673 tests on Solaris10, Sun JDK 1.5.
1 failure, which I think might be caused by my test enviroment; Upgrade_10_1_10_2.java. Is this issue resolved now?
I think the issue can be closed.
I'm not sure if we want to allow the use of grouping expressions in the having clause; i.e. select c+1, count(*) from t group by c+1 having c+1 > 1; I only have access to mysql, which doesn't allow this but does let you do select c, count(*) from t group by c having c > 1; If we do, then we need another bug or a subtask of this issue. According to the release notes, this is part of 10.2. However, it is not documented in the 10.2 manuals, and the Fix Version for this issue says 10.3. What is correct?
This feature is integrated into 10.2, so the fix version should be 10.2.1.6 and not 10.3.0.0.
Including the Documentation component since it needs to be documented in 10.2 reference manual - GROUP BY clause. http://db.apache.org/derby/docs/10.2/ref/rrefsqlj32654.html Some discussions regarding GROUP BY with expression in dev-list:
http://www.nabble.com/Functions-in-GROUP-BY-expressions--%28related-to-DERBY-883%29-tf2517296.html Marking this resolved. Opened DERBY-2170 to cover the documentation update.
This issue is resolved and has not been updated in the last 12 months (since 24/Jan/07).
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
select status, month(creation), year(creation), count(distinct idissue)
where
....
group by status, month(creation), year(creation)
The fact remains the same, but previous was mistaking.