|
I notice 3 interesting things when I run your query. First, here is the output I get:
ij> select b, count(*) from yy where a=5 or a=2 group by b order by count(*) asc; B |2 |3 ---------------------------------------------- 3.0 |4 |4 4.0 |1 |1 7.0 |1 |1 Note that there are *three* columns in the result set, not two. Since there were only two columns in your select list, that seems wrong to me. Secondly, the output is ordered by column B, not by column count(*), so that seems wrong to me, too. Lastly, I took a look at the query plan using -Dderby.language.logQueryPlan=true, and I've included it below. Note that there is only *one* sort executed in this query plan, although it seems that there should have been two sorts: - one to order the rows by column "b" in order to perform the grouping. - one to take the grouped results and re-sort them by the count(*) column It seems that the optimizer didn't understand that the desired ordering of the results was different than the implied ordering of the GROUP BY column. Here is the query plan: 2007-12-02 16:29:03.707 GMT Thread[main,5,main] (XID = 183), (SESSIONID = 0), se lect b, count(*) from yy where a=5 or a=2 group by b order by count(*) asc ***** ** Project-Restrict ResultSet (5): Number of opens = 1 Rows seen = 3 Rows filtered = 0 restriction = false projection = true constructor time (milliseconds) = 0 open time (milliseconds) = 0 next time (milliseconds) = 0 close time (milliseconds) = 0 restriction time (milliseconds) = 0 projection time (milliseconds) = 0 optimizer estimated row count: 1.00 optimizer estimated cost: 30.90 Source result set: Grouped Aggregate ResultSet: Number of opens = 1 Rows input = 6 Has distinct aggregate = false In sorted order = false Sort information: Number of rows input=6 Number of rows output=3 Sort type=internal constructor time (milliseconds) = 0 open time (milliseconds) = 0 next time (milliseconds) = 0 close time (milliseconds) = 0 optimizer estimated row count: 1.20 optimizer estimated cost: 30.90 Source result set: Project-Restrict ResultSet (4): Number of opens = 1 Rows seen = 6 Rows filtered = 0 restriction = false projection = true constructor time (milliseconds) = 0 open time (milliseconds) = 0 next time (milliseconds) = 0 close time (milliseconds) = 0 restriction time (milliseconds) = 0 projection time (milliseconds) = 0 optimizer estimated row count: 1.20 optimizer estimated cost: 30.90 Source result set: Project-Restrict ResultSet (3): Number of opens = 1 Rows seen = 7 Rows filtered = 1 restriction = true projection = false constructor time (milliseconds) = 0 open time (milliseconds) = 0 next time (milliseconds) = 0 close time (milliseconds) = 0 restriction time (milliseconds) = 0 projection time (milliseconds) = 0 optimizer estimated row count: 1.20 optimizer estimated cost: 30.90 Source result set: Table Scan ResultSet for YY at read committed isolation level us ing instantaneous share row locking chosen by the optimizer Number of opens = 1 Rows seen = 7 Rows filtered = 0 Fetch Size = 16 constructor time (milliseconds) = 0 open time (milliseconds) = 0 next time (milliseconds) = 0 close time (milliseconds) = 0 next time in milliseconds/row = 0 scan information: Bit set of columns fetched=All Number of columns fetched=2 Number of pages visited=1 Number of rows qualified=7 Number of rows visited=7 Scan type=heap start position: null stop position: null qualifiers: None optimizer estimated row count: 1.20 optimizer estimated cost: 30.90 Hi Bryan,
sorry that I did not made myself clear in the first post, but as you have pointed out, the sorting of the count(*) column is not correct, which is the main problem here. The ASC and DESC is ignored. The phantom column which appears as the last row is not an issue, but might become in an other application. The wrong execution plan is propably caused by the wrong sorting issue. Expected results: select b, COUNT(*) AS "COUNT_OF" from yy where a = 5 or a = 2 group by b order by COUNT(*) ASC B |COUNT_OF | 3 ---------------------------------------------- 4.0 |1 |1 7.0 |1 |1 3.0 |4 |4 select b, COUNT(*) AS "COUNT_OF" from yy where a = 5 or a = 2 group by b order by COUNT(*) DESC B |COUNT_OF | 3 ---------------------------------------------- 3.0 |4 |4 4.0 |1 |1 7.0 |1 |1 Looks like a regression caused by
http://svn.apache.org/viewvc?view=rev&revision=516454 Interestingly if you leave out the where clause, the query works just fine.
ij> select b, count(*) from yy group by b order by count(*) asc; B |2 ---------------------------------- 4.0 |1 7.0 |2 3.0 |4 A.B, did this query work before svn revision 516454 and fail after this commit?
The problem was missing methods isConstant and isConstantExpression for AggregateNodes. The super class implementation would return true causing Derby to eliminate the order by.
isConstantExpression is used to eliminate sorts in queries like this: select c1, c2 from t where c1=2 order by c1; In this case, we have an aggregate as the sort column and this sort of optimization will apply to queries like this: select c1, count(*) from t group by c1 having count(*) = 2 order by count(*); To remove constant sort columns the code in SelectNode#preProcess (around line 915) will have to be modified to consider having clauses as well. For now, I return false for both methods-- it is more important to get correct results for this query first. > The problem was missing methods isConstant and isConstantExpression for
> AggregateNodes. The super class implementation would return true causing > Derby to eliminate the order by. Thank you for looking into this, Manish! Two follow-up questions based on what you have found, just for my own understanding: 1) If possible, can you explain what part of the 2) Any idea why removing the "WHERE" clause from the original query causes it to return correct results, even without the order_by_bug.diff patch, as stated in your December 4th comment? I admit I haven't done any tracing myself, but from your previous comment I can't quite see the connection between the WHERE clause and the ORDER BY, since the WHERE clause does not reference the aggregate "count(*)". Apologies if I'm missing something obvious... Thanks again for your willingness to investigate this further. Have you had a chance to run any of the regression tests with your patch applied? I ran the junit lang suite.
1. Prior to 681, the query would get rewritten with a subselect so the aggregate node in the order by would be transformed to a column reference. 2. The optimization is only done if whereClause != null (SelectNode around line 910 or so) Thank you for the reply, Manish. I'm running the full regression tests with this patch to make sure that everything is okay; I'll report back with the results later.
One thing I did notice is that this change makes it so that ALL aggregate nodes now return "false" for constantExpression(PredicateList). Is that perhaps going too far? For example, in the queries: select b, max(a) from yy where a=5 group by b order by max(a) asc; select b, max(a) from yy where a=5 group by b order by max(a) desc; the value "MAX(A)" is in fact constant, and will be recognized as such prior to your patch, thus avoiding the need to actually sort the results for "ORDER BY". But with the patch applied we'll end up sorting the results unnecessarily. I definitely agree with your statement that "it is more important to get correct results ... first", but I wonder if it might be worth it to try to preserve the optimization when possible? I can't imagine there are many people out there who would issue an ORDER BY on a constant MAX aggregate, so maybe this isn't worth addressing; but executing the above two queries before and after your patch does show that, with the patch, we are doing an extra unnecessary sort. So if it's not too difficult, I think it would be nice to preserve the "constant" nature of aggregates like MAX and MIN... Any thoughts one way or the other? Well, I thought it was better to fix the wrong results and look at optimizations later on as a separate bug. I pointed out another kind of query in an earlier comment where we could do this "sort avoidance" thing.
A slightly more abstract question-- how important is it for Derby to do optimizations like this? The downsides are code complexity and unanticipated bugs like this one while the benefits are restricted to queries, which as you put it, "you can't imagine too many people writing"? Hi Manish, thanks very much for working on this problem.
Did you happen to investigate the "phantom column" behavior that I saw in IJ? As Peter noted, the extra column did not appear to cause any serious problems, but I'm wondering why it appeared. I agree with your assessment that the highest priority is for the query to return the right results. But I also think it's hard to predict what sort of odd queries the DBMS will see; particularly in this age of persistence layers and code generators, the strangest sort of SQL seems to get generated and thrown at the system. Given that the optimizer's job is to optimize, I think that we should pay attention to every situation where there is a possible optimization. So, as a possible proposal, could we: - include the various "unusual" queries that you and Army came up with as test cases in the regression test suite, so that at least we continue to increase the overall population of "interesting" queries in our test scripts, and - file a separate JIRA issue (or perhaps several) noting the various cases of potential further optimization, so that we can pursue these later as time permits? I did not look into the "phantom column" behavior; just getting rid of the faulty logic which assumed the sort column was a constant expression did the trick.
+1 on your proposal. manish> I thought it was better to fix the wrong results and look at optimizations later on as a separate bug.
bryanp> I agree with your assessment that the highest priority is for the query to bryanp> return the right results. Okay, so it looks like there is agreement here :) I just wanted to bring up the fact that, in getting one query to return correct results, we're effectively *disabling* an *existing* optimization for another query that currently works correctly. To say "get it working, then improve it" is perfectly okay; to say "get it working but regress other optimizations, then fix those other optimizations later" is also probably okay, but perhaps slightly more risky as a general principle. In any event, I'm +1 to Bryan's proposal given the apparent unlikelihood of the to-be-disabled optimization (ex. for MAX()) showing up in a user's environment. manish> +1 on your proposal. Manish, are you volunteering to implement the proposal, i.e. to 1) add more test cases to the regression suite, and 2) file another Jira? My derbyall and suites.All with the patch ran cleanly (only failure was the known failure of outerjoin.sql, which is unrelated and has since been fixed). So I can commit order_by_bug.diff now and the additional test cases can perhaps be added as a separate patch/commit. Does that sound okay? I added a single test case to the original patch to also check that a sort in ascending order works (the original patch only checked desc), so I'm attaching the patch with that change as d3231_v2.patch.
Committed with svn # 603954: http://svn.apache.org/viewvc?rev=603954&view=rev Thanks again to Manish for picking this one up and offering a solution in such quick time. manish have you had a chance to look at
thanks manish, i have been out on vacation. Let me know if there is anything I can do to help on this issue. Don't really know much about the code but am happy to run/build test cases and try against different versions of the code.
Can this issue be marked resolved?
I think I'll port this to 10.3 since it is a regression fix. Please let me know if you see any problem with porting it to 10.3.
Marking this issue resolved. Please reopen if there is still more work to do. It looked fixed as far as I could tell. I ported the fix to 10.3 with revison 615042.
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
the columns of those rows? The rows are in the wrong order? What results
did you expect to get, and how do they differ from the actual results?