Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
-
None
Description
Result set returned by Hive has one row instead of 5
select avg(distinct tsint.csint) over () from tsint create table if not exists TSINT (RNUM int , CSINT smallint) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS TEXTFILE; 0|\N 1|-1 2|0 3|1 4|10
Attachments
Attachments
- HIVE-9534.4.patch
- 15 kB
- Aihua Xu
- HIVE-9534.3.patch
- 14 kB
- Aihua Xu
- HIVE-9534.2.patch
- 14 kB
- Aihua Xu
- HIVE-9534.1.patch
- 13 kB
- Aihua Xu
Issue Links
- links to
Activity
While Postgres may not support ISO-SQL syntax as shown other vendors do. That said, if the engine does not attempt to support distinct aggregation then a parsing error should be thrown and the documentation improved at
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
I tried the tests (distinct with window function) in MySQL, PostgreSQL and Oracle with following steps:
create table testwindow (col1 int, col2 int); insert into testwindow values (1, 1); insert into testwindow values (1, 2); insert into testwindow values (1, 3); insert into testwindow values (2, 1); insert into testwindow values (2, 2); insert into testwindow values (3, 3); --- select avg(distinct col1) over() from testwindow;
MySQL: did not work and got the error msg: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '() from testwindow' at line 1
PostgreSQL: did not work and got the error msg: ERROR: DISTINCT is not implemented for window functions Position: 8
Oracle: seemed work but I wonder if it is right, the average is right ( average of 1, 2, 3) with 6 rows:
1 2 2 2 3 2 4 2 5 2 6 2
Hive only returns one row but with correct average: 2
rhbutani I have not looked deeply into code yet, do you think the feature like distinct with window function has not been supported in Hive at this moment, or may it be a bug? Is it related to HIVE-10586?
Actually as of Oracle 11.2 (the version I tested was Oracle Database 11g Express Edition Release 11.2.0.2.0 - 64bit Production), the window function is not supported. Though the query "select avg(distinct col1) over() from testwindow;", where avg acts on all rows, works, query with windowing_clause does actually not work. for example:
Query:
select avg(distinct col1) over(order by col2 rows between 1 preceding and 1 following) from testwindow;
---
Error:
ORA-30487: ORDER BY not allowed here
30487. 00000 - "ORDER BY not allowed here"
*Cause: DISTINCT functions and RATIO_TO_REPORT cannot have an ORDER BY
Based on Oracle document http://docs.oracle.com/cd/E11882_01/server.112/e25554/analysis.htm
Note that the DISTINCT keyword is not supported in windowing functions except for MAX and MIN.
Based on Hive plan for "select avg(distinct col1) over() from testwindow;" it looks like Hive was computing the distinct value of col1:
STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: TableScan alias: testwindow Statistics: Num rows: 6 Data size: 18 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: col1 (type: int) outputColumnNames: col1 Statistics: Num rows: 6 Data size: 18 Basic stats: COMPLETE Column stats: NONE Group By Operator aggregations: avg(DISTINCT col1) keys: col1 (type: int) mode: hash outputColumnNames: _col0, _col1 Statistics: Num rows: 6 Data size: 18 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: int) sort order: + Statistics: Num rows: 6 Data size: 18 Basic stats: COMPLETE Column stats: NONE Reduce Operator Tree: Group By Operator aggregations: avg(DISTINCT KEY._col0:0._col0) mode: mergepartial outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink
Looks like that the over(analytic_clause) part will be ignored in a query with distinct in Hive:
function @init { gParent.pushMsg("function specification", state); } @after { gParent.popMsg(state); } : functionName LPAREN ( (STAR) => (star=STAR) | (dist=KW_DISTINCT)? (selectExpression (COMMA selectExpression)*)? ) RPAREN (KW_OVER ws=window_specification)? -> {$star != null}? ^(TOK_FUNCTIONSTAR functionName $ws?) -> {$dist == null}? ^(TOK_FUNCTION functionName (selectExpression+)? $ws?) -> ^(TOK_FUNCTIONDI functionName (selectExpression+)?) ;
the query like:
select avg(distinct col1) over() from testwindow; or
select avg(distinct col1) over(order by col2 rows between 1 preceding and 1 following) from testwindow;
the over(...) is totally ignored.
So I am going to fix this issue by throwing out unsupported error.
re your comment about ORACLE
select avg(distinct tsint.csint) over () from tsint
null, -1, 0, 1, 10
ORACLE Oracle Database 12c Enterprise Edition ( 12.1.0.2.0) returns 2.5,
2.5, 2.5, 2.5, 2.5
Oracle 11.2 treats avg(distinct tsint.csint) over () as analytic function instead of aggregation function, so the query return 4 rows of returns 2.5. Note, there is not order by clause or window clause inside the parenthesis of "over". Could you try query like "select avg(distinct tsint.csint) over (order by rnum rows between 1 preceding and 1 following) from tsint" to see if it works in Oracle c12? It did not work in 11.2.
IBM Netezza 7.2, IBM Big Insights, SAP Hana SP10, ORACLE 12 etc all return the expected result.
Attach Patch-1: this is to support AVG(DISTINCT) OVER (PARTITION BY) format. Similar to SUM() and COUNT(), it's to skip the same value when DISTINCT is present.
Code review: https://reviews.apache.org/r/43192/. Somehow I can't link to that.
Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12786290/HIVE-9534.2.patch
SUCCESS: +1 due to 1 test(s) being added or modified.
ERROR: -1 due to 18 failed/errored test(s), 10052 tests executed
Failed tests:
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_rp_windowing_2 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ctas_colname org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_lineage3 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ptfgroupbyjoin org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_quotedid_basic org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_subquery_in org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_subquery_notin org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_subquery_unqualcolumnrefs org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_windowing_streaming org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_explainuser_1 org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_subquery_in org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_windowing_gby org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_authorization_uri_import org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_invalid_avg_syntax org.apache.hadoop.hive.cli.TestPerfCliDriver.testPerfCliDriver_query67 org.apache.hadoop.hive.cli.TestPerfCliDriver.testPerfCliDriver_query70 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_subquery_in org.apache.hive.jdbc.TestSSL.testSSLVersion
Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/6883/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/6883/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-6883/
Messages:
Executing org.apache.hive.ptest.execution.TestCheckPhase Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 18 tests failed
This message is automatically generated.
ATTACHMENT ID: 12786290 - PreCommit-HIVE-TRUNK-Build
Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12786846/HIVE-9534.3.patch
SUCCESS: +1 due to 1 test(s) being added or modified.
ERROR: -1 due to 5 failed/errored test(s), 10039 tests executed
Failed tests:
TestSparkCliDriver-timestamp_lazy.q-bucketsortoptimize_insert_4.q-date_udf.q-and-12-more - did not produce a TEST-*.xml file org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_authorization_uri_import org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_invalid_avg_syntax org.apache.hadoop.hive.ql.TestTxnCommands2.testInitiatorWithMultipleFailedCompactions org.apache.hive.jdbc.TestSSL.testSSLVersion
Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/6920/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/6920/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-6920/
Messages:
Executing org.apache.hive.ptest.execution.TestCheckPhase Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 5 tests failed
This message is automatically generated.
ATTACHMENT ID: 12786846 - PreCommit-HIVE-TRUNK-Build
Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12787127/HIVE-9534.4.patch
SUCCESS: +1 due to 1 test(s) being added or modified.
ERROR: -1 due to 4 failed/errored test(s), 9738 tests executed
Failed tests:
TestMiniTezCliDriver-auto_sortmerge_join_13.q-tez_self_join.q-schema_evol_text_nonvec_mapwork_table.q-and-12-more - did not produce a TEST-*.xml file TestMiniTezCliDriver-dynpart_sort_optimization2.q-tez_bmj_schema_evolution.q-vector_char_mapjoin1.q-and-12-more - did not produce a TEST-*.xml file org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_authorization_uri_import org.apache.hive.jdbc.TestSSL.testSSLVersion
Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/6939/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/6939/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-6939/
Messages:
Executing org.apache.hive.ptest.execution.TestCheckPhase Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 4 tests failed
This message is automatically generated.
ATTACHMENT ID: 12787127 - PreCommit-HIVE-TRUNK-Build
It seems that your method need a global sorting first, then group the values to have a right answer. Not As what I thought: each node sort and group its own values, then merge (which need combine arrays). Does your method do global sorting first?
As the Avg(distinct) is calculated after global sort, the fix is a good one.
+1
Thanks Yongzhi. Yeah. The current implementation will do a sort on the partitioning columns so each partition data in the reducer is sorted and we only need to check previous row against the current row to find if it's distinct or not.
Does this need to be backported to 2.0.1? Seems like a serious bug
That could be considered as new feature rather than bug, I guess. Since technically we don't support distinct in windowing function before. Probably we should not backport to 2.0.1.
Doc query: the6campbells said in the second comment on this issue:
While Postgres may not support ISO-SQL syntax as shown other vendors do. That said, if the engine does not attempt to support distinct aggregation then a parsing error should be thrown and the documentation improved at
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
So aihuaxu, what documentation changes are needed now that this issue is resolved?
leftylev I just updated the doc.
I also filed HIVE-13453 to track the improvement to address the current limitation.
Applying a similar SQL in PostgreSQL or Impala returns an error of the form "DISTINCT is not implemented for window functions". Hive, unless it did add proper support for distinct in such a context, should likely output the same error (if not a bug-fix).