Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
-
None
-
None
Description
The following query:
SELECT SUM(DUP_COUNT) FROM ( SELECT COUNT(1) As DUP_COUNT FROM DATACLOUD.DATA_ASSESSMENT_RECORD WHERE JOURNEY_ID='07ixx000000004J' AND DATASOURCE=0 AND MATCH_STATUS <= 1 and ORGANIZATION_ID='07ixx000000004J' GROUP BY MATCH_STATUS, EXTERNAL_DATASOURCE_KEY HAVING COUNT(1) > 1);
Should use an ORDERED DISTINCT, but doesn't.
This is the DDL:
CREATE TABLE DATACLOUD.DATA_ASSESSMENT_RECORD (ORGANIZATION_ID char(15) not null, JOURNEY_ID char(15) not null, DATASOURCE SMALLINT not null, MATCH_STATUS TINYINT not null, EXTERNAL_DATASOURCE_KEY varchar(30), ENTITY_ID char(15) not null, CONSTRAINT PK PRIMARY KEY ( ORGANIZATION_ID, JOURNEY_ID, DATASOURCE, MATCH_STATUS, EXTERNAL_DATASOURCE_KEY, ENTITY_ID))
The optimization does occur if you include the PK columns in the GROUP BY like this:
SELECT SUM(DUP_COUNT) FROM ( SELECT COUNT(1) As DUP_COUNT FROM DATACLOUD.DATA_ASSESSMENT_RECORD WHERE JOURNEY_ID='07ixx000000004J' AND DATASOURCE=0 AND MATCH_STATUS <= 1 and ORGANIZATION_ID='07ixx000000004J' GROUP BY ORGANIZATION_ID, JOURNEY_ID, DATASOURCE, MATCH_STATUS, EXTERNAL_DATASOURCE_KEY HAVING COUNT(1) > 1);
Would you mind reviewing, samarthjain? The problem was that we were determining whether or not a GROUP BY was in row key order before compiling the WHERE clause. We instead need to do this in two steps: first compile the GROUP BY without regard to ordering (so we can also optimize the HAVING). Then recompile the GROUP BY to take into account ordering.