[SPARK-20686] PropagateEmptyRelation incorrectly handles aggregate without grouping expressions - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.1.0
Fix Version/s: 2.1.2, 2.2.0
Component/s: Optimizer, SQL
Labels:
- correctness

Description

The query

SELECT 1 FROM (SELECT COUNT(*) WHERE FALSE) t1

should return a single row of output because the subquery is an aggregate without a group-by and thus should return a single row. However, Spark incorrectly returns zero rows.
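The expected semantics can be checked against any SQL engine. The sketch below uses SQLite through Python's standard `sqlite3` module, not Spark, and introduces a hypothetical empty table `t` so that the subquery has a `FROM` clause. It contrasts the aggregate without a group-by against a grouped aggregate over the same empty input.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical empty table, used only for illustration.
conn.execute("CREATE TABLE t (x INTEGER)")

# Aggregate without GROUP BY over empty input: still yields one row.
global_rows = conn.execute(
    "SELECT 1 FROM (SELECT COUNT(*) AS c FROM t WHERE 0) t1"
).fetchall()

# Grouped aggregate over empty input: no groups, hence no rows.
grouped_rows = conn.execute(
    "SELECT COUNT(*) FROM t GROUP BY x"
).fetchall()

print(global_rows)   # [(1,)]
print(grouped_rows)  # []
conn.close()
```

The first result is the single row that the query in this report should also produce in Spark.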

This is caused by SPARK-16208, a patch that added an optimizer rule to propagate EmptyRelation through operators. The rule's handling of aggregates is wrong: it decides whether the output is empty by checking whether the aggregate expressions are non-empty, whereas it should check the grouping expressions instead:

- An aggregate with non-empty grouping expressions returns one output row per group. If the input to the grouped aggregate is empty, then there are no groups and the output is empty. It doesn't matter whether the SELECT statement includes aggregate expressions, since they don't affect the number of output rows.

- If the grouping expressions are empty, however, the aggregate always produces a single output row, so we cannot propagate the EmptyRelation.

The current implementation is incorrect (since it returns a wrong answer) and also misses an optimization opportunity by not propagating EmptyRelation in the case where a grouped aggregate has aggregate expressions.
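As a rough illustration of the two checks, here is a minimal Python model. It is not Spark's Scala code: the `Aggregate` class and both function names are hypothetical, and the "old" check is only one reading of the behaviour described above (emptiness decided from the aggregate expressions).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Aggregate:
    """Hypothetical stand-in for an aggregate plan node (not Spark's class)."""
    grouping_exprs: Tuple[str, ...]
    aggregate_exprs: Tuple[str, ...]
    child_is_empty: bool


def old_rule_returns_empty(agg: Aggregate) -> bool:
    # Check as described in this report: keyed on the aggregate expressions.
    return agg.child_is_empty and len(agg.aggregate_exprs) == 0


def fixed_rule_returns_empty(agg: Aggregate) -> bool:
    # Proposed check: keyed on the grouping expressions.
    return agg.child_is_empty and len(agg.grouping_exprs) > 0


# No grouping expressions: must produce one row, so must not become empty.
global_agg = Aggregate((), (), child_is_empty=True)
# Grouped aggregate with an aggregate expression: safe to replace with empty.
grouped_agg = Aggregate(("k",), ("count(*)",), child_is_empty=True)

for name, agg in [("global", global_agg), ("grouped", grouped_agg)]:
    print(name, old_rule_returns_empty(agg), fixed_rule_returns_empty(agg))
```

In this model the old check wrongly empties the ungrouped aggregate and skips the grouped one, while the grouping-based check does the opposite in both cases.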

Issue Links

relates to

SPARK-16208 Add `PropagateEmptyRelation` optimizer (Resolved)

links to

[Github] Pull Request #17929 (JoshRosen)

People

Assignee: Josh Rosen

Reporter: Josh Rosen

Votes: 0

Watchers: 2

Dates

Created: 10/May/17 00:47

Updated: 11/May/17 19:59

Resolved: 10/May/17 06:42