[SPARK-22266] The same aggregate function was evaluated multiple times - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.2.0
Fix Version/s: 2.3.0
Component/s: SQL
Labels: None

Description

We should avoid evaluating the same aggregate function more than once, which is the intent stated in the code comment below (patterns.scala:206). However, this did not work as expected.

      // A single aggregate expression might appear multiple times in resultExpressions.
      // In order to avoid evaluating an individual aggregate function multiple times, we'll
      // build a set of the distinct aggregate expressions and build a function which can
      // be used to re-write expressions so that they reference the single copy of the
      // aggregate function which actually gets computed.
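The idea in that comment can be illustrated with a minimal, dependency-free sketch. This is not Spark's code: `AggCall`, `semantic_key`, and `dedupe_aggregates` are hypothetical names, and the sketch assumes, purely for illustration, that two occurrences count as the same aggregate when they match after ignoring a per-occurrence identifier.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class AggCall:
    """Hypothetical stand-in for an aggregate expression such as max(b + 1)."""
    func: str       # e.g. "max"
    arg: str        # e.g. "b + 1"
    result_id: int  # per-occurrence id (an assumption of this sketch)

    def semantic_key(self) -> Tuple[str, str]:
        # Identity used for deduplication: everything except result_id.
        return (self.func, self.arg)


def dedupe_aggregates(
    calls: List[AggCall],
) -> Tuple[List[AggCall], Dict[int, AggCall]]:
    """Return the distinct aggregates and a map from each occurrence's
    result_id to the single copy that should actually be computed."""
    first_seen: Dict[Tuple[str, str], AggCall] = {}
    rewrite: Dict[int, AggCall] = {}
    for call in calls:
        canonical = first_seen.setdefault(call.semantic_key(), call)
        rewrite[call.result_id] = canonical
    return list(first_seen.values()), rewrite


# Two occurrences of max(b + 1), as in the query reported in this issue.
occurrences = [AggCall("max", "b + 1", 1), AggCall("max", "b + 1", 2)]
distinct, rewrite = dedupe_aggregates(occurrences)

# A plain "distinct" that compares every field keeps both occurrences.
naive_distinct = list(dict.fromkeys(occurrences))
```

In this sketch `distinct` holds one entry and both result ids map to it, while `naive_distinct` still holds two. Whether deduplication succeeds depends entirely on which notion of equality is used.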

For example, the physical plan of

SELECT a, max(b+1), max(b+1) + 1 FROM testData2 GROUP BY a

was

HashAggregate(keys=[a#23], functions=[max((b#24 + 1)), max((b#24 + 1))], output=[a#23, max((b + 1))#223, (max((b + 1)) + 1)#224])
+- HashAggregate(keys=[a#23], functions=[partial_max((b#24 + 1)), partial_max((b#24 + 1))], output=[a#23, max#231, max#232])
   +- SerializeFromObject [assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true]).a AS a#23, assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true]).b AS b#24]
      +- Scan ExternalRDDScan[obj#22]

Each HashAggregate node here contained two identical aggregate functions: max((b#24 + 1)) in the final aggregate and partial_max((b#24 + 1)) in the partial one.
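A duplicate like this can be spotted mechanically in the plan text. The helper below is a rough sketch that assumes the `functions=[...], output=[...]` layout shown in the plan above and that function text contains no square brackets. `split_top_level` and `duplicate_functions` are hypothetical names, not part of any Spark API.

```python
import re
from collections import Counter
from typing import List


def split_top_level(text: str) -> List[str]:
    """Split on commas that are not nested inside parentheses."""
    parts: List[str] = []
    current: List[str] = []
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


def duplicate_functions(plan_line: str) -> List[str]:
    """Return aggregate functions listed more than once in one plan node."""
    match = re.search(r"functions=\[(.*?)\], output=", plan_line)
    if match is None:
        return []
    counts = Counter(split_top_level(match.group(1)))
    return [func for func, n in counts.items() if n > 1]


# First line of the plan quoted in this issue.
line = (
    "HashAggregate(keys=[a#23], "
    "functions=[max((b#24 + 1)), max((b#24 + 1))], "
    "output=[a#23, max((b + 1))#223, (max((b + 1)) + 1)#224])"
)
dups = duplicate_functions(line)
```

For the quoted line, `dups` contains the single entry `max((b#24 + 1))`. A plan node listing each function once yields an empty list.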

Issue Links

links to

[Github] Pull Request #19488 (maryannxue)

People

Assignee: Wei Xue

Reporter: Wei Xue

Votes: 0

Watchers: 5

Dates

Created: 12/Oct/17 17:42

Updated: 18/Oct/17 13:14

Resolved: 18/Oct/17 13:13