Description
SPARK-18394 introduced a stable ordering in AttributeSet.toSeq using expression IDs (PR-18959), without noticing that AggregateExpression.references used AttributeSet.toSeq as a shortcut. As a result, AggregateExpression.references fails for unresolved aggregate functions, because an unresolved attribute has no expression ID to sort by. For example,
org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression(
org.apache.spark.sql.catalyst.expressions.aggregate.Sum(('x + 'y).expr),
mode = org.apache.spark.sql.catalyst.expressions.aggregate.Complete,
isDistinct = false
).references
fails with
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to exprId on unresolved object, tree: 'y
  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.exprId(unresolved.scala:104)
  at org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128)
  at org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128)
  at scala.math.Ordering$$anon$5.compare(Ordering.scala:122)
  at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
  at java.util.TimSort.sort(TimSort.java:220)
  at java.util.Arrays.sort(Arrays.java:1438)
  at scala.collection.SeqLike$class.sorted(SeqLike.scala:648)
  at scala.collection.AbstractSeq.sorted(Seq.scala:41)
  at scala.collection.SeqLike$class.sortBy(SeqLike.scala:623)
  at scala.collection.AbstractSeq.sortBy(Seq.scala:41)
  at org.apache.spark.sql.catalyst.expressions.AttributeSet.toSeq(AttributeSet.scala:128)
  at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.references(interfaces.scala:201)
The solution is to avoid calling toSeq, since ordering is not important in references, and to simplify (and speed up) the implementation to something like
mode match {
  case Partial | Complete => aggregateFunction.references
  case PartialMerge | Final => AttributeSet(aggregateFunction.aggBufferAttributes)
}
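The failure mode can be reproduced without Spark on the classpath. The sketch below is a toy model only: Attr, Resolved, Unresolved, orderedRefs and unorderedRefs are hypothetical names invented for illustration, not Catalyst classes or methods. It assumes nothing beyond what the stack trace shows, namely that sorting by expression ID calls exprId while comparing elements, and that an unresolved attribute throws when asked for its ID.

```scala
// Toy stand-ins for Catalyst types; all names here are hypothetical, not Spark's API.
sealed trait Attr { def name: String; def exprId: Long }

final case class Resolved(name: String, exprId: Long) extends Attr

final case class Unresolved(name: String) extends Attr {
  // Mirrors the behaviour seen in the stack trace: no ID exists before resolution.
  def exprId: Long =
    throw new UnsupportedOperationException(s"Invalid call to exprId on unresolved object: '$name")
}

object ReferencesSketch {
  // Ordered view: the sort calls exprId whenever it compares two elements.
  def orderedRefs(attrs: Set[Attr]): Seq[Attr] = attrs.toSeq.sortBy(_.exprId)

  // Unordered view: never touches exprId, so unresolved attributes are harmless.
  def unorderedRefs(attrs: Set[Attr]): Set[Attr] = attrs

  def main(args: Array[String]): Unit = {
    val attrs: Set[Attr] = Set(Unresolved("x"), Unresolved("y"))
    val sortFailed = scala.util.Try(orderedRefs(attrs)).isFailure
    println(s"sorting failed: $sortFailed")                  // prints "sorting failed: true"
    println(s"unordered size: ${unorderedRefs(attrs).size}") // prints "unordered size: 2"
  }
}
```

In this model the unordered path succeeds for the same input that breaks the sorted path, which is the reason for dropping toSeq from references: nothing downstream needs the order, so the comparison that triggers the exception can be skipped entirely.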