Details
Description
SQL functions should remain unresolved if their arguments don't satisfy their argument type requirements. Take Sum as an example: the data type of Sum(Literal("1")) is StringType, yet the expression is currently considered resolved, which can cause problems downstream.
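The intended behavior can be modeled with a tiny expression tree. This is a hypothetical sketch, not Catalyst's actual classes (real Catalyst aggregates accept any numeric input, not only DoubleType): an aggregate should only report resolved = true once its child's data type is acceptable.

```scala
// Minimal model of the intended resolution rule (hypothetical types,
// not actual Catalyst classes). Sum over a non-numeric child should
// stay unresolved instead of silently adopting the child's type.
sealed trait DataType
case object StringType extends DataType
case object DoubleType extends DataType

sealed trait Expression {
  def dataType: DataType
  def resolved: Boolean
}

case class Literal(value: Any, dataType: DataType) extends Expression {
  val resolved = true
}

case class Sum(child: Expression) extends Expression {
  def dataType: DataType = child.dataType
  // The fix, simplified: require a numeric input before claiming
  // to be resolved, so analysis rules keep working on this node.
  def resolved: Boolean = child.resolved && child.dataType == DoubleType
}

val bad = Sum(Literal("1", StringType))
val ok  = Sum(Literal(1.0, DoubleType))
println(bad.resolved) // false: type requirement not met
println(ok.resolved)  // true
```

With this check in place, the analyzer would see Sum(Literal("1")) as unresolved and could report a type mismatch instead of letting the string type propagate.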
Here is a simplified version of a problematic query reported by cenyuhai, as a Spark shell session that reproduces the issue:
import sqlContext._
sql("""
CREATE TABLE IF NOT EXISTS ut (
c1 STRING,
c2 STRING
)
""")
sql("""
SELECT SUM(c3) FROM (
SELECT SUM(c1) AS c3, 0 AS c4 FROM ut -- (1)
UNION ALL
SELECT 0 AS c3, COUNT(c2) AS c4 FROM ut -- (2)
) t
""").queryExecution.optimizedPlan
Exception thrown:
java.util.NoSuchElementException: key not found: c3#10
  at scala.collection.MapLike$class.default(MapLike.scala:228)
  at org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:29)
  at scala.collection.MapLike$class.apply(MapLike.scala:141)
  at org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:29)
  at org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$1.applyOrElse(Optimizer.scala:80)
  at org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$1.applyOrElse(Optimizer.scala:79)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:187)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:187)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:186)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:177)
  at org.apache.spark.sql.catalyst.optimizer.UnionPushdown$.pushToRight(Optimizer.scala:79)
  at org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$apply$1$$anonfun$applyOrElse$6.apply(Optimizer.scala:101)
  at org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$apply$1$$anonfun$applyOrElse$6.apply(Optimizer.scala:101)
  ...
The analyzed plan of the query is:
== Analyzed Logical Plan ==
!Aggregate [], [SUM(CAST(c3#153, DoubleType)) AS _c0#157]                  (c)
 Union
  Project [CAST(c3#153, StringType) AS c3#164,c4#163L]                     (d)
   Project [c3#153,CAST(c4#154, LongType) AS c4#163L]
    Aggregate [], [SUM(CAST(c1#158, DoubleType)) AS c3#153,0 AS c4#154]    (b)
     MetastoreRelation default, ut, None
  Project [CAST(c3#155, StringType) AS c3#162,c4#156L]                     (a)
   Aggregate [], [0 AS c3#155,COUNT(c2#161) AS c4#156L]
    MetastoreRelation default, ut, None
This case is very interesting. It involves 2 analysis rules, WidenTypes and PromoteStrings, and 1 optimizer rule, UnionPushdown. To see the details, we can turn on TRACE-level logging and inspect the rule execution process step by step. The TL;DR is:
- Since c1 is STRING, SUM(c1) is also STRING (which is the root cause of the whole issue).
- c3 in (1) is STRING, while the one in (2) is INT. Thus WidenTypes casts the latter to STRING to ensure both sides of the UNION have the same schema. See (a).
- PromoteStrings casts c1 in SUM(c1) to DOUBLE, which consequently changes data type of SUM(c1) and c3 to DOUBLE. See (b).
- c3 in the top-level Aggregate is resolved as DOUBLE. See (c).
- Since the schemas of the two sides of the UNION now differ again, WidenTypes casts SUM(c1) AS c3 to STRING. See (d).
- In the top-level Aggregate, c3#153 becomes a missing input attribute because it is now hidden by (d).
- In the optimization phase, UnionPushdown throws because the top-level Aggregate references a missing input attribute.
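The final failure mode can be sketched without Spark. UnionPushdown builds an attribute-to-attribute map from the union's outputs and rewrites the parent's expressions by plain map lookup, so a reference to an attribute the union no longer exposes (here c3#153, hidden by cast (d)) fails with the NoSuchElementException above. The attribute model below is a simplified stand-in for Catalyst's AttributeMap, with hypothetical expression IDs:

```scala
// Sketch of the unguarded lookup that produces the failure
// (simplified stand-in for Catalyst's AttributeMap).
case class Attribute(name: String, exprId: Int)

// After WidenTypes inserted the extra cast, the union's output is
// c3#164, so the rewrite map only knows that attribute.
val rewrites: Map[Attribute, Attribute] = Map(
  Attribute("c3", 164) -> Attribute("c3", 153)
)

// The top-level Aggregate still references the hidden attribute c3#153.
val stale = Attribute("c3", 153)

// Map.apply on a missing key throws java.util.NoSuchElementException,
// matching the reported stack trace.
try rewrites(stale)
catch { case e: NoSuchElementException => println(e.getMessage) }
```

A guarded lookup (or a CheckAnalysis pass that rejects plans with missing input attributes, as SPARK-6452 suggests) would surface the problem earlier with a clearer error.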
Issue Links
- Is contained by: SPARK-7562 Improve error reporting for expression data type mismatch (Resolved)
- Relates to: SPARK-6452 CheckAnalysis should throw when the Aggregate node contains missing input attribute(s) (Resolved)