Spark / SPARK-6444

SQL functions (either built-in or UDF) should check for data types of their arguments


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.2, 1.1.1, 1.2.1, 1.3.0
    • Fix Version/s: 1.5.0
    • Component/s: SQL
    • Labels: None

    Description

      SQL functions should remain unresolved if their arguments don't satisfy their argument type requirements. Take Sum as an example: the data type of Sum(Literal("1")) is StringType, yet the expression is currently considered resolved, which can cause problems downstream.
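
      The intended behavior can be illustrated with a minimal, hypothetical sketch (not Spark's actual Catalyst API; all names below are illustrative): an expression only reports itself as resolved once its children satisfy its input type requirements, so Sum over a StringType literal stays unresolved instead of silently typing as StringType.

```scala
// Hypothetical, simplified model of the fix this ticket asks for.
// Names and types are illustrative, not Spark's actual API.

sealed trait DataType
case object StringType extends DataType
case object DoubleType extends DataType

case class Literal(value: Any, dataType: DataType) {
  val resolved: Boolean = true
}

// Sum expects a numeric (here: DoubleType) input; a non-numeric child
// leaves the whole expression unresolved, forcing the analyzer to either
// insert a cast or report an error instead of producing a bogus plan.
case class Sum(child: Literal) {
  val dataType: DataType = DoubleType
  val resolved: Boolean = child.resolved && child.dataType == DoubleType
}

object Demo extends App {
  println(Sum(Literal("1", StringType)).resolved)  // false
  println(Sum(Literal(1.0, DoubleType)).resolved)  // true
}
```

      With such a check in place, the analyzer would notice the unresolved Sum and either coerce the argument explicitly or fail analysis, rather than letting a string-typed aggregate leak into later rules.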

      Here is a simplified version of a problematic query reported by cenyuhai. Spark shell session for reproducing this issue:

      import sqlContext._
      
      sql("""
          CREATE TABLE IF NOT EXISTS ut (
              c1 STRING,
              c2 STRING
          )
          """)
      
      sql("""
          SELECT SUM(c3) FROM (
              SELECT SUM(c1) AS c3, 0 AS c4 FROM ut     -- (1)
              UNION ALL
              SELECT 0 AS c3, COUNT(c2) AS c4 FROM ut   -- (2)
          ) t
          """).queryExecution.optimizedPlan
      

      Exception thrown:

      java.util.NoSuchElementException: key not found: c3#10
              at scala.collection.MapLike$class.default(MapLike.scala:228)
              at org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:29)
              at scala.collection.MapLike$class.apply(MapLike.scala:141)
              at org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:29)
              at org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$1.applyOrElse(Optimizer.scala:80)
              at org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$1.applyOrElse(Optimizer.scala:79)
              at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:187)
              at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:187)
              at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
              at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:186)
              at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:177)
              at org.apache.spark.sql.catalyst.optimizer.UnionPushdown$.pushToRight(Optimizer.scala:79)
              at org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$apply$1$$anonfun$applyOrElse$6.apply(Optimizer.scala:101)
              at org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$apply$1$$anonfun$applyOrElse$6.apply(Optimizer.scala:101)
              ...
      

      The analyzed plan of the query is:

      == Analyzed Logical Plan ==
      !Aggregate [], [SUM(CAST(c3#153, DoubleType)) AS _c0#157]                   (c)
       Union
        Project [CAST(c3#153, StringType) AS c3#164,c4#163L]                      (d)
         Project [c3#153,CAST(c4#154, LongType) AS c4#163L]
          Aggregate [], [SUM(CAST(c1#158, DoubleType)) AS c3#153,0 AS c4#154]     (b)
           MetastoreRelation default, ut, None
        Project [CAST(c3#155, StringType) AS c3#162,c4#156L]                      (a)
         Aggregate [], [0 AS c3#155,COUNT(c2#161) AS c4#156L]
          MetastoreRelation default, ut, None
      

      This case is quite interesting. It involves two analysis rules, WidenTypes and PromoteStrings, and one optimizer rule, UnionPushdown. To see the details, turn on TRACE-level logging and inspect the detailed rule execution process. The TL;DR is:

      1. Since c1 is STRING, SUM(c1) is also STRING (which is the root cause of the whole issue).
      2. c3 in (1) is STRING, while the one in (2) is INT. Thus WidenTypes casts the latter to STRING to ensure both sides of the UNION have the same schema. See (a).
      3. PromoteStrings casts c1 in SUM(c1) to DOUBLE, which consequently changes data type of SUM(c1) and c3 to DOUBLE. See (b).
      4. c3 in the top level Aggregate is resolved as DOUBLE (c)
      5. Since schemas of the two sides of the UNION are different again, WidenTypes casts SUM(c1) AS c3 to STRING. See (d).
      6. In the top level Aggregate, c3#153 becomes a missing input attribute because it is now hidden by (d).
      7. In the optimizing phase, UnionPushdown throws because the top level Aggregate has a missing input attribute.
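
      The casting ping-pong in steps 2-5 can be traced with plain Scala values; the widening and promotion rules below are simplified assumptions standing in for Spark's actual coercion tables, and the type names are just strings:

```scala
// Toy trace of the coercion back-and-forth described above.
// The rules are simplified assumptions, not Spark's real coercion logic.

object CoercionTrace extends App {
  // WidenTypes (simplified): a UNION of STRING with anything widens to STRING.
  def widen(left: String, right: String): String =
    if (left == right) left else "string"

  // Step 1: c1 is STRING, so SUM(c1) is also typed STRING.
  var leftC3  = "string"
  // The other side of the UNION contributes the literal 0 AS c3.
  var rightC3 = "int"

  // Step 2: WidenTypes aligns the UNION schema; both sides become STRING.
  rightC3 = widen(leftC3, rightC3)

  // Step 3: PromoteStrings rewrites SUM(CAST(c1 AS DOUBLE)), silently
  // changing the left-hand c3 to DOUBLE behind WidenTypes' back.
  leftC3 = "double"

  // Step 5: the schemas disagree again, so WidenTypes fires once more and
  // re-aliases the left side to STRING under a fresh attribute id, which
  // is what hides the original c3#153 from the top level Aggregate.
  leftC3 = widen(leftC3, rightC3)

  println(s"left c3: $leftC3, right c3: $rightC3")
}
```

      The trace makes the root cause visible: because SUM over STRING was considered resolved in step 1, two rules kept "fixing" each other's output until the attribute ids no longer lined up.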


            People

              Assignee: cloud_fan Wenchen Fan
              Reporter: lian cheng Cheng Lian
              Votes: 0
              Watchers: 4
