Spark / SPARK-6444

SQL functions (either built-in or UDF) should check for data types of their arguments


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.2, 1.1.1, 1.2.1, 1.3.0
    • Fix Version/s: 1.5.0
    • Component/s: SQL
    • Labels: None

    Description

      SQL functions should remain unresolved if their arguments don't satisfy their argument type requirements. Take Sum as an example: the data type of Sum(Literal("1")) is StringType, yet the expression is currently considered resolved, which can cause problems downstream.
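
      The intended behavior can be illustrated with a minimal, hypothetical sketch (not Spark's actual Catalyst API; all names below are illustrative): an expression only reports itself as resolved once its children satisfy its input type requirements, so Sum over a StringType literal stays unresolved instead of silently typing as StringType.

```scala
// Hypothetical, simplified model of the fix this ticket asks for.
// Names and types are illustrative, not Spark's actual API.

sealed trait DataType
case object StringType extends DataType
case object DoubleType extends DataType

case class Literal(value: Any, dataType: DataType) {
  val resolved: Boolean = true
}

// Sum expects a numeric (here: DoubleType) input; a non-numeric child
// leaves the whole expression unresolved, forcing the analyzer to either
// insert a cast or report an error instead of producing a bogus plan.
case class Sum(child: Literal) {
  val dataType: DataType = DoubleType
  val resolved: Boolean = child.resolved && child.dataType == DoubleType
}

object Demo extends App {
  println(Sum(Literal("1", StringType)).resolved)  // false
  println(Sum(Literal(1.0, DoubleType)).resolved)  // true
}
```

      With such a check in place, the analyzer would notice the unresolved Sum and either coerce the argument explicitly or fail analysis, rather than letting a string-typed aggregate leak into later rules.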

      Here is a simplified version of a problematic query reported by cenyuhai. Spark shell session for reproducing this issue:

      import sqlContext._
      
      sql("""
          CREATE TABLE IF NOT EXISTS ut (
              c1 STRING,
              c2 STRING
          )
          """)
      
      sql("""
          SELECT SUM(c3) FROM (
              SELECT SUM(c1) AS c3, 0 AS c4 FROM ut     -- (1)
              UNION ALL
              SELECT 0 AS c3, COUNT(c2) AS c4 FROM ut   -- (2)
          ) t
          """).queryExecution.optimizedPlan
      

      Exception thrown:

      java.util.NoSuchElementException: key not found: c3#10
              at scala.collection.MapLike$class.default(MapLike.scala:228)
              at org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:29)
              at scala.collection.MapLike$class.apply(MapLike.scala:141)
              at org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:29)
              at org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$1.applyOrElse(Optimizer.scala:80)
              at org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$1.applyOrElse(Optimizer.scala:79)
              at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:187)
              at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:187)
              at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
              at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:186)
              at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:177)
              at org.apache.spark.sql.catalyst.optimizer.UnionPushdown$.pushToRight(Optimizer.scala:79)
              at org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$apply$1$$anonfun$applyOrElse$6.apply(Optimizer.scala:101)
              at org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$apply$1$$anonfun$applyOrElse$6.apply(Optimizer.scala:101)
              ...
      

      The analyzed plan of the query is:

      == Analyzed Logical Plan ==
      !Aggregate [], [SUM(CAST(c3#153, DoubleType)) AS _c0#157]                   (c)
       Union
        Project [CAST(c3#153, StringType) AS c3#164,c4#163L]                      (d)
         Project [c3#153,CAST(c4#154, LongType) AS c4#163L]
          Aggregate [], [SUM(CAST(c1#158, DoubleType)) AS c3#153,0 AS c4#154]     (b)
           MetastoreRelation default, ut, None
        Project [CAST(c3#155, StringType) AS c3#162,c4#156L]                      (a)
         Aggregate [], [0 AS c3#155,COUNT(c2#161) AS c4#156L]
          MetastoreRelation default, ut, None
      

      This case is quite interesting. It involves two analysis rules, WidenTypes and PromoteStrings, and one optimizer rule, UnionPushdown. To see the details, turn on TRACE-level logging and inspect the detailed rule execution process. The TL;DR is:

      1. Since c1 is STRING, SUM(c1) is also STRING (which is the root cause of the whole issue).
      2. c3 in (1) is STRING, while the one in (2) is INT. Thus WidenTypes casts the latter to STRING to ensure both sides of the UNION have the same schema. See (a).
      3. PromoteStrings casts c1 in SUM(c1) to DOUBLE, which consequently changes data type of SUM(c1) and c3 to DOUBLE. See (b).
      4. c3 in the top level Aggregate is resolved as DOUBLE (c)
      5. Since schemas of the two sides of the UNION are different again, WidenTypes casts SUM(c1) AS c3 to STRING. See (d).
      6. In the top level Aggregate, c3#153 becomes a missing input attribute because it is now hidden by (d).
      7. In the optimizing phase, UnionPushdown throws because the top level Aggregate has a missing input attribute.
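
      The casting ping-pong in steps 2-5 can be traced with plain Scala values; the widening and promotion rules below are simplified assumptions standing in for Spark's actual coercion tables, and the type names are just strings:

```scala
// Toy trace of the coercion back-and-forth described above.
// The rules are simplified assumptions, not Spark's real coercion logic.

object CoercionTrace extends App {
  // WidenTypes (simplified): a UNION of STRING with anything widens to STRING.
  def widen(left: String, right: String): String =
    if (left == right) left else "string"

  // Step 1: c1 is STRING, so SUM(c1) is also typed STRING.
  var leftC3  = "string"
  // The other side of the UNION contributes the literal 0 AS c3.
  var rightC3 = "int"

  // Step 2: WidenTypes aligns the UNION schema; both sides become STRING.
  rightC3 = widen(leftC3, rightC3)

  // Step 3: PromoteStrings rewrites SUM(CAST(c1 AS DOUBLE)), silently
  // changing the left-hand c3 to DOUBLE behind WidenTypes' back.
  leftC3 = "double"

  // Step 5: the schemas disagree again, so WidenTypes fires once more and
  // re-aliases the left side to STRING under a fresh attribute id, which
  // is what hides the original c3#153 from the top level Aggregate.
  leftC3 = widen(leftC3, rightC3)

  println(s"left c3: $leftC3, right c3: $rightC3")
}
```

      The trace makes the root cause visible: because SUM over STRING was considered resolved in step 1, two rules kept "fixing" each other's output until the attribute ids no longer lined up.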


            People

              Assignee: cloud_fan Wenchen Fan
              Reporter: lian cheng Cheng Lian
              Votes: 0
              Watchers: 4
