SPARK-9813: Incorrect UNION ALL behavior


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.1
    • Fix Version/s: 1.5.0
    • Component/s: Spark Core, SQL
    • Environment: Ubuntu on AWS

    Description

      According to the Hive Language Manual for UNION ALL:

      The number and names of columns returned by each select_statement have to be the same. Otherwise, a schema error is thrown.

      Spark SQL silently accepts such a UNION ALL when the tables being combined have the same number of columns but different column names; no schema error is raised.

      Reproducible example:

      // This test is meant to run in spark-shell
      import java.io.File
      import java.io.PrintWriter
      import org.apache.spark.sql.hive.HiveContext
      
      val ctx = sqlContext.asInstanceOf[HiveContext]
      import ctx.implicits._
      
      def dataPath(name: String) = sys.env("HOME") + "/" + name + ".jsonlines"
      
      // Write the JSON to a local file and register it as a temp table
      def tempTable(name: String, json: String) = {
        val path = dataPath(name)
        new PrintWriter(path) { write(json); close() }
        ctx.read.json("file://" + path).registerTempTable(name)
      }
      
      // Note the first column is named category in one table and cat in the other
      tempTable("test_one", """{"category" : "A", "num" : 5}""")
      tempTable("test_another", """{"cat" : "A", "num" : 5}""")
      
      //  Actual output: the union silently succeeds, taking the column name
      //  from the first table:
      //  +--------+---+
      //  |category|num|
      //  +--------+---+
      //  |       A|  5|
      //  |       A|  5|
      //  +--------+---+
      //
      //  Instead, an error should have been generated due to the incompatible schemas
      ctx.sql("select * from test_one union all select * from test_another").show
      
      // Cleanup
      new File(dataPath("test_one")).delete()
      new File(dataPath("test_another")).delete()
      

      When the number of columns differs, Spark can even mix values of different datatypes into a single column.

      Reproducible example (requires a new spark-shell session):

      // This test is meant to run in spark-shell
      import java.io.File
      import java.io.PrintWriter
      import org.apache.spark.sql.hive.HiveContext
      
      val ctx = sqlContext.asInstanceOf[HiveContext]
      import ctx.implicits._
      
      def dataPath(name: String) = sys.env("HOME") + "/" + name + ".jsonlines"
      
      // Write the JSON to a local file and register it as a temp table
      def tempTable(name: String, json: String) = {
        val path = dataPath(name)
        new PrintWriter(path) { write(json); close() }
        ctx.read.json("file://" + path).registerTempTable(name)
      }
      
      // Note that test_another is missing the category column
      tempTable("test_one", """{"category" : "A", "num" : 5}""")
      tempTable("test_another", """{"num" : 5}""")
      
      //  Actual output: the num value from test_another lands in the
      //  category column:
      //  +--------+
      //  |category|
      //  +--------+
      //  |       A|
      //  |       5|
      //  +--------+
      //
      //  Instead, an error should have been generated due to the incompatible schemas
      ctx.sql("select * from test_one union all select * from test_another").show
      
      // Cleanup
      new File(dataPath("test_one")).delete()
      new File(dataPath("test_another")).delete()
      

      At other times, when the schemas are complex, Spark SQL produces a misleading error about an unresolved Union operator:

      scala> ctx.sql("""select * from view_clicks
           | union all
           | select * from view_clicks_aug
           | """)
      15/08/11 02:40:25 INFO ParseDriver: Parsing command: select * from view_clicks
      union all
      select * from view_clicks_aug
      15/08/11 02:40:25 INFO ParseDriver: Parse Completed
      15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks
      15/08/11 02:40:25 INFO audit: ugi=ubuntu	ip=unknown-ip-addr	cmd=get_table : db=default tbl=view_clicks
      15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks
      15/08/11 02:40:25 INFO audit: ugi=ubuntu	ip=unknown-ip-addr	cmd=get_table : db=default tbl=view_clicks
      15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks_aug
      15/08/11 02:40:25 INFO audit: ugi=ubuntu	ip=unknown-ip-addr	cmd=get_table : db=default tbl=view_clicks_aug
      15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks_aug
      15/08/11 02:40:25 INFO audit: ugi=ubuntu	ip=unknown-ip-addr	cmd=get_table : db=default tbl=view_clicks_aug
      org.apache.spark.sql.AnalysisException: unresolved operator 'Union;
      	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
      	at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:42)
      	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:126)
      	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:98)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
      	at scala.collection.immutable.List.foreach(List.scala:318)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:97)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
      	at scala.collection.immutable.List.foreach(List.scala:318)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:97)
      	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
      	at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:42)
      	at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:931)
      	at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131)
      	at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
      	at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:755)
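
      When this happens, printing each side's schema (a sketch, assuming both tables
      are reachable from the same context) usually reveals the name or type mismatch
      that left the Union unresolved:

      // Sketch: inspect both schemas to find the mismatch behind the opaque error
      ctx.table("view_clicks").printSchema()
      ctx.table("view_clicks_aug").printSchema()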


People

    • Assignee: Josh Rosen (joshrosen)
    • Reporter: Simeon Simeonov (simeons)
    • Votes: 5
    • Watchers: 7
