Details
Description
According to the Hive Language Manual for UNION ALL:
The number and names of columns returned by each select_statement have to be the same. Otherwise, a schema error is thrown.
Spark SQL silently swallows an error when the tables being joined with UNION ALL have the same number of columns but different names.
Reproducible example:
// This test is meant to run in spark-shell import java.io.File import java.io.PrintWriter import org.apache.spark.sql.hive.HiveContext import org.apache.spark.sql.SaveMode val ctx = sqlContext.asInstanceOf[HiveContext] import ctx.implicits._ def dataPath(name:String) = sys.env("HOME") + "/" + name + ".jsonlines" def tempTable(name: String, json: String) = { val path = dataPath(name) new PrintWriter(path) { write(json); close } ctx.read.json("file://" + path).registerTempTable(name) } // Note category vs. cat names of first column tempTable("test_one", """{"category" : "A", "num" : 5}""") tempTable("test_another", """{"cat" : "A", "num" : 5}""") // +--------+---+ // |category|num| // +--------+---+ // | A| 5| // | A| 5| // +--------+---+ // // Instead, an error should have been generated due to incompatible schema ctx.sql("select * from test_one union all select * from test_another").show // Cleanup new File(dataPath("test_one")).delete() new File(dataPath("test_another")).delete()
When the number of columns is different, Spark can even mix in datatypes.
Reproducible example (requires a new spark-shell session):
// This test is meant to run in spark-shell import java.io.File import java.io.PrintWriter import org.apache.spark.sql.hive.HiveContext import org.apache.spark.sql.SaveMode val ctx = sqlContext.asInstanceOf[HiveContext] import ctx.implicits._ def dataPath(name:String) = sys.env("HOME") + "/" + name + ".jsonlines" def tempTable(name: String, json: String) = { val path = dataPath(name) new PrintWriter(path) { write(json); close } ctx.read.json("file://" + path).registerTempTable(name) } // Note test_another is missing category column tempTable("test_one", """{"category" : "A", "num" : 5}""") tempTable("test_another", """{"num" : 5}""") // +--------+ // |category| // +--------+ // | A| // | 5| // +--------+ // // Instead, an error should have been generated due to incompatible schema ctx.sql("select * from test_one union all select * from test_another").show // Cleanup new File(dataPath("test_one")).delete() new File(dataPath("test_another")).delete()
At other times, when the schema are complex, Spark SQL produces a misleading error about an unresolved Union operator:
scala> ctx.sql("""select * from view_clicks | union all | select * from view_clicks_aug | """) 15/08/11 02:40:25 INFO ParseDriver: Parsing command: select * from view_clicks union all select * from view_clicks_aug 15/08/11 02:40:25 INFO ParseDriver: Parse Completed 15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks 15/08/11 02:40:25 INFO audit: ugi=ubuntu ip=unknown-ip-addr cmd=get_table : db=default tbl=view_clicks 15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks 15/08/11 02:40:25 INFO audit: ugi=ubuntu ip=unknown-ip-addr cmd=get_table : db=default tbl=view_clicks 15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks_aug 15/08/11 02:40:25 INFO audit: ugi=ubuntu ip=unknown-ip-addr cmd=get_table : db=default tbl=view_clicks_aug 15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks_aug 15/08/11 02:40:25 INFO audit: ugi=ubuntu ip=unknown-ip-addr cmd=get_table : db=default tbl=view_clicks_aug org.apache.spark.sql.AnalysisException: unresolved operator 'Union; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:126) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:98) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:97) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:97) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50) at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:42) at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:931) at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131) at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:755)
Attachments
Issue Links
- contains
-
SPARK-12556 Pyspark dataframe unionAll call accepts incorrect input
- Resolved
- is duplicated by
-
SPARK-9874 UnionAll operation on DataFrame doesn't check for column names
- Resolved
- is related to
-
SPARK-15918 unionAll returns wrong result when two dataframes has schema in different order
- Resolved
- relates to
-
SPARK-9293 Analysis should detect when set operations are performed on tables with different numbers of columns
- Resolved
- links to