[SPARK-9813] Incorrect UNION ALL behavior - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.4.1
Fix Version/s: 1.5.0
Component/s: Spark Core, SQL
Labels:
- sql
- union
Environment:

Ubuntu on AWS

Target Version/s:

1.5.1, 1.6.0

Description

According to the Hive Language Manual for UNION ALL:

The number and names of columns returned by each select_statement have to be the same. Otherwise, a schema error is thrown.

Spark SQL silently swallows an error when the tables being joined with UNION ALL have the same number of columns but different names.

Reproducible example:

// This test is meant to run in spark-shell
import java.io.File
import java.io.PrintWriter
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SaveMode

val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._

def dataPath(name:String) = sys.env("HOME") + "/" + name + ".jsonlines"

def tempTable(name: String, json: String) = {
  val path = dataPath(name)
  new PrintWriter(path) { write(json); close }
  ctx.read.json("file://" + path).registerTempTable(name)
}

// Note category vs. cat names of first column
tempTable("test_one", """{"category" : "A", "num" : 5}""")
tempTable("test_another", """{"cat" : "A", "num" : 5}""")

//  +--------+---+
//  |category|num|
//  +--------+---+
//  |       A|  5|
//  |       A|  5|
//  +--------+---+
//
//  Instead, an error should have been generated due to incompatible schema
ctx.sql("select * from test_one union all select * from test_another").show

// Cleanup
new File(dataPath("test_one")).delete()
new File(dataPath("test_another")).delete()

When the number of columns is different, Spark can even mix in datatypes.

Reproducible example (requires a new spark-shell session):

// This test is meant to run in spark-shell
import java.io.File
import java.io.PrintWriter
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SaveMode

val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._

def dataPath(name:String) = sys.env("HOME") + "/" + name + ".jsonlines"

def tempTable(name: String, json: String) = {
  val path = dataPath(name)
  new PrintWriter(path) { write(json); close }
  ctx.read.json("file://" + path).registerTempTable(name)
}

// Note test_another is missing category column
tempTable("test_one", """{"category" : "A", "num" : 5}""")
tempTable("test_another", """{"num" : 5}""")

//  +--------+
//  |category|
//  +--------+
//  |       A|
//  |       5| 
//  +--------+
//
//  Instead, an error should have been generated due to incompatible schema
ctx.sql("select * from test_one union all select * from test_another").show

// Cleanup
new File(dataPath("test_one")).delete()
new File(dataPath("test_another")).delete()

At other times, when the schema are complex, Spark SQL produces a misleading error about an unresolved Union operator:

scala> ctx.sql("""select * from view_clicks
     | union all
     | select * from view_clicks_aug
     | """)
15/08/11 02:40:25 INFO ParseDriver: Parsing command: select * from view_clicks
union all
select * from view_clicks_aug
15/08/11 02:40:25 INFO ParseDriver: Parse Completed
15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks
15/08/11 02:40:25 INFO audit: ugi=ubuntu	ip=unknown-ip-addr	cmd=get_table : db=default tbl=view_clicks
15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks
15/08/11 02:40:25 INFO audit: ugi=ubuntu	ip=unknown-ip-addr	cmd=get_table : db=default tbl=view_clicks
15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks_aug
15/08/11 02:40:25 INFO audit: ugi=ubuntu	ip=unknown-ip-addr	cmd=get_table : db=default tbl=view_clicks_aug
15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks_aug
15/08/11 02:40:25 INFO audit: ugi=ubuntu	ip=unknown-ip-addr	cmd=get_table : db=default tbl=view_clicks_aug
org.apache.spark.sql.AnalysisException: unresolved operator 'Union;
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:42)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:126)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:98)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:97)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:97)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:42)
	at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:931)
	at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131)
	at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
	at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:755)

Attachments

Issue Links

contains

SPARK-12556 Pyspark dataframe unionAll call accepts incorrect input

Resolved

is duplicated by

SPARK-9874 UnionAll operation on DataFrame doesn't check for column names

Resolved

is related to

SPARK-15918 unionAll returns wrong result when two dataframes has schema in different order

Resolved

relates to

SPARK-9293 Analysis should detect when set operations are performed on tables with different numbers of columns

Resolved

links to

[Github] Pull Request #7631 (JoshRosen)

Incorrect UNION ALL behavior

Details

Description

Attachments

Issue Links

Activity

People

Dates